

37.6 Parsing Text in Multiple Languages

Sometimes the source of a program contains text written in other languages; HTML with embedded CSS and JavaScript is one example. In that case, we need to assign individual parsers to the text segments written in each language. Traditionally this is achieved by narrowing. While tree-sitter works with narrowing (see narrowing), the recommended way is to set the ranges in which each parser operates.

Function: tree-sitter-parser-set-included-ranges parser ranges

This function sets the ranges of parser to ranges. After that, parser only reads the text covered by those ranges. ranges is a list of cons cells of the form (beg . end).

The ranges in ranges must come in order and must not overlap. That is, the following form must return non-nil:

(require 'cl-lib)

(cl-loop for idx from 1 to (1- (length ranges))
         for prev = (nth (1- idx) ranges)
         for next = (nth idx ranges)
         ;; Each range must end at or before the start of the next one.
         always (<= (car prev) (cdr prev)
                    (car next) (cdr next)))

If ranges violates this constraint, or something else goes wrong, this function signals a tree-sitter-range-invalid error. The error data contains a specific error message and the ranges that we tried to set.

This function can also be used for disabling ranges. If ranges is nil, the parser is set to parse the whole buffer.

Example:

(tree-sitter-parser-set-included-ranges
 parser '((1 . 9) (16 . 24) (24 . 25)))
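
Ranges gathered from several places may arrive unsorted or overlapping, which would violate the constraint above. The following is a minimal sketch of one way to tidy them first. my-normalize-ranges is a hypothetical helper, not part of the tree-sitter API; it assumes that touching ranges such as (16 . 24) and (24 . 25) should stay separate, since the constraint allows them.

```elisp
(defun my-normalize-ranges (ranges)
  "Return RANGES sorted by start position, with overlapping ranges merged.
RANGES is a list of cons cells (BEG . END).  Hypothetical helper,
not part of the tree-sitter API."
  (let ((sorted (sort (copy-sequence ranges)
                      (lambda (a b) (< (car a) (car b)))))
        result)
    (dolist (range sorted)
      (let ((prev (car result)))
        (if (and prev (< (car range) (cdr prev)))
            ;; RANGE starts inside the previous range: extend that range.
            (setcdr prev (max (cdr prev) (cdr range)))
          ;; Push a fresh cell so the caller's conses are never modified.
          (push (cons (car range) (cdr range)) result))))
    (nreverse result)))

(my-normalize-ranges '((16 . 24) (1 . 9) (20 . 30)))
    ;; ⇒ ((1 . 9) (16 . 30))
```

The result could then be passed as the ranges argument of tree-sitter-parser-set-included-ranges.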

Function: tree-sitter-parser-included-ranges parser

This function returns the ranges set for parser. The return value has the same form as the ranges argument of tree-sitter-parser-set-included-ranges: a list of cons cells (beg . end). If parser doesn’t have any ranges, the return value is nil.

(tree-sitter-parser-included-ranges parser)
    ⇒ ((1 . 9) (16 . 24) (24 . 25))

Function: tree-sitter-set-ranges parser-or-lang ranges

Like tree-sitter-parser-set-included-ranges, this function sets the ranges of parser-or-lang to ranges. For convenience, parser-or-lang can be either a parser or a language symbol. If it is a language symbol, this function looks for the first parser for that language in tree-sitter-parser-list in the current buffer, and sets the ranges for that parser.

Function: tree-sitter-get-ranges parser-or-lang

This function returns the ranges of parser-or-lang, like tree-sitter-parser-included-ranges. And like tree-sitter-set-ranges, parser-or-lang can be a parser or a language symbol.

Function: tree-sitter-query-range source pattern &optional beg end

This function matches source with pattern and returns the ranges of the captured nodes. The return value has the same shape as that of the other range functions: a list of cons cells (beg . end).

For convenience, source can be a language symbol, a parser, or a node. If a language symbol, this function matches in the root node of the first parser using that language; if a parser, this function matches in the root node of that parser; if a node, this function matches in that node.

Parameter pattern is the query pattern used to capture nodes (see Pattern Matching Tree-sitter Nodes). The capture names don’t matter. Parameters beg and end, if both non-nil, limit the range in which this function queries.

Like other query functions, this function signals a tree-sitter-query-error error if pattern is malformed.

Function: tree-sitter-language-at point

This function tries to figure out which language is responsible for the text at point. It goes over each parser in tree-sitter-parser-list and checks whether that parser’s ranges cover point.
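
The lookup described above can be sketched in plain Emacs Lisp. This is only an illustration of the idea, not the actual implementation: my-language-at and its parser-ranges argument (an alist of (language . ranges)) are hypothetical, and the sketch assumes that a range covers positions from beg up to but not including end, and that a parser with no ranges covers the whole buffer.

```elisp
(defun my-language-at (pos parser-ranges)
  "Return the first language in PARSER-RANGES whose ranges cover POS.
PARSER-RANGES is a hypothetical alist of (LANGUAGE . RANGES), where
RANGES is a list of (BEG . END), or nil meaning the whole buffer.
Return nil if no entry covers POS."
  (catch 'found
    (dolist (entry parser-ranges)
      (let ((language (car entry))
            (ranges (cdr entry)))
        (if (null ranges)
            ;; No ranges set: this parser covers every position.
            (throw 'found language)
          (dolist (range ranges)
            ;; Assumption: END is exclusive.
            (when (and (<= (car range) pos) (< pos (cdr range)))
              (throw 'found language))))))
    nil))

(my-language-at 20 '((tree-sitter-css (40 . 60))
                     (tree-sitter-javascript (18 . 23))
                     (tree-sitter-html)))
    ;; ⇒ tree-sitter-javascript
```

In this sketch the order of the alist matters: the host language, which has no ranges, must come last, otherwise it would win for every position.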

Variable: tree-sitter-range-functions

An alist of (language . function). Font-locking and indenting code uses functions in this alist to set correct ranges for a language parser before using it.

language is a language symbol, and function is a function that sets the ranges for language. Its signature should be

(start end &rest _)

where start and end mark the region that is about to be used. function only needs to update the ranges in that region, but it may update more.
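
As an illustration, a range function for CSS embedded in HTML might look like the sketch below. It only uses functions described in this section; my-css-update-ranges is a hypothetical name, and the language symbols tree-sitter-css and tree-sitter-html are assumed to be the ones used to create the parsers. For simplicity the sketch ignores start and end and recomputes the ranges for the whole buffer.

```elisp
;; Declared here only so that the sketch loads on its own; the real
;; variable is provided by the tree-sitter library.
(defvar tree-sitter-range-functions nil)

(defun my-css-update-ranges (_start _end &rest _)
  "Set the ranges of the CSS parser from the HTML parse tree.
Hypothetical range function.  It ignores START and END and queries
the whole buffer, since a range function may update more than the
requested region."
  (tree-sitter-set-ranges
   'tree-sitter-css
   (tree-sitter-query-range
    'tree-sitter-html
    "(style_element (raw_text) @capture)")))

;; Register the function for the CSS language.
(setf (alist-get 'tree-sitter-css tree-sitter-range-functions)
      #'my-css-update-ranges)
```

One caveat with this simple approach: if the query captures nothing, the ranges passed on are nil, which presumably makes the CSS parser read the whole buffer, so a real range function may need to handle that case separately.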

Function: tree-sitter-update-ranges &optional start end

This function is used by font-lock and indentation code to update ranges before using any parser. Each range function in tree-sitter-range-functions is called in order, and the arguments start and end are passed to each of them.

An example

Normally, in a set of languages that can be mixed together, there is a major language and several embedded languages. The major language parses the whole document, and skips the embedded languages. Then the parser for the major language knows the ranges of the embedded languages. So we first parse the whole document with the major language’s parser, set ranges for the embedded languages, then parse the embedded languages.

Suppose we want to parse a very simple document that mixes HTML, CSS and JavaScript:

<html>
  <script>1 + 2</script>
  <style>body { color: "blue"; }</style>
</html>

We first parse with HTML, then set ranges for CSS and JavaScript:

;; Create parsers.
(setq html (tree-sitter-get-parser-create 'tree-sitter-html))
(setq css (tree-sitter-get-parser-create 'tree-sitter-css))
(setq js (tree-sitter-get-parser-create 'tree-sitter-javascript))

;; Set CSS ranges.
(setq css-range
      (tree-sitter-query-range
       'tree-sitter-html
       "(style_element (raw_text) @capture)"))
(tree-sitter-parser-set-included-ranges css css-range)

;; Set JavaScript ranges.
(setq js-range
      (tree-sitter-query-range
       'tree-sitter-html
       "(script_element (raw_text) @capture)"))
(tree-sitter-parser-set-included-ranges js js-range)

We use a query pattern (style_element (raw_text) @capture) to find CSS nodes in the HTML parse tree. For how to write query patterns, see Pattern Matching Tree-sitter Nodes.
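
To check that a list of ranges covers the intended text, it can help to look at the buffer text behind each range. The helper below is a small sketch for that purpose; my-range-texts is a hypothetical name, and the range (18 . 23) in the usage example was computed by hand for a shortened version of the document above, not returned by a parser.

```elisp
(defun my-range-texts (ranges)
  "Return the buffer text covered by each (BEG . END) in RANGES."
  (mapcar (lambda (range)
            (buffer-substring-no-properties (car range) (cdr range)))
          ranges))

(with-temp-buffer
  (insert "<html>\n  <script>1 + 2</script>\n</html>\n")
  ;; (18 . 23) is a hand-computed range covering the script body;
  ;; in practice you would pass the value of js-range instead.
  (my-range-texts '((18 . 23))))
    ;; ⇒ ("1 + 2")
```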

