37.6 Parsing Text in Multiple Languages

Sometimes source code written in one programming language contains snippets of other languages; an HTML file with embedded CSS and JavaScript is one example. In that case, text segments written in different languages need to be assigned different parsers. Traditionally, this is achieved by using narrowing. While tree-sitter works with narrowing (see narrowing), the recommended way is instead to set regions of buffer text (i.e., ranges) in which a parser will operate. This section describes functions for setting and getting ranges for a parser.

Lisp programs should call treesit-update-ranges to make sure the ranges for each parser are correct before using parsers in a buffer, and call treesit-language-at to figure out the language responsible for the text at some position. These two functions don’t work by themselves; they rely on major modes to set treesit-range-settings and treesit-language-at-point-function, which do the actual work. These functions and variables are explained in more detail towards the end of the section.

Getting and setting ranges

Function: treesit-parser-set-included-ranges parser ranges

This function sets up parser to operate on ranges. The parser will only read the text of the specified ranges. Each range in ranges is a cons cell of the form (beg . end).

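The ranges in ranges must come in order and must not overlap. That is, the following form (which uses cl-loop from cl-lib) must return non-nil:

;; Every adjacent pair of ranges must be ordered and non-overlapping.
(cl-loop for idx from 1 to (1- (length ranges))
         for prev = (nth (1- idx) ranges)
         for next = (nth idx ranges)
         always (<= (car prev) (cdr prev)
                    (car next) (cdr next)))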

If ranges violates this constraint, or something else went wrong, this function signals the treesit-range-invalid error. The signal data contains a specific error message and the ranges we are trying to set.
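A caller that computes ranges itself may prefer to handle this error rather than let it propagate. The following is a minimal sketch of one way to do that; my-set-ranges-or-report is a hypothetical helper name, not part of Emacs.

```elisp
(require 'treesit)

(defun my-set-ranges-or-report (parser ranges)
  "Set RANGES on PARSER; return t on success, nil if RANGES is invalid.
Hypothetical helper for illustration only."
  (condition-case err
      (progn
        (treesit-parser-set-included-ranges parser ranges)
        t)
    (treesit-range-invalid
     ;; The signal data carries an error message and the ranges
     ;; that were rejected.
     (message "Could not set ranges: %S" (cdr err))
     nil)))
```

Called with ranges that are out of order, such as ((16 . 24) (1 . 9)), this helper is expected to log the signal data and return nil instead of signaling.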

This function can also be used for disabling ranges. If ranges is nil, the parser is set to parse the whole buffer.

Example:

(treesit-parser-set-included-ranges
 parser '((1 . 9) (16 . 24) (24 . 25)))

Function: treesit-parser-included-ranges parser

This function returns the ranges set for parser. The return value has the same form as the ranges argument of treesit-parser-set-included-ranges: a list of cons cells of the form (beg . end). If parser doesn’t have any ranges, the return value is nil.

(treesit-parser-included-ranges parser)
    ⇒ ((1 . 9) (16 . 24) (24 . 25))

Function: treesit-query-range source query &optional beg end

This function matches source with query and returns the ranges of captured nodes. The return value is a list of cons cells of the form (beg . end), where beg and end specify the beginning and the end of a region of text.

For convenience, source can be a language symbol, a parser, or a node. If it’s a language symbol, this function matches in the root node of the first parser using that language; if a parser, this function matches in the root node of that parser; if a node, this function matches in that node.

The argument query is the query used to capture nodes (see Pattern Matching Tree-sitter Nodes). The capture names don’t matter. The arguments beg and end, if both non-nil, limit the range in which this function queries.

Like other query functions, this function signals the treesit-query-error error if query is malformed.
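As an illustration of the arguments described above, here is a small sketch that wraps treesit-query-range. The helper name my-script-text-ranges is hypothetical, and the node type names (script_element, raw_text) are assumed to match the HTML grammar used in the example later in this section.

```elisp
(require 'treesit)

(defun my-script-text-ranges (&optional beg end)
  "Return the ranges of JavaScript text found by the HTML parser.
If BEG and END are both non-nil, only query between them.
Hypothetical helper; assumes the buffer is parsed with an HTML parser."
  (treesit-query-range
   ;; A language symbol: match in the root node of that language's parser.
   'html
   ;; The capture name is arbitrary; only the captured nodes' extents matter.
   '((script_element (raw_text) @js-text))
   beg end))
```

Calling (my-script-text-ranges) would query the whole parse tree, while (my-script-text-ranges (window-start) (window-end)) would restrict the query to the visible portion of the buffer.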

Supporting multiple languages in Lisp programs

It should suffice for general Lisp programs to call the following two functions in order to support program sources that mix multiple languages.

Function: treesit-update-ranges &optional beg end

This function updates ranges for parsers in the buffer. It makes sure the parsers’ ranges are set correctly between beg and end, according to treesit-range-settings. If omitted, beg defaults to the beginning of the buffer, and end defaults to the end of the buffer.

For example, fontification functions use this function before querying for nodes in a region.

Function: treesit-language-at pos

This function returns the language of the text at buffer position pos. Under the hood it calls treesit-language-at-point-function and returns its return value. If treesit-language-at-point-function is nil, this function returns the language of the first parser in the list returned by treesit-parser-list. If there is no parser in the buffer, it returns nil.
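A Lisp program might combine the two functions as in the following sketch. The command name my-describe-language-at-point is hypothetical, and updating only the current line is an arbitrary choice made for illustration.

```elisp
(require 'treesit)

(defun my-describe-language-at-point ()
  "Report the language of the text at point.
Hypothetical command for illustration only."
  (interactive)
  ;; Make sure parser ranges are up to date around point before
  ;; asking which language is responsible for the text there.
  (treesit-update-ranges (line-beginning-position) (line-end-position))
  (message "Language at point: %s" (treesit-language-at (point))))
```

In a buffer whose major mode has not set up any parser, the reported language would be nil, as described above.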

Supporting multiple languages in major modes

Normally, in a set of languages that can be mixed together, there is a host language and one or more embedded languages. A Lisp program usually first parses the whole document with the host language’s parser, retrieves some information, sets ranges for the embedded languages with that information, and then parses the embedded languages.

Take a buffer containing HTML, CSS and JavaScript as an example. A Lisp program will first parse the whole buffer with an HTML parser, then query the parser for style_element and script_element nodes, which correspond to CSS and JavaScript text, respectively. Then it sets the ranges of the CSS and JavaScript parsers to the ranges that their corresponding nodes span.

Given a simple HTML document:

<html>
  <script>1 + 2</script>
  <style>body { color: blue; }</style>
</html>

a Lisp program will first parse with an HTML parser, then set ranges for CSS and JavaScript parsers:

;; Create parsers.
(setq html (treesit-parser-create 'html))
(setq css (treesit-parser-create 'css))
(setq js (treesit-parser-create 'javascript))
;; Set CSS ranges.
(setq css-range
      (treesit-query-range
       'html
       "(style_element (raw_text) @capture)"))
(treesit-parser-set-included-ranges css css-range)
;; Set JavaScript ranges.
(setq js-range
      (treesit-query-range
       'html
       "(script_element (raw_text) @capture)"))
(treesit-parser-set-included-ranges js js-range)

Emacs automates this process in treesit-update-ranges. A multi-language major mode should set treesit-range-settings so that treesit-update-ranges knows how to perform this process automatically. Major modes should use the helper function treesit-range-rules to generate a value that can be assigned to treesit-range-settings. The settings in the following example translate directly into the operations shown above.

(setq-local treesit-range-settings
            (treesit-range-rules
             :embed 'javascript
             :host 'html
             '((script_element (raw_text) @capture))
             :embed 'css
             :host 'html
             '((style_element (raw_text) @capture))))

Function: treesit-range-rules &rest query-specs

This function is used to set treesit-range-settings. It takes care of compiling queries and other post-processing, and returns a value suitable for treesit-range-settings.

It takes a series of query-specs, where each query-spec is a query preceded by zero or more keyword/value pairs. Each query is a tree-sitter query in either the string, s-expression or compiled form, or a function.

If query is a tree-sitter query, it should be preceded by two keyword/value pairs, where the :embed keyword specifies the embedded language, and the :host keyword specifies the host language.

treesit-update-ranges uses query to figure out how to set the ranges for parsers for the embedded language. It runs query against a host language parser, computes the ranges that the captured nodes span, and applies these ranges to the embedded language parsers.

If query is a function, it doesn’t need any keyword/value pair. It should be a function that takes two arguments, start and end, and sets the ranges for parsers in the current buffer in the region between start and end. It is fine for this function to set ranges in a larger region that encompasses the region between start and end.
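The function form can be sketched as follows. The names my-update-css-ranges and my-setup-range-settings are hypothetical. The sketch assumes treesit-parser-create, which in released Emacs versions returns the existing parser for a language or creates one, and it assumes the style_element and raw_text node types of the HTML grammar.

```elisp
(require 'treesit)

(defun my-update-css-ranges (_start _end)
  "Set the CSS parser's ranges from style elements in the HTML tree.
Hypothetical range function.  It ignores the requested region and
recomputes ranges for the whole buffer, which encompasses any region."
  (let ((css (treesit-parser-create 'css))
        (ranges (treesit-query-range
                 'html '((style_element (raw_text) @capture)))))
    (treesit-parser-set-included-ranges
     css
     ;; Passing nil would make the parser cover the whole buffer, so
     ;; fall back to an empty range when no style element is found.
     (or ranges (list (cons (point-min) (point-min)))))))

(defun my-setup-range-settings ()
  "Install `my-update-css-ranges' as this buffer's range settings."
  (setq-local treesit-range-settings
              (treesit-range-rules #'my-update-css-ranges)))
```

A major mode would call something like my-setup-range-settings from its body. Recomputing ranges for the whole buffer is the simplest choice but not the cheapest; a function that restricts the query to the region between start and end would do less work on large buffers.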

Variable: treesit-range-settings

This variable helps treesit-update-ranges in updating the ranges for parsers in the buffer. It is a list of settings where the exact format of a setting is considered internal. You should use treesit-range-rules to generate a value that this variable can have.

Variable: treesit-language-at-point-function

This variable’s value should be a function that takes a single argument, pos, which is a buffer position, and returns the language of the buffer text at pos. This variable is used by treesit-language-at.
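For the HTML example in this section, such a function might look like the following sketch. The names my-html-language-at-point and my-setup-language-at-point are hypothetical, and the node type names are assumed to match the HTML grammar used in the earlier example.

```elisp
(require 'treesit)

(defun my-html-language-at-point (pos)
  "Return the language of the text at POS in an HTML buffer.
Hypothetical function; returns `javascript', `css', or `html'."
  (let* ((node (treesit-node-at pos 'html))
         (parent (and node (treesit-node-parent node))))
    ;; Embedded code is the raw text inside a script or style element;
    ;; everything else belongs to the host language.
    (if (and parent (equal (treesit-node-type node) "raw_text"))
        (pcase (treesit-node-type parent)
          ("script_element" 'javascript)
          ("style_element" 'css)
          (_ 'html))
      'html)))

(defun my-setup-language-at-point ()
  "Make `treesit-language-at' use `my-html-language-at-point' here."
  (setq-local treesit-language-at-point-function
              #'my-html-language-at-point))
```

This version decides by inspecting the host language's parse tree, so it does not depend on the embedded parsers' ranges having been set already.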

