Language Definitions (GNU Emacs Lisp Reference Manual)

37.1 Tree-sitter Language Definitions

Loading a language definition

Tree-sitter relies on language definitions to parse text in that language. In Emacs, A language definition is represented by a symbol tree-sitter-<language>. For example, C language definition is represented as tree-sitter-c, and tree-sitter-c can be passed to tree-sitter functions as the language argument.

Tree-sitter language definitions are distributed as dynamic libraries. In order to use a language definition in Emacs, you need to make sure that the dynamic library is installed on the system, either in standard locations or in LD_LIBRARY_PATH (on some systems, it is DYLD_LIBRARY_PATH). If Emacs cannot find the library or has problem loading it, Emacs signals tree-sitter-load-language-error. The signal data is a list of specific error messages.

Function: tree-sitter-language-available-p language ¶: This function checks whether the dynamic library for language is present on the system, and return non-nil if it is.

By convention, the dynamic library for tree-sitter-<language> is libtree-sitter-<language>.ext, where ext is the system-specific extension for dynamic libraries. Also by convention, the function provided by that library is named tree_sitter_<language>. If a language definition doesn’t follow this convention, you should add an entry

(language-symbol library-base-name function-name)

to tree-sitter-load-name-override-list, where library-base-name is the base filename for the dynamic library (conventionally libtree-sitter-<language>), and function-name is the function provided by the library (conventionally tree_sitter_<language>). For example,

(tree-sitter-cool-lang "libtree-sitter-coool" "tree_sitter_coool")

for a language too cool to abide by the rules.

Concrete syntax tree

A syntax tree is what a language definition defines (more or less) and what a parser generates. In a syntax tree, each node represents a piece of text, and is connected to each other by a parent-child relationship. For example, if the source text is

1 + 2

its syntax tree could be

                  +--------------+
                  | root "1 + 2" |
                  +--------------+
                         |
        +--------------------------------+
        |       expression "1 + 2"       |
        +--------------------------------+
           |             |            |
+------------+   +--------------+   +------------+
| number "1" |   | operator "+" |   | number "2" |
+------------+   +--------------+   +------------+

We can also represent it in s-expression:

(root (expression (number) (operator) (number)))

Node types

Names like root, expression, number, operator are nodes’ type. However, not all nodes in a syntax tree have a type. Nodes that don’t are anonymous nodes, and nodes with a type are named nodes. Anonymous nodes are tokens with fixed spellings, including punctuation characters like bracket ‘]’, and keywords like return.

Field names

To make the syntax tree easier to analyze, many language definitions assign field names to child nodes. For example, a function_definition node could have a declarator and a body:

(function_definition
 declarator: (declaration)
 body: (compound_statement))

Command: tree-sitter-inspect-mode ¶

This minor mode displays the node that starts at point in mode-line. The mode-line will display

parent field-name: (child (grand-child (...)))

child, grand-child, and grand-grand-child, etc, are nodes that have their beginning at point. And parent is the parent of child.

If there is no node that starts at point, i.e., point is in the middle of a node, then the mode-line only displays the smallest node that spans point, and its immediate parent.

This minor mode doesn’t create parsers on its own. It simply uses the first parser in tree-sitter-parser-list (see Using Tree-sitter Parser).

Reading the grammar definition

Authors of language definitions define the grammar of a language, and this grammar determines how does a parser construct a concrete syntax tree out of the text. In order to used the syntax tree effectively, we need to read the grammar file.

The grammar file is usually grammar.js in a language definition’s project repository. The link to a language definition’s home page can be found in tree-sitter’s homepage (https://tree-sitter.github.io/tree-sitter).

The grammar is written in JavaScript syntax. For example, the rule matching a function_definition node looks like

function_definition: $ => seq(
  $.declaration_specifiers,
  field('declarator', $.declaration),
  field('body', $.compound_statement)
)

The rule is represented by a function that takes a single argument $, representing the whole grammar. The function itself is constructed by other functions: the seq function puts together a sequence of children; the field function annotates a child with a field name. If we write the above definition in BNF syntax, it would look like

function_definition :=
  <declaration_specifiers> <declaration> <compound_statement>

and the node returned by the parser would look like

(function_definition
  (declaration_specifier)
  declarator: (declaration)
  body: (compound_statement))

Below is a list of functions that one will see in a grammar definition. Each function takes other rules as arguments and returns a new rule.

seq(rule1, rule2, ...) matches each rule one after another.
choice(rule1, rule2, ...) matches one of the rules in its arguments.
repeat(rule) matches rule for zero or more times. This is like the ‘*’ operator in regular expressions.
repeat1(rule) matches rule for one or more times. This is like the ‘+’ operator in regular expressions.
optional(rule) matches rule for zero or one time. This is like the ‘?’ operator in regular expressions.
field(name, rule) assigns field name name to the child node matched by rule.
alias(rule, alias) makes nodes matched by rule appear as alias in the syntax tree generated by the parser. For example,
```
alias(preprocessor_call_exp, call_expression)
```
makes any node matched by preprocessor_call_exp to appear as call_expression.

Below are grammar functions less interesting for a reader of a language definition.

token(rule) marks rule to produce a single leaf node. That is, instead of generating a parent node with individual child nodes under it, everything is combined into a single leaf node.
Normally, grammar rules ignore preceding whitespaces, token.immediate(rule) changes rule to match only when there is no preceding whitespaces.
prec(n, rule) gives rule a level n precedence.
prec.left([n,] rule) marks rule as left-associative, optionally with level n.
prec.right([n,] rule) marks rule as right-associative, optionally with level n.
prec.dynamic(n, rule) is like prec, but the precedence is applied at runtime instead.

The tree-sitter project talks about writing a grammar in more detail: https://tree-sitter.github.io/tree-sitter/creating-parsers. Read especially “The Grammar DSL” section.