Next: Using Tree-sitter Parser, Up: Parsing Program Source [Contents][Index]
Tree-sitter relies on language definitions to parse text in that
language. In Emacs, A language definition is represented by a symbol
tree-sitter-<language>
. For example, C language definition is
represented as tree-sitter-c
, and tree-sitter-c
can be
passed to tree-sitter functions as the language argument.
Tree-sitter language definitions are distributed as dynamic
libraries. In order to use a language definition in Emacs, you need to
make sure that the dynamic library is installed on the system, either
in standard locations or in LD_LIBRARY_PATH
(on some systems,
it is DYLD_LIBRARY_PATH
). If Emacs cannot find the library or
has problem loading it, Emacs signals
tree-sitter-load-language-error. The signal data is a list of
specific error messages.
This function checks whether the dynamic library for language is present on the system, and return non-nil if it is.
By convention, the dynamic library for tree-sitter-<language>
is libtree-sitter-<language>.ext
, where ext is the
system-specific extension for dynamic libraries. Also by convention,
the function provided by that library is named
tree_sitter_<language>
. If a language definition doesn’t
follow this convention, you should add an entry
(language-symbol library-base-name function-name)
to tree-sitter-load-name-override-list, where
library-base-name is the base filename for the dynamic library
(conventionally libtree-sitter-<language>
), and
function-name is the function provided by the library
(conventionally tree_sitter_<language>
). For example,
(tree-sitter-cool-lang "libtree-sitter-coool" "tree_sitter_coool")
for a language too cool to abide by the rules.
A syntax tree is what a language definition defines (more or less) and what a parser generates. In a syntax tree, each node represents a piece of text, and is connected to each other by a parent-child relationship. For example, if the source text is
1 + 2
its syntax tree could be
+--------------+ | root "1 + 2" | +--------------+ | +--------------------------------+ | expression "1 + 2" | +--------------------------------+ | | | +------------+ +--------------+ +------------+ | number "1" | | operator "+" | | number "2" | +------------+ +--------------+ +------------+
We can also represent it in s-expression:
(root (expression (number) (operator) (number)))
Names like root
, expression
, number
,
operator
are nodes’ type. However, not all nodes in a
syntax tree have a type. Nodes that don’t are anonymous nodes,
and nodes with a type are named nodes. Anonymous nodes are
tokens with fixed spellings, including punctuation characters like
bracket ‘]’, and keywords like return
.
To make the syntax tree easier to
analyze, many language definitions assign field names to child
nodes. For example, a function_definition
node could have a
declarator
and a body
:
(function_definition declarator: (declaration) body: (compound_statement))
This minor mode displays the node that starts at point in mode-line. The mode-line will display
parent field-name: (child (grand-child (...)))
child, grand-child, and grand-grand-child, etc, are nodes that have their beginning at point. And parent is the parent of child.
If there is no node that starts at point, i.e., point is in the middle of a node, then the mode-line only displays the smallest node that spans point, and its immediate parent.
This minor mode doesn’t create parsers on its own. It simply uses the first parser in tree-sitter-parser-list (see Using Tree-sitter Parser).
Authors of language definitions define the grammar of a language, and this grammar determines how does a parser construct a concrete syntax tree out of the text. In order to used the syntax tree effectively, we need to read the grammar file.
The grammar file is usually grammar.js
in a language
definition’s project repository. The link to a language definition’s
home page can be found in tree-sitter’s homepage
(https://tree-sitter.github.io/tree-sitter).
The grammar is written in JavaScript syntax. For example, the rule
matching a function_definition
node looks like
function_definition: $ => seq( $.declaration_specifiers, field('declarator', $.declaration), field('body', $.compound_statement) )
The rule is represented by a function that takes a single argument
$, representing the whole grammar. The function itself is
constructed by other functions: the seq
function puts together a
sequence of children; the field
function annotates a child with
a field name. If we write the above definition in BNF syntax, it
would look like
function_definition := <declaration_specifiers> <declaration> <compound_statement>
and the node returned by the parser would look like
(function_definition (declaration_specifier) declarator: (declaration) body: (compound_statement))
Below is a list of functions that one will see in a grammar definition. Each function takes other rules as arguments and returns a new rule.
seq(rule1, rule2, ...)
matches each rule one after another.
choice(rule1, rule2, ...)
matches one of the rules in its
arguments.
repeat(rule)
matches rule for zero or more times.
This is like the ‘*’ operator in regular expressions.
repeat1(rule)
matches rule for one or more times.
This is like the ‘+’ operator in regular expressions.
optional(rule)
matches rule for zero or one time.
This is like the ‘?’ operator in regular expressions.
field(name, rule)
assigns field name name to the child
node matched by rule.
alias(rule, alias)
makes nodes matched by rule appear as
alias in the syntax tree generated by the parser. For example,
alias(preprocessor_call_exp, call_expression)
makes any node matched by preprocessor_call_exp
to appear as
call_expression
.
Below are grammar functions less interesting for a reader of a language definition.
token(rule)
marks rule to produce a single leaf node.
That is, instead of generating a parent node with individual child
nodes under it, everything is combined into a single leaf node.
token.immediate(rule)
changes rule to match only when
there is no preceding whitespaces.
prec(n, rule)
gives rule a level n precedence.
prec.left([n,] rule)
marks rule as left-associative,
optionally with level n.
prec.right([n,] rule)
marks rule as right-associative,
optionally with level n.
prec.dynamic(n, rule)
is like prec
, but the precedence
is applied at runtime instead.
The tree-sitter project talks about writing a grammar in more detail: https://tree-sitter.github.io/tree-sitter/creating-parsers. Read especially “The Grammar DSL” section.
Next: Using Tree-sitter Parser, Up: Parsing Program Source [Contents][Index]