RFC: Emacs tree-sitter integration
Tree-sitter is a incremental parser that can provide a concrete syntax tree for the source code and is fast enough to parse on each key press. It has supported a wide range of languages, and support for more languages is on the way.
I’ve been working on a integration of tree-sitter library into Emacs’ core. The integration consists of two parts, first the direct translate of tree-sitter’s API, second the integration with Emacs’ font-lock and indent system. The first part is completed and is rather uncontentious. I’d appreciate comments on the second: Is the interface easy to understand? Is it easy to use? Is it flexible enough for every language?
Whether you are a major mode author or just a interested Emacs user, I invite you to try hacking with this tree-sitter integration—recreate existing major mode features (font-lock, indent), create new features (structured editing, etc)—and tell me how well it works. Better yet, provide some suggestions on improving the interface.
Building Emacs with tree-sitter support
Install tree-sittter
First, install libtree-sitter, either by a package manager, or from source:
git clone https://github.com/tree-sitter/tree-sitter.git cd tree-sitter make make install
This should install libtree-sitter in standard location.
Build Emacs
Then, build Emacs from my GitHub repository. Make sure you clone the
ts
branch.
git clone https://github.com/casouri/emacs.git --branch ts ./autogen.sh ./configure make
No need for special configure flags, tree-sitter is enabled automatically if libtree-sitter is present on the system. Now Emacs can be started by
src/emacs
Get language definitions
To use tree-sitter features in any meaningful way, we also need the language definition, eg, libtree-sitter-c for C. I wrote a script for automatically retrieving and compiling some of the libraries. The following commands
git clone https://github.com/casouri/tree-sitter-module.git cd tree-sitter-module ./batch-new.sh
should produce libraries for C, JSON, Go, HTML, JavaScript, CSS and
Python and store them in dist
directory. From there you can
copy these libraries to a standard path, or add that directory to
LD_LIBRARY_PATH
.
You can also find pre-built libraries in the release page: tree-sitter-module release v2.0.
Basic tree-sitter features
I suggest reading the tree-sitter node in the manual first, it covers how to create a parser, how to retrieve a node, how to pattern match nodes, and more. You can access the manual by typing
C-h i m elisp RET g Parsing Program Source RET
The command(s) above opens the Info reader, goes to Elisp Reference Manual, and opens the “Parsing Program Source” node, which contains manual for tree-sitter. Alternatively, you can read the tree-sitter node that I clipped from the HTML manuel.
Once you’ve read the manual, you can (require
'tree-sitter)
and hack away!
The manual only documents basic features of tree-sitter, leaving out font-lock and indent integration, because I expect the latter to change. They are instead documented below.
Font-lock interface
(From now on, I assume you have read the manual and I will use concepts introduced in the manual without explanation.)
If you are familiar with font-lock in Emacs, you know it is primarily
configured by font-lock-defaults
: major mode sets this
variable with language-specific configuration, font-lock takes that
variable and populate font-lock-keywords
, which directly
defines the pattern to fontify.
tree-sitter-font-lock-settings
Tree-sitter1
provides two analogues variables,
tree-sitter-font-lock-defaults
and
tree-sitter-font-lock-settings
.
tree-sitter-font-lock-settings
is a list of
SETTING
s where each SETTING
looks like
(LANGUAGE QUERY)
LANGUAGE
is the language this setting should use, and
QUERY
is either a string or a sexp query. Each capture name
in QUERY
is either a face name, in which case the captured
node is fontified in that face, or a function name, in which case the
captured node is passed to the function for fontification. Specifically,
the function is passed three arguments (BEG END NODE)
, where
BEG
and END
is the beginning and end position
of the node in the buffer, for convenience.
An example SETTING
for C is
(tree-sitter-c ; LANGUAGE ((null) @font-lock-constant-face (true) @font-lock-constant-face (false) @font-lock-constant-face)) ; QUERY
tree-sitter-font-lock-defaults
Tree-sitter font-lock, like font-lock, support fontification at
different levels of decoration (controlled by
font-lock-maximum-decoration
). And this is the primary
purpose of tree-sitter-font-lock-defaults
. Its value is a
list of
(DEFAULT :KEYWORD VALUE...)
Where each DEFAULT
may be a symbol or a list of symbols.
The symbol should be either a variable containing (LANGUAGE
QUERY)
, or a function that returns that. If DEFAULT
is a list, each symbol corresponds to a decoration level. For example, if
I want to implement three levels of decoration for C, I would populate
tree-sitter-font-lock-defaults
with
(((c-font-lock-settings-1 c-font-lock-settings-2 c-font-lock-settings-3) :KEYWORD VALUE...))
where c-font-lock-settings-1
would contain, say,
(tree-sitter-c ((null) @font-lock-constant-face (true) @font-lock-constant-face (false) @font-lock-constant-face))
for those who need no more. And the other two levels could be for the
rest mortals. As for :KEYWORD
and VALUE
, they
are analogues to that in font-lock-defaults
, used for
specifying other configurations. Currently they are not used for
tree-sitter font-lock.
To enable tree-sitter font-lock, a major mode should first assign
tree-sitter-font-lock-defaults
, then call
tree-sitter-font-lock-enable
. For example,
(define-derived-mode ts-c-mode prog-mode "tree-sitter C" (setq-local tree-sitter-font-lock-defaults '((ts-c-tree-sitter-settings-1))) (tree-sitter-enable-font-lock))
Indentation
In Emacs, indentation is provided by
indent-line-function
. Tree-sitter provides a convenient
system, tree-sitter-simple-indent, to simplify the
implementation of a indenting function. To use it, bind
indent-line-function
to tree-sitter-indent
, and
fill in indentation configurations in
tree-sitter-simple-indent-rules
.
tree-sitter-simple-indent-rules
is a list of rules, and
each rule looks like
(MATCHER ANCHOR OFFSET)
When indenting, tree-sitter-simple-indent finds the largest
node that starts at the beginning of the current line, and matches it
against each MATCHER
in
tree-sitter-simple-indent-rules
. If MATCHER
matches that node, ANCHOR
and OFFSET
determines
how to indent—find the column of ANCHOR
(which represents a
point), and add OFFSET
to it.
By now you must be wondering what the heck is MATCHER
. It
is a function that takes (NODE PARENT BOL &rest _)
as
arguments, if the rule should apply to NODE
, it returns
non-nil. PARENT
and BOL
(position of beginning
of line) are provided just for convenience. The “&rest
_
” part is required to allow the possibility to extend the
interface in the future.
This function can do anything: check the type of that node, check the
type of its parent, check whether this node is the first child node of
its parent, etc. ANCHOR
is also a function that takes theses
arguments, but it returns a point, the “anchor”. If the rule determines
that the node should be indented two columns inward comparing to its
parent, ANCHOR
should return the start of the parent node,
and OFFSET
should be 2.
For example, the following rule matches any line that starts with the
null
keyword, and indents the line inwards by two columns
against the null
’s parent node.
((lambda (n p bol &rest _) (equal (tree-sitter-node-type n) "null")) ; MATCHER (lambda (n p bol &rest _) (tree-sitter-node-start (tree-sitter-node-parent n))) ; ANCHOR 2) ; OFFSET
Of course, it is terribly tedious to write out every
MATCHER
and ANCHOR
explicitly.
tree-sitter-simple-indent provides some predefined
MATCHER
and ANCHOR
functions. Most of them are
higher-order functions: they takes an argument and returns a
function.
MATCHER
presets:
(parent-is TYPE)
- Check that the parent has type
TYPE
. (node-is TYPE)
- Check that node has type
TYPE
. (match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX)
NODE-TYPE
checks for node’s type,PARENT-TYPE
checks for parent’s type,NODE-FIELD
checks for the field name for node int the parent,NODE-INDEX-MIN
andNODE-INDEX-MAX
limits the node’s index in the parent. Any argument left as nil are not checked. For example, to match the node that is the first child and has a parent of typeargument_list
, use
(match nil "argument_list" nil nil 0 0)
(query QUERY)
- Queries the parent with
QUERY
. Matches if the node is captured by any capture name. no-node
- Matches null node. When the current line is empty, there is no node at the beginning, so the node is nil.
ANCHOR
presets:
first-child
- Finds the first sibling of node, ie, the first child of the parent.
parent
- Finds the parent node.
prev-sibling
- Finds node’s first sibling.
no-indent
- Do nothing, don’t indent. This is useful for a indenting a line inside a multiline string, where masterful inactivity is most preferred.
prev-line
- Find the named node on the previous line. This can be used when indenting an empty line: just indent like the previous node.
Some handy tools
I have two handy tools for you to work with tree-sitter more easily:
first, tree-sitter-inspect-mode
will show the relevant
information of the node at point in the mode-line; second,
tree-sitter-check-indent
can check the indent result against
a stock major mode. Check out their docstring for more detail.
Feedback
You can send a message to emacs-devel, or open an issue on the GitHub repository.
An example
All these must be pretty confusing without seeing a concrete example,
so here it is. This example code is for a demo C major mode,
ts-c-mode
, defined in the “;;; Lab
” section in
tree-sitter.el
. (Here is a
link to the file on GitHub.)
Indent:
(defvar ts-c-tree-sitter-indent-rules `((tree-sitter-c ;; Empty line. (no-node prev-line 0) ;; Function/struct definition body {}. ((match nil "function_definition" "body") parent 0) ((node-is "field_declaration_list") parent 0) ;; Call expression. ((parent-is "call_expression") parent 2) ;; If-else. ((match nil "if_statement" "condition") parent 2) ((match nil "if_statement" "consequence") parent 2) ((match nil "if_statement" "alternative") parent 2) ((match nil "switch_statement" "condition") parent 2) ((node-is "else") parent 0) ;; Switch case. ((parent-is "case_statement") parent 2) ((node-is "case_statement") parent 0) ;; { and }. ((node-is "compound_statement") parent 2) ((node-is "}") parent 0) ;; Multi-line string. ((parent-is "string_literal") no-indent 0) ;; List. ,@(cl-loop for type in '("compound_statement" "initializer_list" "argument_list" "parameter_list" "field_declaration_list") collect `((match nil ,type nil 0 0) parent 2) collect `((match nil ,type nil 1) first-sibling 0)))))
Font-lock:
(defvar ts-c-tree-sitter-settings-1 '(tree-sitter-c ((null) @font-lock-constant-face (true) @font-lock-constant-face (false) @font-lock-constant-face (comment) @font-lock-comment-face (system_lib_string) @ts-c-fontify-system-lib (unary_expression operator: _ @font-lock-negation-char-face) (string_literal) @font-lock-string-face (char_literal) @font-lock-string-face (function_definition declarator: (identifier) @font-lock-function-name-face) (declaration declarator: (identifier) @font-lock-function-name-face) (function_declarator declarator: (identifier) @font-lock-function-name-face) (init_declarator declarator: (identifier) @font-lock-variable-name-face) (parameter_declaration declarator: (identifier) @font-lock-variable-name-face) (preproc_def name: (identifier) @font-lock-variable-name-face) (enumerator name: (identifier) @font-lock-variable-name-face) (field_identifier) @font-lock-variable-name-face (parameter_list (parameter_declaration (identifier) @font-lock-variable-name-face)) (pointer_declarator declarator: (identifier) @font-lock-variable-name-face) (array_declarator declarator: (identifier) @font-lock-variable-name-face) (preproc_function_def name: (identifier) @font-lock-variable-name-face parameters: (preproc_params (identifier) @font-lock-variable-name-face)) (type_identifier) @font-lock-type-face (primitive_type) @font-lock-type-face "auto" @font-lock-keyword-face "break" @font-lock-keyword-face "case" @font-lock-keyword-face "const" @font-lock-keyword-face "continue" @font-lock-keyword-face "default" @font-lock-keyword-face "do" @font-lock-keyword-face "else" @font-lock-keyword-face "enum" @font-lock-keyword-face "extern" @font-lock-keyword-face "for" @font-lock-keyword-face "goto" @font-lock-keyword-face "if" @font-lock-keyword-face "register" @font-lock-keyword-face "return" @font-lock-keyword-face "sizeof" @font-lock-keyword-face "static" @font-lock-keyword-face "struct" @font-lock-keyword-face "switch" @font-lock-keyword-face "typedef" @font-lock-keyword-face "union" @font-lock-keyword-face "volatile" @font-lock-keyword-face "while" @font-lock-keyword-face "long" @font-lock-type-face "short" @font-lock-type-face "signed" @font-lock-type-face "unsigned" @font-lock-type-face "#include" @font-lock-preprocessor-face "#define" @font-lock-preprocessor-face "#ifdef" @font-lock-preprocessor-face "#ifndef" @font-lock-preprocessor-face "#endif" @font-lock-preprocessor-face "#else" @font-lock-preprocessor-face "#elif" @font-lock-preprocessor-face )))