RFC: Emacs tree-sitter integration
Tree-sitter is a incremental parser that can provide a concrete syntax tree for the source code and is fast enough to parse on each key press. It has supported a wide range of languages, and support for more languages is on the way.
I’ve been working on a integration of tree-sitter library into Emacs’ core. The integration consists of two parts, first the direct translate of tree-sitter’s API, second the integration with Emacs’ font-lock and indent system. The first part is completed and is rather uncontentious. I’d appreciate comments on the second: Is the interface easy to understand? Is it easy to use? Is it flexible enough for every language?
Whether you are a major mode author or just a interested Emacs user, I invite you to try hacking with this tree-sitter integration—recreate existing major mode features (font-lock, indent), create new features (structured editing, etc)—and tell me how well it works. Better yet, provide some suggestions on improving the interface.
Building Emacs with tree-sitter support
Install tree-sittter
First, install libtree-sitter, either by a package manager, or from source:
git clone https://github.com/tree-sitter/tree-sitter.git cd tree-sitter make make install
This should install libtree-sitter in standard location.
Build Emacs
Then, build Emacs from my GitHub repository. Make sure you clone the
ts branch.
git clone https://github.com/casouri/emacs.git --branch ts ./autogen.sh ./configure make
No need for special configure flags, tree-sitter is enabled automatically if libtree-sitter is present on the system. Now Emacs can be started by
src/emacs
Get language definitions
To use tree-sitter features in any meaningful way, we also need the language definition, eg, libtree-sitter-c for C. I wrote a script for automatically retrieving and compiling some of the libraries. The following commands
git clone https://github.com/casouri/tree-sitter-module.git cd tree-sitter-module ./batch-new.sh
should produce libraries for C, JSON, Go, HTML, JavaScript, CSS and
Python and store them in dist directory. From there you can
copy these libraries to a standard path, or add that directory to
LD_LIBRARY_PATH.
You can also find pre-built libraries in the release page: tree-sitter-module release v2.0.
Basic tree-sitter features
I suggest reading the tree-sitter node in the manual first, it covers how to create a parser, how to retrieve a node, how to pattern match nodes, and more. You can access the manual by typing
C-h i m elisp RET g Parsing Program Source RET
The command(s) above opens the Info reader, goes to Elisp Reference Manual, and opens the “Parsing Program Source” node, which contains manual for tree-sitter. Alternatively, you can read the tree-sitter node that I clipped from the HTML manuel.
Once you’ve read the manual, you can (require
'tree-sitter) and hack away!
The manual only documents basic features of tree-sitter, leaving out font-lock and indent integration, because I expect the latter to change. They are instead documented below.
Font-lock interface
(From now on, I assume you have read the manual and I will use concepts introduced in the manual without explanation.)
If you are familiar with font-lock in Emacs, you know it is primarily
configured by font-lock-defaults: major mode sets this
variable with language-specific configuration, font-lock takes that
variable and populate font-lock-keywords, which directly
defines the pattern to fontify.
tree-sitter-font-lock-settings
Tree-sitter1
provides two analogues variables,
tree-sitter-font-lock-defaults and
tree-sitter-font-lock-settings.
tree-sitter-font-lock-settings is a list of
SETTINGs where each SETTING looks like
(LANGUAGE QUERY)
LANGUAGE is the language this setting should use, and
QUERY is either a string or a sexp query. Each capture name
in QUERY is either a face name, in which case the captured
node is fontified in that face, or a function name, in which case the
captured node is passed to the function for fontification. Specifically,
the function is passed three arguments (BEG END NODE), where
BEG and END is the beginning and end position
of the node in the buffer, for convenience.
An example SETTING for C is
(tree-sitter-c ; LANGUAGE ((null) @font-lock-constant-face (true) @font-lock-constant-face (false) @font-lock-constant-face)) ; QUERY
tree-sitter-font-lock-defaults
Tree-sitter font-lock, like font-lock, support fontification at
different levels of decoration (controlled by
font-lock-maximum-decoration). And this is the primary
purpose of tree-sitter-font-lock-defaults. Its value is a
list of
(DEFAULT :KEYWORD VALUE...)
Where each DEFAULT may be a symbol or a list of symbols.
The symbol should be either a variable containing (LANGUAGE
QUERY), or a function that returns that. If DEFAULT
is a list, each symbol corresponds to a decoration level. For example, if
I want to implement three levels of decoration for C, I would populate
tree-sitter-font-lock-defaults with
(((c-font-lock-settings-1 c-font-lock-settings-2 c-font-lock-settings-3) :KEYWORD VALUE...))
where c-font-lock-settings-1 would contain, say,
(tree-sitter-c ((null) @font-lock-constant-face (true) @font-lock-constant-face (false) @font-lock-constant-face))
for those who need no more. And the other two levels could be for the
rest mortals. As for :KEYWORD and VALUE, they
are analogues to that in font-lock-defaults, used for
specifying other configurations. Currently they are not used for
tree-sitter font-lock.
To enable tree-sitter font-lock, a major mode should first assign
tree-sitter-font-lock-defaults, then call
tree-sitter-font-lock-enable. For example,
(define-derived-mode ts-c-mode prog-mode "tree-sitter C"
(setq-local tree-sitter-font-lock-defaults
'((ts-c-tree-sitter-settings-1)))
(tree-sitter-enable-font-lock))
Indentation
In Emacs, indentation is provided by
indent-line-function. Tree-sitter provides a convenient
system, tree-sitter-simple-indent, to simplify the
implementation of a indenting function. To use it, bind
indent-line-function to tree-sitter-indent, and
fill in indentation configurations in
tree-sitter-simple-indent-rules.
tree-sitter-simple-indent-rules is a list of rules, and
each rule looks like
(MATCHER ANCHOR OFFSET)
When indenting, tree-sitter-simple-indent finds the largest
node that starts at the beginning of the current line, and matches it
against each MATCHER in
tree-sitter-simple-indent-rules. If MATCHER
matches that node, ANCHOR and OFFSET determines
how to indent—find the column of ANCHOR (which represents a
point), and add OFFSET to it.
By now you must be wondering what the heck is MATCHER. It
is a function that takes (NODE PARENT BOL &rest _) as
arguments, if the rule should apply to NODE, it returns
non-nil. PARENT and BOL (position of beginning
of line) are provided just for convenience. The “&rest
_” part is required to allow the possibility to extend the
interface in the future.
This function can do anything: check the type of that node, check the
type of its parent, check whether this node is the first child node of
its parent, etc. ANCHOR is also a function that takes theses
arguments, but it returns a point, the “anchor”. If the rule determines
that the node should be indented two columns inward comparing to its
parent, ANCHOR should return the start of the parent node,
and OFFSET should be 2.
For example, the following rule matches any line that starts with the
null keyword, and indents the line inwards by two columns
against the null’s parent node.
((lambda (n p bol &rest _)
(equal (tree-sitter-node-type n) "null")) ; MATCHER
(lambda (n p bol &rest _)
(tree-sitter-node-start
(tree-sitter-node-parent n))) ; ANCHOR
2) ; OFFSET
Of course, it is terribly tedious to write out every
MATCHER and ANCHOR explicitly.
tree-sitter-simple-indent provides some predefined
MATCHER and ANCHOR functions. Most of them are
higher-order functions: they takes an argument and returns a
function.
MATCHER presets:
(parent-is TYPE)- Check that the parent has type
TYPE. (node-is TYPE)- Check that node has type
TYPE. (match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX)NODE-TYPEchecks for node’s type,PARENT-TYPEchecks for parent’s type,NODE-FIELDchecks for the field name for node int the parent,NODE-INDEX-MINandNODE-INDEX-MAXlimits the node’s index in the parent. Any argument left as nil are not checked. For example, to match the node that is the first child and has a parent of typeargument_list, use
(match nil "argument_list" nil nil 0 0)(query QUERY)- Queries the parent with
QUERY. Matches if the node is captured by any capture name. no-node- Matches null node. When the current line is empty, there is no node at the beginning, so the node is nil.
ANCHOR presets:
first-child- Finds the first sibling of node, ie, the first child of the parent.
parent- Finds the parent node.
prev-sibling- Finds node’s first sibling.
no-indent- Do nothing, don’t indent. This is useful for a indenting a line inside a multiline string, where masterful inactivity is most preferred.
prev-line- Find the named node on the previous line. This can be used when indenting an empty line: just indent like the previous node.
Some handy tools
I have two handy tools for you to work with tree-sitter more easily:
first, tree-sitter-inspect-mode will show the relevant
information of the node at point in the mode-line; second,
tree-sitter-check-indent can check the indent result against
a stock major mode. Check out their docstring for more detail.
Feedback
You can send a message to emacs-devel, or open an issue on the GitHub repository.
An example
All these must be pretty confusing without seeing a concrete example,
so here it is. This example code is for a demo C major mode,
ts-c-mode, defined in the “;;; Lab” section in
tree-sitter.el. (Here is a
link to the file on GitHub.)
Indent:
(defvar ts-c-tree-sitter-indent-rules
`((tree-sitter-c
;; Empty line.
(no-node prev-line 0)
;; Function/struct definition body {}.
((match nil "function_definition" "body") parent 0)
((node-is "field_declaration_list") parent 0)
;; Call expression.
((parent-is "call_expression") parent 2)
;; If-else.
((match nil "if_statement" "condition") parent 2)
((match nil "if_statement" "consequence") parent 2)
((match nil "if_statement" "alternative") parent 2)
((match nil "switch_statement" "condition") parent 2)
((node-is "else") parent 0)
;; Switch case.
((parent-is "case_statement") parent 2)
((node-is "case_statement") parent 0)
;; { and }.
((node-is "compound_statement") parent 2)
((node-is "}") parent 0)
;; Multi-line string.
((parent-is "string_literal") no-indent 0)
;; List.
,@(cl-loop for type in '("compound_statement" "initializer_list"
"argument_list" "parameter_list"
"field_declaration_list")
collect `((match nil ,type nil 0 0) parent 2)
collect `((match nil ,type nil 1) first-sibling 0)))))
Font-lock:
(defvar ts-c-tree-sitter-settings-1
'(tree-sitter-c
((null) @font-lock-constant-face
(true) @font-lock-constant-face
(false) @font-lock-constant-face
(comment) @font-lock-comment-face
(system_lib_string) @ts-c-fontify-system-lib
(unary_expression
operator: _ @font-lock-negation-char-face)
(string_literal) @font-lock-string-face
(char_literal) @font-lock-string-face
(function_definition
declarator: (identifier) @font-lock-function-name-face)
(declaration
declarator: (identifier) @font-lock-function-name-face)
(function_declarator
declarator: (identifier) @font-lock-function-name-face)
(init_declarator
declarator: (identifier) @font-lock-variable-name-face)
(parameter_declaration
declarator: (identifier) @font-lock-variable-name-face)
(preproc_def
name: (identifier) @font-lock-variable-name-face)
(enumerator
name: (identifier) @font-lock-variable-name-face)
(field_identifier) @font-lock-variable-name-face
(parameter_list
(parameter_declaration
(identifier) @font-lock-variable-name-face))
(pointer_declarator
declarator: (identifier) @font-lock-variable-name-face)
(array_declarator
declarator: (identifier) @font-lock-variable-name-face)
(preproc_function_def
name: (identifier) @font-lock-variable-name-face
parameters: (preproc_params
(identifier) @font-lock-variable-name-face))
(type_identifier) @font-lock-type-face
(primitive_type) @font-lock-type-face
"auto" @font-lock-keyword-face
"break" @font-lock-keyword-face
"case" @font-lock-keyword-face
"const" @font-lock-keyword-face
"continue" @font-lock-keyword-face
"default" @font-lock-keyword-face
"do" @font-lock-keyword-face
"else" @font-lock-keyword-face
"enum" @font-lock-keyword-face
"extern" @font-lock-keyword-face
"for" @font-lock-keyword-face
"goto" @font-lock-keyword-face
"if" @font-lock-keyword-face
"register" @font-lock-keyword-face
"return" @font-lock-keyword-face
"sizeof" @font-lock-keyword-face
"static" @font-lock-keyword-face
"struct" @font-lock-keyword-face
"switch" @font-lock-keyword-face
"typedef" @font-lock-keyword-face
"union" @font-lock-keyword-face
"volatile" @font-lock-keyword-face
"while" @font-lock-keyword-face
"long" @font-lock-type-face
"short" @font-lock-type-face
"signed" @font-lock-type-face
"unsigned" @font-lock-type-face
"#include" @font-lock-preprocessor-face
"#define" @font-lock-preprocessor-face
"#ifdef" @font-lock-preprocessor-face
"#ifndef" @font-lock-preprocessor-face
"#endif" @font-lock-preprocessor-face
"#else" @font-lock-preprocessor-face
"#elif" @font-lock-preprocessor-face
)))