Tree-sitter Starter Guide
This guide gives you a starting point on writing a tree-sitter major mode. Remember, don’t panic and check your manuals!
Build Emacs with tree-sitter
You can install tree-sitter either through your package manager or from source:
git clone https://github.com/tree-sitter/tree-sitter.git
cd tree-sitter
make
make install
To build and run Emacs 29:
git clone https://git.savannah.gnu.org/git/emacs.git -b emacs-29
cd emacs
./autogen.sh
./configure
make
src/emacs
Require the tree-sitter package with (require 'treesit). Note that tree-sitter always appears as treesit in symbols.
Now check whether Emacs was successfully built with the tree-sitter library by evaluating (treesit-available-p).
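The two steps above can be combined into a snippet you evaluate in the scratch buffer. This is only a minimal sketch; the message strings are illustrative and not part of Emacs.

```elisp
;; Load the built-in tree-sitter integration (Emacs 29 or later).
(require 'treesit)

;; `treesit-available-p' returns non-nil only when this Emacs binary
;; was compiled against the tree-sitter library.
(if (treesit-available-p)
    (message "This Emacs was built with tree-sitter support")
  (message "No tree-sitter support; rebuild Emacs with the library installed"))
```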
Tree-sitter functionality in Emacs falls into two parts: the tree-sitter API itself, and integration with fontification, indentation, Imenu, etc. You can use shortdoc to glance over all the tree-sitter API functions by typing M-x shortdoc RET treesit RET. The integrations are described in the rest of this post.
Install language definitions
Tree-sitter by itself doesn’t know how to parse any particular language. It needs the language grammar (a dynamic library) for a language to be able to parse it.
First, find the repository for the language grammar, e.g., tree-sitter-python. Take note of its Git clone URL, e.g., https://github.com/tree-sitter/tree-sitter-python.git. Now check where the parser.c file is in that repository; it is usually in src.
Make sure you have Git and C and C++ compilers installed, then run the treesit-install-language-grammar command. It will prompt for the URL and the directory of parser.c; leave the other prompts at their defaults unless you know what you are doing.
You can also clone the repository and compile it manually, then put the dynamic library in a standard library location, where Emacs will be able to find it. If you wish to put it somewhere else, set treesit-extra-load-path so Emacs can find it.
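If you would rather script the installation than answer the prompts, something along these lines should work in Emacs 29. Treat it as a sketch: the directory passed to treesit-extra-load-path is a made-up path for illustration, and the whole form is guarded so it does nothing on a build without tree-sitter.

```elisp
(require 'treesit)

(when (treesit-available-p)
  ;; Tell Emacs where to fetch the grammar source.  Each entry has the
  ;; form (LANG URL [REVISION SOURCE-DIR CC C++]); only URL is required.
  (add-to-list 'treesit-language-source-alist
               '(python "https://github.com/tree-sitter/tree-sitter-python.git"))

  ;; Clone, compile, and install the grammar unless it is already there.
  ;; This needs Git and a C compiler on your PATH.
  (unless (treesit-language-available-p 'python)
    (treesit-install-language-grammar 'python))

  ;; For a grammar you compiled by hand and keep in a non-standard
  ;; place, add its directory to the search path.  Hypothetical path.
  (add-to-list 'treesit-extra-load-path "~/src/tree-sitter-grammars/dist"))
```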
Tree-sitter major modes
Tree-sitter modes should be separate major modes, usually named xxx-ts-mode. I know I said tree-sitter always appears as treesit in symbols; this is the only exception.
If the tree-sitter mode and the “native” mode could share some setup code, you can create a “base mode”, which only contains the common setup. For example, there is python-base-mode (shared), and both python-mode (native) and python-ts-mode (tree-sitter) derive from it.
In the tree-sitter mode, check whether tree-sitter can be used with treesit-ready-p. It will emit a warning if tree-sitter is not ready (Emacs not built with tree-sitter, can’t find the language grammar, buffer too large, etc).
Fontification
Tree-sitter works like this: it parses the buffer and produces a parse tree. You provide a query made of patterns and capture names; tree-sitter finds the nodes that match these patterns, tags the corresponding capture names onto the nodes, and returns them to you. The query function returns a list of (capture-name . node).
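To see the capture list for yourself, you can run a query by hand. The sketch below assumes the Python grammar is installed (it quietly returns nil otherwise); the sample source text and the capture name "name" are arbitrary choices for illustration.

```elisp
(require 'treesit)

(with-temp-buffer
  (insert "def add(a, b):\n    return a + b\n")
  ;; The second argument suppresses the warning if the grammar is missing.
  (when (treesit-ready-p 'python t)
    (let* ((parser (treesit-parser-create 'python))
           (root (treesit-parser-root-node parser))
           ;; One pattern, one capture name: tag every identifier as `name'.
           (captures (treesit-query-capture root '((identifier) @name))))
      ;; Each element is (CAPTURE-NAME . NODE).  Replace the nodes with
      ;; their buffer text so the result is easy to read.
      (mapcar (lambda (capture)
                (cons (car capture)
                      (treesit-node-text (cdr capture) t)))
              captures))))
```

Evaluating this returns an alist pairing the capture name with the text of each identifier found in the sample.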
For fontification, we simply use face names as capture names. Each captured node is then fontified with the face named by its capture name.
The capture name could also be a function, in which case (NODE OVERRIDE START END) is passed to the function for fontification. START and END are the start and end of the region to be fontified. The function should only fontify within that region. The function should also allow more optional arguments with &rest _, for future extensibility. For OVERRIDE, check out the docstring of treesit-font-lock-rules.
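A fontification function following that calling convention could look roughly like this. The name my-lang--fontify-string is hypothetical; the sketch leans on treesit-fontify-with-override, the helper treesit.el provides for applying a face while honoring OVERRIDE and clipping to the requested region.

```elisp
(require 'treesit)

(defun my-lang--fontify-string (node override start end &rest _)
  "Fontify NODE as a string.
OVERRIDE is the override flag of the font-lock rule.  START and END
bound the region being fontified.  The ignored rest arguments leave
room for arguments added in future Emacs versions."
  (treesit-fontify-with-override
   (treesit-node-start node) (treesit-node-end node)
   'font-lock-string-face
   override
   ;; Passing START and END clips the face to the requested region.
   start end))
```

In a query you would then write the function name where a face name would normally go, e.g. (string) @my-lang--fontify-string.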
Query syntax
There are two types of nodes: “named nodes”, like (identifier), (function_definition), and “anonymous nodes”, like "return", "def", "(", ";". Parent-child relationship is expressed as
(parent (child) (child) (child (grand_child)))
E.g., an argument list (1, "3", 1) would be:
(argument_list "(" (number) (string) (number) ")")
Children could have field names:
(function_definition name: (identifier) type: (identifier))
To match any one in the list:
["true" "false" "none"]
Capture names can come after any node in the pattern:
(parent (child) @child) @parent
The query above captures both the parent and the child.
The query below captures all the keywords with capture name "keyword":
["return" "continue" "break"] @keyword
These are the most common syntax elements; check out the full syntax in the manual: Pattern Matching.
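In Emacs Lisp, queries are usually written as s-expressions rather than strings: a query is a list of patterns, field names are symbols ending in a colon, and alternations are vectors. One way to check how such a form maps to tree-sitter's text syntax is treesit-query-expand, which needs no grammar installed. The capture names below are arbitrary.

```elisp
(require 'treesit)

(when (treesit-available-p)
  ;; Two patterns: a function definition whose name field is captured,
  ;; and an alternation of anonymous keyword nodes.
  (treesit-query-expand
   '((function_definition name: (identifier) @name)
     ["return" "continue" "break"] @keyword)))
```

The returned string is the same query in the textual form that tree-sitter itself consumes.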
Query references
But how does one come up with the queries? Take Python as an example: open any Python source file and type M-x treesit-explore-mode RET. You should see the parse tree in a separate window, automatically updated as you select text or edit the buffer. Besides this, you can consult the grammar of the language definition. For example, Python’s grammar file is at
https://github.com/tree-sitter/tree-sitter-python/blob/master/grammar.js
Neovim also has a collection of queries you can use as a reference.
The manual explains how to read grammar files at the bottom of the Language Grammar section.
Debugging queries
If your query has problems, use treesit-query-validate to debug the query. It will pop up a buffer containing the query (in text format) and mark the offending part in red. Set treesit--font-lock-verbose to t if you want the font-lock function to report what it’s doing.
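For instance, validating a small Python query from the scratch buffer might look like this. It is a sketch that assumes the Python grammar is installed; the query itself is just a sample.

```elisp
(require 'treesit)

(when (treesit-ready-p 'python t)
  ;; Check the query against the Python grammar.  If something is
  ;; wrong, a buffer pops up showing the query with the bad part marked.
  (treesit-query-validate
   'python
   '((function_definition name: (identifier) @name)))

  ;; Ask the tree-sitter font-lock function to report what it is doing.
  (setq treesit--font-lock-verbose t))
```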
Set up font-lock
To enable tree-sitter font-lock, set treesit-font-lock-settings and treesit-font-lock-feature-list buffer-locally and call treesit-major-mode-setup. For example, see python--treesit-settings in python.el. Below is a snippet of it.
Note that, like the current font-lock system, if the to-be-fontified region already has a face (i.e., an earlier match fontified part or all of the region), the new face is discarded rather than applied. If you want later matches to always override earlier matches, use the :override keyword.
Each rule should have a :feature, like function-name, string-interpolation, builtin, etc. This way users can enable/disable each feature individually.
Read the manual section Parser-based Font-Lock for more detail.
Example from python.el:
(defvar python--treesit-settings
  (treesit-font-lock-rules
   :feature 'comment
   :language 'python
   '((comment) @font-lock-comment-face)

   :feature 'string
   :language 'python
   '((string) @python--treesit-fontify-string)

   :feature 'string-interpolation
   :language 'python
   :override t
   '((interpolation (identifier) @font-lock-variable-name-face))

   ...))
In python-ts-mode:
(treesit-parser-create 'python)
(setq-local treesit-font-lock-settings python--treesit-settings)
(setq-local treesit-font-lock-feature-list
            '(( comment definition)
              ( keyword string type)
              ( assignment builtin constant decorator
                escape-sequence number property string-interpolation )
              ( bracket delimiter function operator variable)))
...
(treesit-major-mode-setup)
Concretely, something like this:
(define-derived-mode python-ts-mode python-base-mode "Python"
  "Major mode for editing Python files, using tree-sitter library.

\\{python-ts-mode-map}"
  :syntax-table python-mode-syntax-table
  (when (treesit-ready-p 'python)
    (treesit-parser-create 'python)
    (setq-local treesit-font-lock-feature-list
                '(( comment definition)
                  ( keyword string type)
                  ( assignment builtin constant decorator
                    escape-sequence number property string-interpolation )
                  ( bracket delimiter function operator variable)))
    (setq-local treesit-font-lock-settings python--treesit-settings)
    (setq-local imenu-create-index-function
                #'python-imenu-treesit-create-index)
    (setq-local treesit-defun-type-regexp (rx (or "function" "class")
                                              "_definition"))
    (setq-local treesit-defun-name-function
                #'python--treesit-defun-name)
    (treesit-major-mode-setup)

    (when python-indent-guess-indent-offset
      (python-indent-guess-indent-offset))))
Indentation
Indentation works like this: We have a bunch of rules that look like
(MATCHER ANCHOR OFFSET)
When indenting a line, let NODE be the node at the beginning of the current line. We pass this node to the MATCHER of each rule in turn, and the first rule whose MATCHER matches the node (e.g., “this node is a closing bracket!”) is used. Then we pass the node to that rule’s ANCHOR, which returns a point (e.g., the beginning of NODE’s parent). We find the column number of that point (e.g., 4), add OFFSET to it (e.g., 0), and that is the column we want to indent the current line to (4 + 0 = 4).
Matchers and anchors are functions that take (NODE PARENT BOL &rest _). Matchers return nil/non-nil for no match/match, and anchors return the anchor point. An offset is usually a number or a variable, but it can also be a function. Below are some convenient built-in matchers and anchors.
For MATCHER we have
(parent-is TYPE) => matches if PARENT’s type matches TYPE as regexp
(node-is TYPE) => matches NODE’s type
(query QUERY) => matches if querying PARENT with QUERY captures NODE.
(match NODE-TYPE PARENT-TYPE NODE-FIELD NODE-INDEX-MIN NODE-INDEX-MAX)
=> checks everything. If an argument is nil, don’t match that. E.g., (match nil TYPE) is the same as (parent-is TYPE)
For ANCHOR we have
first-sibling => start of the first sibling
parent => start of parent
parent-bol => BOL of the line parent is on.
prev-sibling => start of previous sibling
no-indent => current position (don’t indent)
prev-line => start of previous line
There is also a manual section for indent: Parser-based Indentation.
When writing indent rules, you can use treesit-check-indent to check if your indentation is correct. To debug what went wrong, set treesit--indent-verbose to t. Then when you indent, Emacs tells you which rule is applied in the echo area.
Here is an example:
(defvar typescript-mode-indent-rules
  (let ((offset 'typescript-indent-offset))
    `((typescript
       ;; This rule matches if node at point is ")", ANCHOR is the
       ;; parent node’s BOL, and offset is 0.
       ((node-is ")") parent-bol 0)
       ((node-is "]") parent-bol 0)
       ((node-is ">") parent-bol 0)
       ((node-is "\\.") parent-bol ,offset)
       ((parent-is "ternary_expression") parent-bol ,offset)
       ((parent-is "named_imports") parent-bol ,offset)
       ((parent-is "statement_block") parent-bol ,offset)
       ((parent-is "type_arguments") parent-bol ,offset)
       ((parent-is "variable_declarator") parent-bol ,offset)
       ((parent-is "arguments") parent-bol ,offset)
       ((parent-is "array") parent-bol ,offset)
       ((parent-is "formal_parameters") parent-bol ,offset)
       ((parent-is "template_substitution") parent-bol ,offset)
       ((parent-is "object_pattern") parent-bol ,offset)
       ((parent-is "object") parent-bol ,offset)
       ((parent-is "object_type") parent-bol ,offset)
       ((parent-is "enum_body") parent-bol ,offset)
       ((parent-is "arrow_function") parent-bol ,offset)
       ((parent-is "parenthesized_expression") parent-bol ,offset)
       ...))))
Then you set treesit-simple-indent-rules to your rules, and call treesit-major-mode-setup.
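When the built-in matchers don't cover a case, you can write your own with the (NODE PARENT BOL &rest _) signature described earlier. The sketch below is illustrative only: my-lang, the "block" node type, and the offset of 4 are made-up names and values, not taken from any real grammar.

```elisp
(require 'treesit)

(defun my-lang--closing-bracket-p (node _parent _bol &rest _)
  "Return non-nil if NODE is a closing bracket.
NODE can be nil (for example on an empty line), so guard against that."
  (and node
       (member (treesit-node-type node) '(")" "]" "}"))))

(defvar my-lang--indent-rules
  `((my-lang
     ;; Custom matcher: align a closing bracket with the beginning of
     ;; the line its parent starts on.
     (my-lang--closing-bracket-p parent-bol 0)
     ;; Built-in matcher: indent children of a block by 4 columns.
     ((parent-is "block") parent-bol 4)))
  "Indent rules for the hypothetical `my-lang' grammar.")
```

In the major mode body you would then use (setq-local treesit-simple-indent-rules my-lang--indent-rules) before calling treesit-major-mode-setup.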
Imenu
Set treesit-simple-imenu-settings and call treesit-major-mode-setup.
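Each Imenu setting is a list of the form (CATEGORY REGEXP PRED NAME-FN), where REGEXP matches node types. Here is a rough sketch for Python; the mode name my-python-outline-ts-mode and the name-extracting lambda are made up for illustration, and the node types come from the Python grammar.

```elisp
(require 'treesit)

(define-derived-mode my-python-outline-ts-mode prog-mode "PyOutline"
  "Hypothetical minimal mode showing tree-sitter Imenu setup."
  (when (treesit-ready-p 'python)
    (treesit-parser-create 'python)
    ;; One Imenu category per kind of definition.  A nil PRED accepts
    ;; every matching node; a nil NAME-FN falls back to the defun name
    ;; function set below.
    (setq-local treesit-simple-imenu-settings
                '(("Function" "\\`function_definition\\'" nil nil)
                  ("Class" "\\`class_definition\\'" nil nil)))
    ;; Use the text of the definition's name field as its name.
    (setq-local treesit-defun-name-function
                (lambda (node)
                  (treesit-node-text
                   (treesit-node-child-by-field-name node "name") t)))
    (treesit-major-mode-setup)))
```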
Navigation
Set treesit-defun-type-regexp and treesit-defun-name-function, and call treesit-major-mode-setup.
C-like languages
[Update: Common functions described in this section have been moved from c-ts-mode.el to c-ts-common.el. I also made some changes to the functions and variables themselves.]
c-ts-common.el has some goodies for handling indenting and filling block comments.
These two rules should take care of indenting block comments.
((and (parent-is "comment") c-ts-common-looking-at-star)
 c-ts-common-comment-start-after-first-star -1)
((parent-is "comment") prev-adaptive-prefix 0)
standalone-parent should be enough for most of the cases where you want to "indent one level further", for example, a statement inside a block. Normally standalone-parent returns the parent’s start position as the anchor, but if the parent doesn’t start on its own line, it returns the parent’s parent instead, and so on and so forth. This works pretty well in practice. For example, indentation rules for statements and brackets would look like:
;; Statements in {} block.
((parent-is "compound_statement") standalone-parent x-mode-indent-offset)
;; Closing bracket.
((node-is "}") standalone-parent x-mode-indent-offset)
;; Opening bracket.
((node-is "compound_statement") standalone-parent x-mode-indent-offset)
You’ll need additional rules for “bracketless” if/for/while statements, e.g.,
if (true)
  return 0;
else
  return 1;
You need rules like these:
((parent-is "if_statement") standalone-parent x-mode-indent-offset)
Finally, c-ts-common-comment-setup will set up comment and filling for you.
Multi-language modes
Refer to the manual: Multiple Languages.
Common Tasks
M-x shortdoc RET treesit RET will give you a complete list.
How to...
Get the buffer text corresponding to a node?
(treesit-node-text node)
Don’t confuse this with treesit-node-string.
Scan the whole tree for stuff?
(treesit-search-subtree)
(treesit-search-forward)
(treesit-induce-sparse-tree)
Find/move to the next node that...?
(treesit-search-forward node ...)
(treesit-search-forward-goto node ...)
Get the root node?
(treesit-buffer-root-node)
Get the node at point?
(treesit-node-at (point))
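Putting a few of these together: the sketch below parses a small Python snippet in a temporary buffer, grabs the root node and the node at the start of the buffer, then searches the tree for the first function definition. It assumes the Python grammar is installed and returns nil otherwise; the sample source text is arbitrary.

```elisp
(require 'treesit)

(with-temp-buffer
  (insert "def greet():\n    return 1\n")
  (when (treesit-ready-p 'python t)
    (treesit-parser-create 'python)
    (let* ((root (treesit-buffer-root-node))
           ;; The smallest node at the very start of the buffer.
           (first-node (treesit-node-at (point-min)))
           ;; A string predicate is matched against node types as a regexp.
           (defun-node (treesit-search-subtree root "function_definition")))
      ;; Collect the node types and the text of the definition found.
      (list (treesit-node-type root)
            (treesit-node-type first-node)
            (treesit-node-text defun-node t)))))
```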