Tree-sitter Changes in Emacs 30

A year has passed since the release of Emacs 29; last time we added support for tree-sitter and several tree-sitter-based major modes. This time, there are more major modes, better support for multi-language modes, and more utility features.

The first three sections introduce changes visible to end-users, the rest are for package and major mode developers.

Derived mode check

Now (derived-mode-p 'c-mode) returns t even in c-ts-mode (and similarly for other builtin tree-sitter modes). That means .dir-locals.el settings and yasnippets for c-mode will work for c-ts-mode too. However, c-ts-mode still doesn’t run c-mode’s major mode hooks. Also, there’s still no major mode fallback. But I think that can be solved by packages like treesit-auto.

This new inheritance doesn’t come automatically, someone needs to use derived-mode-add-parents to add the relationship.

New major modes

There are some new built-in major modes: Elixir and heex mode, html mode, Lua mode, php mode with phpdoc support, and Doxygen support for C/C++/Java mode. Kudos to Wilhelm for writing Elixir and heex mode, John for writing Lua mode, and Vincenzo for writing php mode and Doxygen support!

heex mode and php mode really shows the power of tree-sitter: without tree-sitter, it would take a lot of work to write a major mode for mixed languages like these; now tree-sitter takes care of all the hard work, and we can focus on writing the things we care about: font-lock and indentation rules, utility commands, etc.

When Wilhelm and Vincenzo were implementing multi-language major modes, they found bugs and missing features in Emacs and provided invaluable feedback on emacs-devel and the bug tracker. Their feedback and requests allow us to improve Emacs’ support for multi-languages. So if you’re writing a major mode or some package with tree-sitter and run into issues, don’t hesitate to reach out on emacs-devel or the bug tracker!

Sexp movement

I’ll explain it a bit more in the next section, but the gist is that forward-sexp and backward-sexp can now use the parse tree for navigation, as long as the major mode adds support for them. Users can also change what’s considered a sexp (A statement? An expression? Or any node in the parse tree?) themselves, overriding the major mode’s setting.

Defining things

Sections below are mostly for developers.

In the spirt of thing-at-point, a major mode or user can now define tree-sitter things: defun, sexp, sentence, comment, text, block, etc. The definition is flexible: it can be a regexp matching node names, or a predicate function, or a regexp plus a predicate. It can also be defined with logical operands not and or, like (not sexp), or (not "comment"), (or comment text).

At the moment, the following “standard” things are used by Emacs:

Like font-lock features, we’re starting with a basic list; if you have suggestions fore more things (perhaps you wrote a package that uses a thing that major modes should support), reach out on emacs-devel or debbugs.

Tree-sitter things are supported in every tree-sitter function . Once the major mode defines it, everyone can use it. Here are some things you can do with it:

Get the sexp at point in any tree-sitter major mode1: (treesit-thing-at-point 'sexp). Get the sexp before point: (treesit-thing-prev (point) 'sexp).

Provided that the mode defined the thing.

Generate a tree of all the defuns in a buffer:

(treesit-induce-sparse-tree
 (treesit-buffer-root-node)
 'defun)

Traverse things:

I can also see packages reserving a particular thing, and have major modes add definition for that thing. In that case, it’s best to add the package prefix to avoid naming conflict.

Local parsers

Normally, even for the embedded language, there’s only one parser for that language in a buffer. Each individual embedded code block are “stitched together” and is parsed as a whole by that parser. The pro is we only need to create one parser, the cons are error in one code block might affect other code blocks, and sometimes, each code block is syntactically self-contained and shouldn’t be stitched with others.

That’s why we added local parsers, with each one confined to a single code block. Emacs creates and manages parsers for each embedded code block automatically. phpdoc and Doxygen support are possible thanks to local parsers. To use local parsers, simply add the :local t flag in treesit-range-rules, and Emacs handles the rest.

Other changes

A small convenience improvement: treesit-font-lock-rules now supports the :default-language keyword, so major mode author don’t need to write :language 'xxx for every query anymore.

Each parser in the parser list now has a tag. By default, a parser has the nil tag, and (treesit-parser-list) returns all the parsers with nil tag (because the third optional argument TAG defaults to nil). That means if you don’t explicitly set a tag when creating a parser, it’ll show up when anyone calls (treesit-parser-list). On the other hand, you can create a parser that doesn’t show up in the parser list if you give it a non-nil tag. The intended use-case is to create special purpose parsers that shouldn’t normally appear in the parser list.

Local parsers has the embedded tag, so they don’t appear in the parser list. You can get them by passing embedded to the TAG argument, or by passing the special value t to the TAG argument, which means return all parsers regardless of their tag.

There’s a new variable treesit-language-remap-alist. If a language A is mapped to another language B in this alist, creating a parser of A actually uses the grammar of B. For example, if someone wants to write a major mode for tree-sitter-cuda, which extends upon tree-sitter-cpp, they can map cpp to cuda, so the font-lock rules and indentation rules defined in c++-ts-mode can be borrowed to cuda mode verbatim.

Indirect buffers now gets individual parser lists. In Emacs 29, the origin buffer and all its indirect buffers share the same parser list. Now they each have their own parser list.

Better filling for C-style comment blocks

This is not directly related to tree-sitter but it affects tree-sitter modes for all C-like languages. You see, all these tree-sitter major modes (C, C++, Java, Rust, Javascript, Typescript) uses C-style comment blocks, and they all use c-ts-common.el for things like filling the comment block, or setting up comment-start, etc.

Traditionally these kind of major modes use cc-mode’s utilities, but cc-mode is a beast on its own, and it’s not worth it to add that dependency for filling a comment block. (It’s not just code dependency, but also cc-mode’s own parsing facility, data structure, etc.) So we had to recreate these utilities in c-ts-common.el, with the bonus goal of keeping the code as easy to read as possible.

Filling C-style comment block is harder than one might imagine. It’s quite involved and interesting, and worth a separate article on its own. Suffice to say that the filling logic is improved and works on even more styles of C comment blocks now. Below are a few among the ones that we support.

/* xxx  /**    /* xxx
 * xxx   * xxx    xxx
 */      */       xxx */

/*======  /*
 * xxx     | xxx
 *======/  */

And it goes beyond just filling, when you type return in a comment block, you expect the next line to be prefixed with some character (* or | or space) and indented to the right place. Making that work for all those styles on top of the filling and keeping the code reasonably readable is a small miracle :-)

Primary parser

If you are the author of a tree-sitter major mode, make sure to set treesit-primary-parser in your major mode if it has multiple languages! This is a new variable added in Emacs 30, and setting it is vital for font-lock update to work properly in complex situations. Emacs makes a reasonable guess when the major mode doesn’t set it themselves (it sets the first parser in the parser list as the primary parser). But this guess doesn’t work reliably for multi-language major modes.

Besides Emacs itself, other packages can also make use of this variable. It’ll be better than (car (treesit-parser-list)), especially in multi-language modes.

Having an explicit primary parser allows Emacs to update the “changed region” after each buffer change correctly, especially for multi-language modes. For example, when the user types the closing block comment delimiter */, not only does Emacs fontify the */ itself, it also needs to re-fontify the whole block comment, which previously weren’t fontified in comment face due to incomplete parse tree. You can read more about it in treesit--font-lock-mark-ranges-to-fontify and treesit--pre-redisplay.

Ready your major mode for Emacs 30

Here’s a check list:

  1. Define treesit-primary-parser.
  2. Define things in treesit-thing-settings, especially sexp.

Err, that list is shorter than I thought. But I do have some more words for sexp thing.

There are multiple ways of defining the sexp thing, you can define it to any node (excluding some punctuation marks), or repeatable node (function arguments, list elements, statements, blocks, defun), or a hand-crafted list of nodes.

Defining sexp as every node (excluding punctuation) could be a good starting point. For example, this is the definition for sexp in c-ts-mode:

(not ,(rx (or "{" "}" "[" "]" "(" ")" ",")))

This way, if point is at the beginning of any thing, C-M-f will bring me to the end of that thing, be it an expression, statement, function, block, or whatever. I use it all the time and it’s very handy for selecting code.

A slight upgrade2 from this is to define sexp to anything that’s repeatable. That takes a bit more effort but C-M-f will always move you to the end of a repeatable construct. This is more inline with the concept of sexp, where we consider each repeatable construct in the code as an atom.

This is just my personal opinion, of course. “Everything is a sexp” might as well be better, since you can move over more types of things.

Emacs 31

At this point we have pretty good support for writing major modes with tree-sitter. Many tree-sitter major modes and packages appeared after Emacs 29 and it’s very encouraging. We’ll continue making it easier to write major modes with tree-sitter, and make it easier to use and configure tree-sitter modes. For example, we’ll add a baseline indentation rule, so major mode authors need to write less indentation rules. And there’re some upgrades to the sexp movement, too.

There are still some unsolved issues. The lack of versioning for language grammars breaks major modes from time to time; installing tree-sitter grammar is not very easy; tree-sitter library still has bugs that results in incorrect parse tree or even causes Emacs to hang. These will be resolved, albeit slowly.

That’s about it! Emacs has been making good progress regarding tree-sitter. And as I said last time, tree-sitter is a really good way to start contributing to Emacs. We’ve seen folks adding their tree-sitter modes into Emacs, you could be the next! Also, many existing builtin major modes lacks utiliy functions that usually come with a major mode. If you see missing feature in a mode, feel free to send a patch!

Ok folks, stay tuned for the next update for Emacs 31, and feel free to reach out in the meantime!