Tree-sitter lets us pattern match with a small declarative language. Pattern matching consists of two steps: first tree-sitter matches a pattern against nodes in the syntax tree, then it captures specific nodes in that pattern and returns the captured nodes.
We first describe how to write the most basic query pattern and how to capture nodes in a pattern, then the pattern-matching functions, and finally the more advanced pattern syntax.
A query consists of multiple patterns. Each pattern is an s-expression that matches a certain node in the syntax tree. A pattern has the following shape:
(type child...)
For example, a pattern that matches a binary_expression node that contains number_literal child nodes would look like
(binary_expression (number_literal))
To capture a node in the query pattern above, append @capture-name after the node pattern you want to capture. For example,
(binary_expression (number_literal) @number-in-exp)
captures number_literal nodes that are inside a binary_expression node with capture name number-in-exp.
We can capture the binary_expression node too, with capture name biexp:
(binary_expression (number_literal) @number-in-exp) @biexp
Now we can introduce the query functions.
This function matches patterns in query in node. Argument query can be either a string or an s-expression. For now, we focus on the string syntax; s-expression syntax is described at the end of the section.
The function returns all captured nodes in a list of (capture_name . node). If beg and end are both non-nil, it only pattern matches nodes in that range.
This function raises a tree-sitter-query-error if query is malformed. The signal data contains a description of the specific error.
This function matches patterns in query in source, and returns all captured nodes in a list of (capture_name . node). If beg and end are both non-nil, it only pattern matches nodes in that range.
Argument source designates a node; it can be a language symbol, a parser, or simply a node. If it is a language symbol, source represents the root node of the first parser for that language in the current buffer; if it is a parser, source represents the root node of that parser.
This function also raises tree-sitter-query-error.
For example, suppose node’s content is 1 + 2, and query is
(setq query "(binary_expression (number_literal) @number-in-exp) @biexp")
Running that query would return
(tree-sitter-query-capture node query)
    ⇒ ((biexp . <node for "1 + 2">)
       (number-in-exp . <node for "1">)
       (number-in-exp . <node for "2">))
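Since the result is an ordinary list of pairs, it can be processed with the usual list functions. The following is only a minimal sketch: the helper name my-captures-named is hypothetical, placeholder symbols stand in for real node objects, and it assumes capture names are returned as symbols, as the printed result above suggests.

```elisp
(require 'seq)

(defun my-captures-named (name captures)
  "Return the nodes in CAPTURES whose capture name is NAME.
CAPTURES is a list of (CAPTURE-NAME . NODE) pairs.  This helper is
hypothetical and not part of the tree-sitter API."
  (mapcar #'cdr
          (seq-filter (lambda (capture) (eq (car capture) name))
                      captures)))

;; Placeholder symbols (node-a, node-b, node-c) stand in for node objects.
(my-captures-named 'number-in-exp
                   '((biexp . node-a)
                     (number-in-exp . node-b)
                     (number-in-exp . node-c)))
```

With real data, the second argument would be the value returned by the query function rather than a quoted list.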
As we mentioned earlier, a query could contain multiple patterns. For example, it could have two top-level patterns:
(setq query "(binary_expression) @biexp (number_literal) @number @biexp")
This function parses string with language, pattern matches its root node with query, and returns the result.
Besides node types and captures, tree-sitter’s query syntax can express anonymous nodes, field names, wildcards, quantification, grouping, alternation, anchors, and predicates.
An anonymous node is written verbatim, surrounded by quotes. A pattern matching (and capturing) the keyword return would be
"return" @keyword
In a query pattern, ‘(_)’ matches any named node, and ‘_’ matches any named or anonymous node. For example, to capture any named child of a binary_expression node, the pattern would be
(binary_expression (_) @in_biexp)
We can capture child nodes that have specific field names:
(function_definition declarator: (_) @func-declarator body: (_) @func-body)
We can also capture a node that doesn’t have a certain field, say, a function_definition without a body field:
(function_definition !body) @func-no-body
Tree-sitter recognizes quantification operators ‘*’, ‘+’ and ‘?’. Their meanings are the same as in regular expressions: ‘*’ matches the preceding pattern zero or more times, ‘+’ matches one or more times, and ‘?’ matches zero or one time.
For example, this pattern matches type_declaration nodes that have zero or more long keywords:
(type_declaration "long"* @long-in-type)
And this pattern matches a type declaration that has zero or one long keyword:
(type_declaration "long"?) @type-decl
Similar to groups in regular expressions, we can bundle patterns into a group and apply quantification operators to it. For example, to express a comma-separated list of identifiers, one could write
(identifier) ("," (identifier))*
Again, similar to regular expressions, we can express “match any one of this group of patterns” in the query pattern. The syntax is a list of patterns enclosed in square brackets. For example, to capture some keywords in C, the query pattern would be
[ "return" "break" "if" "else" ] @keyword
The anchor operator ‘.’ can be used to enforce juxtaposition, i.e., to require two things to be directly next to each other. The two “things” can be two nodes, or a child and the beginning or end of its parent. For example, to capture the first child, the last child, or two adjacent children:
;; Anchor the child with the end of its parent.
(compound_expression (_) @last-child .)

;; Anchor the child with the beginning of its parent.
(compound_expression . (_) @first-child)

;; Anchor two adjacent children.
(compound_expression (_) @prev-child . (_) @next-child)
Note that the enforcement of juxtaposition ignores any anonymous nodes.
We can add predicate constraints to a pattern. For example, if we use the following query pattern
( (array . (_) @first (_) @last .) (#equal @first @last) )
Then tree-sitter only matches arrays where the first element equals the last element. To attach a predicate to a pattern, we need to group them together. A predicate always starts with a ‘#’.
Currently there are two predicates, #equal and #match.
Matches if arg1 equals arg2. Each argument can be either a string or a capture name. Capture names represent the text that the captured node spans in the buffer.
Matches if the text that capture-name’s node spans in the buffer matches regular expression regexp. Matching is case-sensitive.
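To make the case-sensitivity concrete, here is a small sketch of the check this description implies, written with ordinary string functions. The name my-match-predicate-p is hypothetical, the function takes the node text as a plain string, and it assumes the regular expression may match anywhere in that text; it is an illustration, not how the predicate is actually implemented.

```elisp
(defun my-match-predicate-p (regexp text)
  "Return t if REGEXP matches somewhere in TEXT, case-sensitively.
Hypothetical illustration only; TEXT stands in for the text that a
captured node spans in the buffer."
  ;; Binding `case-fold-search' to nil makes the match case-sensitive.
  (let ((case-fold-search nil))
    (if (string-match-p regexp text) t nil)))

(my-match-predicate-p "love" "lovely")   ; matches
(my-match-predicate-p "love" "LOVELY")   ; does not match: case differs
```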
Note that a predicate can only refer to capture names that appear in the same pattern. Indeed, it makes little sense to refer to capture names in other patterns anyway.
Besides strings, Emacs provides an s-expression-based syntax for query patterns. It largely resembles the string-based syntax. For example, the following pattern
(tree-sitter-query-capture
 node "(addition_expression
        left: (_) @left
        \"+\" @plus-sign
        right: (_) @right) @addition

        [\"return\" \"break\"] @keyword")
is equivalent to
(tree-sitter-query-capture
 node '((addition_expression
         left: (_) @left
         "+" @plus-sign
         right: (_) @right) @addition

        ["return" "break"] @keyword))
Most pattern syntax can be written directly as strange but nevertheless valid s-expressions. Only a few of them need modification:
Anchor ‘.’ is written as :anchor.
Quantifiers ‘?’, ‘*’ and ‘+’ are written as ‘:?’, ‘:*’ and ‘:+’.
#equal is written as :equal. In general, predicates change their ‘#’ to ‘:’.
For example,
"( (compound_expression . (_) @first (_)* @rest) (#match \"love\" @first) )"
is written in s-expression as
'(( (compound_expression :anchor (_) @first (_) :* @rest) (:match "love" @first) ))
This function expands the s-expression query into a string query. It is usually a good idea to expand the s-expression patterns into strings for font-lock queries since they are called repeatedly.
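As a rough illustration of the token-level substitutions listed earlier, the sketch below maps a single s-expression token to its string-syntax spelling. It is a toy under stated assumptions: the name my-query-token-to-string is hypothetical, it handles atoms only (no nested lists or vectors), and it is not the expansion function described above.

```elisp
(defun my-query-token-to-string (token)
  "Return the string-syntax spelling of one s-expression query TOKEN.
Handles atoms only.  Hypothetical toy, not the real expansion function."
  (cond
   ;; The anchor keyword becomes a dot.
   ((eq token :anchor) ".")
   ;; Quantifier keywords drop their leading colon.
   ((memq token '(:? :* :+))
    (substring (symbol-name token) 1))
   ;; Any other keyword is treated as a predicate: `:' becomes `#'.
   ((keywordp token)
    (concat "#" (substring (symbol-name token) 1)))
   ;; Strings (anonymous nodes, predicate arguments) keep their quotes.
   ((stringp token) (prin1-to-string token))
   ;; Node types, field names and capture names are written as-is.
   ((symbolp token) (symbol-name token))
   (t (error "Unsupported token: %S" token))))

(mapcar #'my-query-token-to-string '(:anchor :* :match "love" @first))
```

A full converter would also have to walk nested lists and vectors and emit the matching parentheses and square brackets, which this sketch deliberately leaves out.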
The tree-sitter project’s documentation about pattern matching can be found at https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries.