Contents

lexer

Performs lexing of Scintilla documents.

Overview

Dynamic lexers are more flexible than Scintilla’s static ones, and they are often more readable as well. This document provides all the information necessary to write a new lexer. For illustrative purposes, a Lua lexer will be created. Lexers are written using Parsing Expression Grammars, or PEGs, with the Lua LPeg library. Please familiarize yourself with LPeg’s documentation before proceeding.

Writing a Lexer

Rather than writing a lexer from scratch, first see if your language is similar to any of the 80+ languages supported. If so, you can copy and modify that lexer, saving some time and effort.

Introduction

All lexers are contained in the lexers/ directory. To begin, create a Lua script with the name of your lexer and open it for editing.

$> cd lexers
$> textadept lua.lua

Inside the file, the lexer should look like the following:

-- Lua LPeg lexer

local l = lexer
local token, word_match = l.token, l.word_match
local P, R, S, V = lpeg.P, lpeg.R, lpeg.S, lpeg.V

local M = { _NAME = 'lua' }

-- Lexer code goes here.

return M

where the value of _NAME should be replaced with your lexer’s name.

Like most Lua modules, the lexer module will store all of its data in a table M so as not to clutter the global namespace with lexer-specific patterns and variables. Therefore, remember to use the prefix M when declaring and using non-local variables. Also, do not forget the return M at the end.

The local variables above give easy access to the many useful functions available for creating lexers.

Language Structure

It is important to spend some time considering the structure of the language you are creating the lexer for. What kinds of tokens does it have? Comments, strings, keywords, etc.? Lua has 9 tokens: whitespace, comments, strings, numbers, keywords, functions, constants, identifiers, and operators.

Tokens

In a lexer, a token is composed of a token type paired with an LPeg pattern that matches it. Tokens are created using the token() function. The lexer (l) module provides a number of default token types: CLASS, COMMENT, CONSTANT, DEFAULT, ERROR, FUNCTION, IDENTIFIER, KEYWORD, LABEL, NUMBER, OPERATOR, PREPROCESSOR, REGEX, STRING, TYPE, VARIABLE, and WHITESPACE (each is described in the Fields section below).

Please note that you are not limited to just these token types; you can create your own. If you do, you will have to specify how they are colored. The procedure is discussed later.
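
For instance, a minimal sketch of a custom token type (the name 'pseudo_keyword' and the pattern are hypothetical, not part of the actual Lua lexer):

-- A hypothetical custom token; it will need a style assigned in
-- M._tokenstyles later (see Styling Tokens).
local pseudo_keyword = token('pseudo_keyword', P('self'))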

A whitespace token typically looks like:

local ws = token(l.WHITESPACE, S('\t\v\f\n\r ')^1)

It is difficult to remember that a space character is either a \t, \v, \f, \n, \r, or a space. The lexer module also provides you with shortcuts for this and many other character sequences: alnum, alpha, any, ascii, cntrl, dec_num, digit, extend, float, graph, hex_num, integer, lower, newline, nonnewline, nonnewline_esc, oct_num, print, punct, space, upper, word, and xdigit (each is described in the Fields section below).

The above whitespace token can be rewritten more simply as:

local ws = token(l.WHITESPACE, l.space^1)

The next Lua token is a comment. Short comments beginning with -- are easy to express with LPeg:

local line_comment = '--' * l.nonnewline^0

On the other hand, long comments are more difficult to express because they have levels. See the Lua Reference Manual for more information. As a result, a functional pattern is necessary:

local longstring = #('[[' + ('[' * P('=')^0 * '[')) *
  P(function(input, index)
    local level = input:match('^%[(=*)%[', index)
    if level then
      local _, stop = input:find(']'..level..']', index, true)
      return stop and stop + 1 or #input + 1
    end
  end)
local block_comment = '--' * longstring

The token for a comment is then:

local comment = token(l.COMMENT, line_comment + block_comment)

It is worth noting that while token names are arbitrary, you are encouraged to use the ones listed in the tokens table because a standard color theme is applied to them. If you wish to create a unique token, that is no problem; you can specify how it will be displayed later on.

Lua strings should be easy to express because they are just characters surrounded by ' or " characters, right? Not quite. Lua strings contain escape sequences (\char), so a \' sequence in a single-quoted string does not indicate the end of the string and must be handled appropriately. Fortunately, this is a common occurrence in many programming languages, so a convenient function is provided: delimited_range().

local sq_str = l.delimited_range("'", '\\', true)
local dq_str = l.delimited_range('"', '\\', true)

Lua also has multi-line strings, but they have the same format as block comments. All of these strings can be combined into a single token:

local string = token(l.STRING, sq_str + dq_str + longstring)

Numbers are easy in Lua using lexer’s predefined patterns.

local lua_integer = P('-')^-1 * (l.hex_num + l.dec_num)
local number = token(l.NUMBER, l.float + lua_integer)

Keep in mind that the predefined patterns may not be completely accurate for your language, so you may have to create your own variants. In the above case, Lua integers do not have octal sequences, so the l.integer pattern is not used.

Depending on the number of keywords in a particular language, a simple P(keyword1) + P(keyword2) + ... + P(keywordN) pattern can get quite large. In fact, LPeg has a limit on pattern size. Also, if the keywords are not case-sensitive, additional complexity arises, so a better approach is necessary. Once again, lexer has a shortcut function: word_match().

local keyword = token(l.KEYWORD, word_match {
  'and', 'break', 'do', 'else', 'elseif', 'end', 'false', 'for',
  'function', 'goto', 'if', 'in', 'local', 'nil', 'not', 'or', 'repeat',
  'return', 'then', 'true', 'until', 'while'
})

If keywords were case-insensitive, an additional parameter would be specified in the call to word_match(); no other action is needed.
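
For example, a hedged sketch using word_match()’s third parameter (its signature, word_match(words, word_chars, case_insensitive), is documented in the Functions section below):

-- Hypothetical: case-insensitive keyword matching via word_match()'s
-- third parameter; the second (extra word characters) is nil here.
local keyword_ci = token(l.KEYWORD, word_match({
  'and', 'break', 'do' -- and so on
}, nil, true))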

Lua functions and constants are specified like keywords:

local func = token(l.FUNCTION, word_match {
  'assert', 'collectgarbage', 'dofile', 'error', 'getmetatable',
  'ipairs', 'load', 'loadfile', 'next', 'pairs', 'pcall', 'print',
  'rawequal', 'rawget', 'rawlen', 'rawset', 'require', 'setmetatable',
  'tonumber', 'tostring', 'type', 'xpcall'
})

local constant = token(l.CONSTANT, word_match {
  '_G', '_VERSION'
})

Like most programming languages, Lua allows the usual characters in identifier names (variables, functions, etc.), so the usual l.word pattern can be used:

local identifier = token(l.IDENTIFIER, l.word)

Lua has labels too:

local label = token(l.LABEL, '::' * l.word * '::')

Finally, an operator character is one of the following:

local operator = token(l.OPERATOR, '~=' + S('+-*/%^#=<>;:,.{}[]()'))

Rules

Rules are just combinations of tokens. In Lua, all rules consist of a single token, but other languages may have two or more tokens in a rule. For example, an HTML tag consists of an element token followed by an optional set of attribute tokens, which allows each part of the tag to be colored distinctly.
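
As an illustration only (the element, attribute, ws, and tag_rule names are hypothetical and not part of any particular lexer), such a multi-token rule might be sketched as:

-- A hypothetical two-token rule for an HTML-like tag: an element token
-- followed by zero or more attribute tokens, each styled separately.
local ws = token(l.WHITESPACE, l.space^1)
local element = token('element', l.word)
local attribute = token('attribute', l.word * '=' * l.delimited_range('"', '\\', true))
local tag_rule = element * (ws * attribute)^0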

The set of rules that comprises Lua is specified in an M._rules table in the lexer.

M._rules = {
  { 'whitespace', ws },
  { 'keyword', keyword },
  { 'function', func },
  { 'constant', constant },
  { 'identifier', identifier },
  { 'string', string },
  { 'comment', comment },
  { 'number', number },
  { 'label', label },
  { 'operator', operator },
  { 'any_char', l.any_char },
}

Each entry is a rule name and its associated pattern. Please note that the names of the rules can be completely different than the names of the tokens contained within them.

The order of the rules is important because of the nature of LPeg. LPeg tries to apply the first rule to the current position in the text it is matching. If there is a match, it colors that section appropriately and moves on. If there is not a match, it tries the next rule, and so on. Suppose instead that the identifier rule came before the keyword rule. Since every keyword satisfies the requirements for being an identifier, all keywords would be incorrectly colored as identifiers. This is why identifier is placed where it is in the M._rules table.

You might be wondering what that any_char is doing at the bottom of M._rules. Its purpose is to match anything not accounted for in the above rules. For example, suppose the ! character is in the input text. It will not be matched by any of the first 10 rules, so without any_char, the text would not match at all, and no coloring would occur. any_char matches one single character and moves on. It may be colored red (indicating a syntax error) if desired because it is a token, not just a pattern.
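
For instance, a hedged variant of that final rule (not what the stock Lua lexer uses) that gives unmatched characters the error token type:

-- Hypothetical: style any unmatched character as an error instead.
local any_error = token(l.ERROR, l.any)
-- Then use { 'any_char', any_error } as the last entry in M._rules.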

Summary

The above method of defining tokens and rules is sufficient for a majority of lexers. The lexer module provides many useful patterns and functions for constructing a working lexer quickly and efficiently. In most cases, the amount of knowledge of LPeg required to write a lexer is minimal.

As long as you use the default token types provided by lexer, you do not have to specify any coloring (or styling) information in the lexer; it is taken care of by the user’s color theme.

The rest of this document is devoted to more complex lexer techniques.

Styling Tokens

The term for coloring text is styling. Just like with predefined LPeg patterns in lexer, predefined styles are available.

Each style consists of a set of attributes, such as the font, its size, bold and italic flags, and foreground and background colors.

Styles are created with style(). For example:

-- style with default theme settings
local style_nothing = l.style { }

-- style with bold text with default theme font
local style_bold = l.style { bold = true }

-- style with bold italic text with default theme font
local style_bold_italic = l.style { bold = true, italic = true }

The style_bold_italic style can be rewritten in terms of style_bold:

local style_bold_italic = style_bold..{ italic = true }

In this way you can build on previously defined styles without having to rewrite them. Note the previous style is left unchanged.

Style colors are different from the #rrggbb RGB notation you may be familiar with. Instead, create a color using color().

local red = l.color('FF', '00', '00')
local green = l.color('00', 'FF', '00')
local blue = l.color('00', '00', 'FF')

The default set of colors varies depending on the color theme used. Please see the current theme for more information.

Finally, styles are assigned to tokens via an M._tokenstyles table in the lexer. Styles do not have to be assigned to the default tokens; that is done automatically. You only have to assign styles to tokens you create. For example:

local lua = token('lua', P('lua'))

-- ... other patterns and tokens ...

M._tokenstyles = {
  { 'lua', l.style_keyword },
}

Each entry is the token name the style is for and the style itself. The order of styles in M._tokenstyles does not matter.

For examples of how styles are created, please see the theme files in the lexers/themes/ folder.

Line Lexer

Sometimes it is advantageous to lex input text line by line rather than a chunk at a time. This is particularly true for diff, patch, or make files. Put

M._LEXBYLINE = true

somewhere in your lexer in order to do this.

Embedded Lexers

A particular advantage dynamic lexers have over static ones is that lexers can be embedded within one another with minimal effort. There are two kinds of embedded lexers: a parent lexer that embeds other child lexers in it, and a child lexer that embeds itself within a parent lexer.

Parent Lexer

An example of this kind of lexer is HTML with embedded CSS and Javascript. After creating the parent lexer, load the child lexers in it using lexer.load(). For example:

local css = l.load('css')

There needs to be a transition from the parent HTML lexer to the child CSS lexer. This is something of the form <style type="text/css">. Similarly, the transition from child to parent is </style>.

local css_start_rule = #(P('<') * P('style') * P(function(input, index)
  if input:find('^[^>]+type%s*=%s*(["\'])text/css%1', index) then
    return index
  end
end)) * tag
local css_end_rule = #(P('</') * P('style') * ws^0 * P('>')) * tag

where tag and ws have been previously defined in the HTML lexer.

Now the CSS lexer can be embedded using embed_lexer():

l.embed_lexer(M, css, css_start_rule, css_end_rule)

Remember M is the parent HTML lexer object. The lexer object is needed by embed_lexer().

The same procedure can be done for Javascript.

local js = l.load('javascript')

local js_start_rule = #(P('<') * P('script') * P(function(input, index)
  if input:find('^[^>]+type%s*=%s*(["\'])text/javascript%1', index) then
    return index
  end
end)) * tag
local js_end_rule = #('</' * P('script') * ws^0 * '>') * tag
l.embed_lexer(M, js, js_start_rule, js_end_rule)

Child Lexer

An example of this kind of lexer is PHP embedded in HTML. After creating the child lexer, load the parent lexer. As an example:

local html = l.load('hypertext')

Since HTML should be the main lexer (PHP is just a preprocessing language), the following statement changes the main lexer from PHP to HTML:

M._lexer = html

Like in the previous section, transitions from HTML to PHP and back are specified:

local php_start_rule = token('php_tag', '<?' * ('php' * l.space)^-1)
local php_end_rule = token('php_tag', '?>')

And PHP is embedded:

l.embed_lexer(html, M, php_start_rule, php_end_rule)

Code Folding

It is sometimes convenient to “fold”, or hide, blocks of text. These blocks can be functions, classes, comments, etc. A folder iterates over each line of input text and assigns a fold level to it. Certain lines can be specified as fold points that fold subsequent lines with a higher fold level.
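
For example, in the following Lua fragment (the levels shown are illustrative), the line containing function is a fold point for the more deeply nested lines after it:

function f()  -- fold level 0; fold point
  local x = 1 -- fold level 1
  return x    -- fold level 1
end           -- fold level 0; stop point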

Simple Folding

To specify the fold points of your lexer’s language, create a M._foldsymbols table of the following form:

M._foldsymbols = {
  _patterns = { 'patt1', 'patt2', ... },
  token1 = { ['fold_on'] = 1, ['stop_on'] = -1 },
  token2 = { ['fold_on'] = 1, ['stop_on'] = -1 },
  token3 = { ['fold_on'] = 1, ['stop_on'] = -1 },
  ...
}

Fold points must ultimately have a value of 1 and stop points must ultimately have a value of -1, so the value in the table may also be a function, as long as that function returns 1, -1, or 0. Such functions are passed the text being processed, the position of the start of the line containing the matched symbol, the line’s text, the position of the match within the line, and the matched symbol itself.
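
For instance, the module’s fold_line_comments() function (documented in the Functions section below) returns such a function. A hedged sketch that folds runs of consecutive line comments, as an alternative to the full table that follows:

-- A hypothetical _foldsymbols table that folds runs of consecutive '--'
-- line comments; '%-%-' must appear in _patterns to match the symbol.
M._foldsymbols = {
  _patterns = { '%-%-' },
  [l.COMMENT] = { ['--'] = l.fold_line_comments('--') }
}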

Lua folding would be implemented as follows:

M._foldsymbols = {
  _patterns = { '%l+', '[%({%)}%[%]]' },
  [l.KEYWORD] = {
    ['if'] = 1, ['do'] = 1, ['function'] = 1, ['end'] = -1,
    ['repeat'] = 1, ['until'] = -1
  },
  [l.COMMENT] = { ['['] = 1, [']'] = -1 },
  [l.OPERATOR] = { ['('] = 1, ['{'] = 1, [')'] = -1, ['}'] = -1 }
}

_patterns matches lower-case words and any brace character. These are the fold and stop points in Lua. If a lower-case word happens to be a keyword token and that word is if, do, function, or repeat, the line containing it is a fold point. If the word is end or until, the line is a stop point. Any unmatched parentheses or braces counted as operators are also fold points. Finally, unmatched brackets in comments are fold points in order to fold long (multi-line) comments.

Advanced Folding

If you need more granularity than M._foldsymbols, you can define your own fold function:

function M._fold(text, start_pos, start_line, start_level)

end

The function must return a table whose indices are line numbers and whose values are tables containing the fold level and optionally a fold flag.

The following Scintilla fold flags are available: SC_FOLDLEVELBASE, SC_FOLDLEVELHEADERFLAG, SC_FOLDLEVELNUMBERMASK, and SC_FOLDLEVELWHITEFLAG (each is described in the Fields section below).

Have your fold function iterate over each line, setting fold levels. You can use the get_style_at(), get_property(), get_fold_level(), and get_indent_amount() functions as necessary to determine the fold level for each line. The following example sets fold points by changes in indentation.

function M._fold(text, start_pos, start_line, start_level)
  local folds = {}
  local current_line = start_line
  local prev_level = start_level
  for indent, line in text:gmatch('([\t ]*)(.-)\r?\n') do
    if line ~= '' then
      local current_level = l.get_indent_amount(current_line)
      if current_level > prev_level then -- next level
        local i = current_line - 1
        while folds[i] and folds[i][2] == l.SC_FOLDLEVELWHITEFLAG do
          i = i - 1
        end
        if folds[i] then
          folds[i][2] = l.SC_FOLDLEVELHEADERFLAG -- low indent
        end
        folds[current_line] = { current_level } -- high indent
      elseif current_level < prev_level then -- prev level
        if folds[current_line - 1] then
          folds[current_line - 1][1] = prev_level -- high indent
        end
        folds[current_line] = { current_level } -- low indent
      else -- same level
        folds[current_line] = { prev_level }
      end
      prev_level = current_level
    else
      folds[current_line] = { prev_level, l.SC_FOLDLEVELWHITEFLAG }
    end
    current_line = current_line + 1
  end
  return folds
end

SciTE users note: do not use get_property() to read fold options from a .properties file, because SciTE is not set up to forward them to your lexer. Instead, provide options that can be set at the top of the lexer.
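
For example, a hedged sketch of such an option (the name and default are hypothetical):

-- Hypothetical user-editable option declared near the top of the lexer,
-- used instead of a .properties value that SciTE will not forward.
local FOLD_BY_INDENTATION = true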

Using with SciTE

Create a .properties file for your lexer and import it in either your SciTEUser.properties or SciTEGlobal.properties. The contents of the .properties file should contain:

file.patterns.[lexer_name]=[file_patterns]
lexer.$(file.patterns.[lexer_name])=[lexer_name]

where [lexer_name] is the name of your lexer (minus the .lua extension) and [file_patterns] is a set of file extensions matched to your lexer.
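
For instance, the entries for the Lua lexer might look like this (the file pattern is illustrative):

file.patterns.lua=*.lua
lexer.$(file.patterns.lua)=lua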

Please note any styling information in .properties files is ignored.

Using with Textadept

Put your lexer in your ~/.textadept/lexers/ directory so that it will not be overwritten when you upgrade Textadept. Lexers in this directory also override the default lexers. (A user lua lexer would be loaded instead of the default lua lexer, which is convenient if you wish to tweak a default lexer to your liking.) Do not forget to add a mime-type for your lexer.

Optimization

Lexers can usually be optimized for speed by re-arranging rules so that the most common tokens are recognized first. Keep in mind the issue raised earlier: if you put similar tokens like identifiers before keywords, the latter will not be styled correctly.
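
For example, a hedged reordering of the Lua rules table from earlier, on the assumption that comments are very common in the target code:

-- Hypothetical reordering: the comment rule is tried earlier on the
-- assumption that comments are common; keyword still precedes identifier.
M._rules = {
  { 'whitespace', ws },
  { 'comment', comment },
  { 'keyword', keyword },
  { 'function', func },
  { 'constant', constant },
  { 'identifier', identifier },
  { 'string', string },
  { 'number', number },
  { 'label', label },
  { 'operator', operator },
  { 'any_char', l.any_char },
}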

Troubleshooting

Errors in lexers can be tricky to debug. Lua errors are printed to STDERR and _G.print() statements in lexers are printed to STDOUT.
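
For example, a hypothetical trace inserted during development (the message is illustrative):

-- Hypothetical debugging aid: print progress to STDOUT from inside a
-- match-time function without consuming any input.
local trace = P(function(input, index)
  print('lexer reached position ' .. index)
  return index -- succeed without consuming anything
end)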

Limitations

True embedded preprocessor language highlighting is not available. For most cases this will not be noticed, but code like

<div id="<?php echo $id; ?>">

or

<div <?php if ($odd) { echo 'class="odd"'; } ?>>

will not highlight correctly.

Performance

There might be some slight overhead when initializing a lexer, but loading a file from disk into Scintilla is usually more expensive.

On modern computer systems, I see no difference in speed between LPeg lexers and Scintilla’s C++ ones.

Risks

Poorly written lexers can crash Scintilla, so unsaved data might be lost. However, these crashes have only been observed in early lexer development, when syntax or pattern errors are present. Once the lexer actually starts styling text (whether correctly or incorrectly does not matter), no crashes have occurred.

Acknowledgements

Thanks to Peter Odding for his lexer post on the Lua mailing list that inspired me, and of course thanks to Roberto Ierusalimschy for LPeg.


Fields


CLASS

Token type for class tokens.


COMMENT

Token type for comment tokens.


CONSTANT

Token type for constant tokens.


DEFAULT

Token type for default tokens.


ERROR

Token type for error tokens.


FUNCTION

Token type for function tokens.


IDENTIFIER

Token type for identifier tokens.


KEYWORD

Token type for keyword tokens.


LABEL

Token type for label tokens.


NUMBER

Token type for number tokens.


OPERATOR

Token type for operator tokens.


PREPROCESSOR

Token type for preprocessor tokens.


REGEX

Token type for regex tokens.


SC_FOLDLEVELBASE

The initial (root) fold level.


SC_FOLDLEVELHEADERFLAG

Flag indicating that the line is a fold point.


SC_FOLDLEVELNUMBERMASK

Flag used with SCI_GETFOLDLEVEL(line) to get the fold level of a line.


SC_FOLDLEVELWHITEFLAG

Flag indicating that the line is blank.


STRING

Token type for string tokens.


TYPE

Token type for type tokens.


VARIABLE

Token type for variable tokens.


WHITESPACE

Token type for whitespace tokens.


alnum

Matches any alphanumeric character (A-Z, a-z, 0-9).


alpha

Matches any alphabetic character (A-Z, a-z).


any

Matches any single character.


ascii

Matches any ASCII character (0..127).


cntrl

Matches any control character (0..31).


dec_num

Matches a decimal number.


digit

Matches any digit (0-9).


extend

Matches any ASCII extended character (0..255).


float

Matches a floating point number.


graph

Matches any graphical character (! to ~).


hex_num

Matches a hexadecimal number.


integer

Matches a decimal, hexadecimal, or octal number.


lower

Matches any lowercase character (a-z).


newline

Matches any newline characters.


nonnewline

Matches any non-newline character.


nonnewline_esc

Matches any non-newline character, including newlines escaped with \\.


oct_num

Matches an octal number.


print

Matches any printable character (space to ~).


punct

Matches any punctuation character not alphanumeric (! to /, : to @, [ to `, { to ~).


space

Matches any whitespace character (\t, \v, \f, \n, \r, space).


style_class

Style typically used for class definitions.


style_comment

Style typically used for code comments.


style_constant

Style typically used for constants.


style_definition

Style typically used for definitions.


style_embedded

Style typically used for embedded code.


style_error

Style typically used for erroneous syntax.


style_function

Style typically used for function definitions.


style_identifier

Style typically used for identifier words.


style_keyword

Style typically used for language keywords.


style_label

Style typically used for labels.


style_nothing

Style typically used for no styling.


style_number

Style typically used for numbers.


style_operator

Style typically used for operators.


style_preproc

Style typically used for preprocessor statements.


style_regex

Style typically used for regular expression strings.


style_string

Style typically used for strings.


style_tag

Style typically used for markup tags.


style_type

Style typically used for static types.


style_variable

Style typically used for variables.


style_whitespace

Style typically used for whitespace.


upper

Matches any uppercase character (A-Z).


word

Matches a typical word starting with a letter or underscore and then any alphanumeric or underscore characters.


xdigit

Matches any hexadecimal digit (0-9, A-F, a-f).


Functions


color (r, g, b)

Creates a Scintilla color.

Parameters:

Usage:


delimited_range (chars, escape, end_optional, balanced, forbidden)

Creates an LPeg pattern that matches a range of characters delimited by specific characters. This can be used to match a string, parentheses, etc.

Parameters:

Usage:


embed_lexer (parent, child, start_rule, end_rule)

Embeds a child lexer language in a parent one.

Parameters:

Usage:


fold (text, start_pos, start_line, start_level)

Folds the given text. Called by LexLPeg.cxx; do not call from Lua. If the current lexer has no _fold function, folding by indentation is performed if the ‘fold.by.indentation’ property is set.

Parameters:

Return:


fold_line_comments (prefix)

Returns a fold function that folds consecutive line comments. This function should be used inside the lexer’s _foldsymbols table.

Parameters:

Usage:


get_fold_level (line_number)

Returns the fold level for a given line. This level already has SC_FOLDLEVELBASE added to it, so you do not need to add it yourself.

Parameters:


get_indent_amount (line)

Returns the indent amount of text for a given line.

Parameters:


get_property (key, default)

Returns an integer property value for a given key.

Parameters:


get_style_at (pos)

Returns the string style name and style number at a given position.

Parameters:


lex (text, init_style)

Lexes the given text. Called by LexLPeg.cxx; do not call from Lua. If the lexer has a _LEXBYLINE flag set, the text is lexed one line at a time. Otherwise the text is lexed as a whole.

Parameters:


load (lexer_name)

Initializes the specified lexer.

Parameters:


nested_pair (start_chars, end_chars, end_optional)

Similar to delimited_range(), but allows for multi-character delimiters. This is useful for lexers with tokens such as nested block comments. With single-character delimiters, this function is identical to delimited_range(start_chars..end_chars, nil, end_optional, true).

Parameters:

Usage:


starts_line (patt)

Creates an LPeg pattern that matches a given pattern only at the beginning of a line, and returns it.

Parameters:

Usage:


style (style_table)

Creates a Scintilla style from a table of style properties.

Parameters:

Usage:

See also:


token (name, patt)

Creates an LPeg capture table index with the name and position of the token.

Parameters:

Usage:


word_match (words, word_chars, case_insensitive)

Creates an LPeg pattern that matches a set of words.

Parameters:

Usage:


Tables


_EMBEDDEDRULES

Set of rules for an embedded lexer. For a parent lexer name, contains child’s start_rule, token_rule, and end_rule patterns.


_RULES

List of rule names with associated LPeg patterns for a specific lexer. It is accessible to other lexers for embedded lexer applications.


colors

Table of common colors for a theme. This table should be redefined in each theme.