Building a custom language in tempest/highlight

Yesterday, I wrote about the why of making a new syntax highlighter. Today I want to write about the how.

Let's explain how tempest/highlight works by implementing a new language — Blade is a good candidate. It looks something like this:

@if(! empty($items))
    <div class="container">
        Items: {{ count($items) }}.
    </div>
@endif

In order to build such a new language, you need to understand three concepts of how code is highlighted: patterns, injections, and languages.

# 1. Patterns

A pattern represents a part of the code that should be highlighted. It can target a single keyword like return or class, or a larger piece of code, such as a comment: /* this is a comment */ or an attribute: #[Get(uri: '/')].

Each pattern is represented by a simple class that provides a regex pattern, and a TokenType. The regex pattern is used to match relevant content to this specific pattern, while the TokenType is an enum value that will determine how that specific pattern is colored.

Here's an example of a simple pattern to match the namespace of a PHP file:

use Tempest\Highlight\IsPattern;
use Tempest\Highlight\Pattern;
use Tempest\Highlight\Tokens\TokenType;

final readonly class NamespacePattern implements Pattern
{
    use IsPattern;

    public function getPattern(): string
    {
        return 'namespace (?<match>[\w\\\\]+)';
    }

    public function getTokenType(): TokenType
    {
        return TokenType::TYPE;
    }
}

Note that each pattern must include a regex capture group that's named match. The content that matched within this group will be highlighted.

For example, the regex namespace (?<match>[\w\\\\]+) says that every occurrence of namespace followed by a space and a name should be taken into account, but only the part within the named group (?<match>…) will actually be colored. In practice, that means the namespace name matching [\w\\\\]+ will be colored, while the namespace keyword itself is not colored by this pattern.
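To see what the named group actually captures, here's a small sketch that runs the same regex through preg_match. It assumes the pattern gets wrapped in / delimiters before matching, which is an assumption for illustration; the sample $code is made up as well.

```php
<?php

// Assumption: the pattern is wrapped in / delimiters before matching.
$pattern = '/namespace (?<match>[\w\\\\]+)/';

// Made-up sample input.
$code = "<?php\n\nnamespace App\\Http\\Controllers;\n";

preg_match($pattern, $code, $matches);

// Only the named group is what would be colored, not the keyword.
echo $matches['match']; // App\Http\Controllers
```

The full match in $matches[0] still includes the namespace keyword; only $matches['match'] holds the part that would be highlighted.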

Yes, you'll need some basic knowledge of regex. Head over to https://regexr.com/ if you need help, or take a look at the existing patterns in this repository.

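In summary:

- A pattern class provides a regex that matches part of your code.
- That regex must contain a group named match, written as (?<match>…); this group is the part that will actually be highlighted.
- A pattern also provides a TokenType, which determines how the match is colored.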

# 2. Injections

Once you've understood patterns, the next step is to understand injections. Injections are used to highlight different languages within one code block. For example: HTML could contain CSS, which should be styled properly as well.

An injection will tell the highlighter that it should treat a block of code as a different language. For example:

<div>
    <x-slot name="styles">
        <style>
            body {
                background-color: red;
            }
        </style>
    </x-slot>
</div>

Everything within <style></style> tags should be treated as CSS. That's done by injection classes:

use Tempest\Highlight\Highlighter;
use Tempest\Highlight\Injection;
use Tempest\Highlight\IsInjection;

final readonly class CssInjection implements Injection
{
    use IsInjection;

    public function getPattern(): string
    {
        return '<style>(?<match>(.|\n)*)<\/style>';
    }

    public function parseContent(string $content, Highlighter $highlighter): string
    {
        return $highlighter->parse($content, 'css');
    }
}

Just like patterns, an injection must provide a pattern. This pattern, for example, will match anything between style tags: <style>(?<match>(.|\n)*)<\/style>.

The second step in providing an injection is to parse the matched content into another language. That's what the parseContent() method is for. In this case, we'll get all code between the style tags that was matched with the named (?<match>…) group, and parse that content as CSS instead of whatever language we're currently dealing with.
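Here's a rough sketch of that idea in plain PHP, without the package. Both $parseAsCss and applyInjection() are hypothetical stand-ins I made up for illustration, and the / delimiters are an assumption; the real highlighter may work differently internally.

```php
<?php

// Hypothetical stand-in for $highlighter->parse($content, 'css').
$parseAsCss = fn (string $content): string => '[css]' . $content . '[/css]';

// Hypothetical helper: swap only the named group inside each full match.
function applyInjection(string $pattern, string $code, callable $parse): string
{
    return preg_replace_callback(
        $pattern,
        fn (array $m): string => str_replace($m['match'], $parse($m['match']), $m[0]),
        $code,
    );
}

$html = '<div><style>body { color: red; }</style></div>';

$result = applyInjection('/<style>(?<match>(.|\n)*)<\/style>/', $html, $parseAsCss);

echo $result; // <div><style>[css]body { color: red; }[/css]</style></div>
```

The style tags themselves stay untouched, so the surrounding language can still highlight them; only the content of the match group is handed over to the other language.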

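In summary:

- An injection provides a regex with a named (?<match>…) group, just like a pattern.
- Its parseContent() method receives the matched content and the highlighter, and returns that content parsed as another language.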

# 3. Languages

The last concept to understand: languages are classes that bring patterns and injections together. Take a look at the HtmlLanguage, for example:

class HtmlLanguage extends BaseLanguage
{
    public function getInjections(): array
    {
        return [
            ...parent::getInjections(),
            new PhpInjection(),
            new PhpShortEchoInjection(),
            new CssInjection(),
            new CssAttributeInjection(),
        ];
    }

    public function getPatterns(): array
    {
        return [
            ...parent::getPatterns(),
            new OpenTagPattern(),
            new CloseTagPattern(),
            new TagAttributePattern(),
            new HtmlCommentPattern(),
        ];
    }
}

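This HtmlLanguage class specifies the following things:

- PHP can be injected within HTML, via PhpInjection and PhpShortEchoInjection.
- CSS can be injected as well, via CssInjection and CssAttributeInjection.
- A handful of patterns highlight HTML tags, tag attributes, and comments.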

On top of that, it extends from BaseLanguage. This is a language class that adds a bunch of cross-language injections, such as blurs and highlights. Your language doesn't need to extend from BaseLanguage and could implement Language directly if you want to.

With these three concepts in place, let's bring everything together to explain how you can add your own languages.

# Adding custom languages

So we're adding Blade support. We could create a new language class and start from scratch, but it's probably easier to extend an existing language, and HtmlLanguage is the best fit. Let's create a new BladeLanguage class that extends from HtmlLanguage:

class BladeLanguage extends HtmlLanguage
{
    public function getInjections(): array
    {
        return [
            ...parent::getInjections(),
        ];
    }

    public function getPatterns(): array
    {
        return [
            ...parent::getPatterns(),
        ];
    }
}
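The ...parent::getPatterns() spread is what makes extending a language cheap: the child keeps everything the parent registered and appends its own entries. A minimal sketch with hypothetical stand-in classes (plain strings instead of real pattern objects, and not the package's actual classes):

```php
<?php

// Hypothetical stand-ins, only here to show the spread mechanics.
class ParentLanguage
{
    public function getPatterns(): array
    {
        return ['OpenTagPattern', 'CloseTagPattern'];
    }
}

class ChildLanguage extends ParentLanguage
{
    public function getPatterns(): array
    {
        return [
            ...parent::getPatterns(), // everything the parent registered
            'BladeKeywordPattern',    // plus our own addition
        ];
    }
}

$patterns = (new ChildLanguage())->getPatterns();

print_r($patterns);
```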

With this class in place, we can start adding our own patterns and injections. Let's start with a pattern that matches all Blade keywords, which are always prefixed with the @ sign:

final readonly class BladeKeywordPattern implements Pattern
{
    use IsPattern;

    public function getPattern(): string
    {
        return '(?<match>\@[\w]+)\b';
    }

    public function getTokenType(): TokenType
    {
        return TokenType::KEYWORD;
    }
}
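As a quick sanity check, here's what that regex picks up in a small template. This is only a sketch: the / delimiters are an assumption, and the template is a made-up sample.

```php
<?php

// A small sample template.
$blade = <<<'BLADE'
@if(! empty($items))
    <div class="container">
        Items: {{ count($items) }}.
    </div>
@endif
BLADE;

// Assumption: the pattern is wrapped in / delimiters before matching.
preg_match_all('/(?<match>\@[\w]+)\b/', $blade, $matches);

print_r($matches['match']); // contains @if and @endif
```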

And register it in our BladeLanguage class:

    public function getPatterns(): array
    {
        return [
            ...parent::getPatterns(),
            new BladeKeywordPattern(),
        ];
    }

Next, there are a couple of places within Blade where you can write PHP code: between the @php and @endphp keywords, as well as within keyword brackets: @if (count(…)). Let's write two injections for that:

final readonly class BladePhpInjection implements Injection
{
    use IsInjection;

    public function getPattern(): string
    {
        return '\@php(?<match>(.|\n)*?)\@endphp';
    }

    public function parseContent(string $content, Highlighter $highlighter): string
    {
        return $highlighter->parse($content, 'php');
    }
}

final readonly class BladeKeywordInjection implements Injection
{
    use IsInjection;

    public function getPattern(): string
    {
        return '(\@[\w]+)\s?\((?<match>.*)\)';
    }

    public function parseContent(string $content, Highlighter $highlighter): string
    {
        return $highlighter->parse($content, 'php');
    }
}
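To check what each of these two regexes hands over to the PHP highlighter, here's a small sketch using preg_match directly. The / delimiters and the sample template are assumptions for illustration.

```php
<?php

// Assumption: patterns are wrapped in / delimiters before matching.
$phpBlock = '/\@php(?<match>(.|\n)*?)\@endphp/';
$keywordArgs = '/(\@[\w]+)\s?\((?<match>.*)\)/';

// Made-up sample template.
$blade = <<<'BLADE'
@php
    $total = count($items);
@endphp
@if(! empty($items))
BLADE;

preg_match($phpBlock, $blade, $php);
preg_match($keywordArgs, $blade, $keyword);

var_dump(trim($php['match'])); // the PHP between @php and @endphp
var_dump($keyword['match']);   // the PHP between the outer brackets
```

Note that the keyword regex uses a greedy .* on purpose: it runs up to the last closing bracket on the line, which is what keeps nested calls like empty($items) intact.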

Let's add these to our BladeLanguage class as well:

    public function getInjections(): array
    {
        return [
            ...parent::getInjections(),
            new BladePhpInjection(),
            new BladeKeywordInjection(),
        ];
    }

Next, you can write {{ … }} and {!! … !!} to echo output. Whatever is between these brackets is also considered PHP, so, one more injection:

final readonly class BladeEchoInjection implements Injection
{
    use IsInjection;

    public function getPattern(): string
    {
        return '({{|{!!)(?<match>.*)(}}|!!})';
    }

    public function parseContent(string $content, Highlighter $highlighter): string
    {
        return $highlighter->parse($content, 'php');
    }
}
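One thing to keep in mind with this pattern: .* is greedy, so when two echo statements sit on the same line, the match runs from the first opening brackets to the last closing ones. The sketch below compares it with a lazy variant (.*?), which is only a possible alternative I'm showing for illustration, not what the package necessarily does; the / delimiters are an assumption too.

```php
<?php

// Made-up sample line with two echo statements.
$line = 'Hello {{ $first }} and {{ $second }}';

// The pattern as written above (greedy).
preg_match_all('/({{|{!!)(?<match>.*)(}}|!!})/', $line, $greedy);

// A possible lazy variant: stops at the first closing brackets.
preg_match_all('/({{|{!!)(?<match>.*?)(}}|!!})/', $line, $lazy);

print_r($greedy['match']); // one match spanning both statements
print_r($lazy['match']);   // two separate matches
```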

And, finally, you can write Blade comments like so: {{-- --}}. This can be a simple pattern:

final readonly class BladeCommentPattern implements Pattern
{
    use IsPattern;

    public function getPattern(): string
    {
        return '(?<match>\{\{\-\-(.|\n)*?\-\-\}\})';
    }

    public function getTokenType(): TokenType
    {
        return TokenType::COMMENT;
    }
}
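Because the regex uses (.|\n)*?, it also covers comments that span multiple lines, and it stops at the first closing --}}. A small sketch, assuming / delimiters and a made-up template:

```php
<?php

// Made-up sample: a two-line comment followed by regular markup.
$blade = <<<'BLADE'
{{-- TODO:
     remove this --}}
<p>{{ $name }}</p>
BLADE;

// Assumption: the pattern is wrapped in / delimiters before matching.
preg_match('/(?<match>\{\{\-\-(.|\n)*?\-\-\}\})/', $blade, $matches);

var_dump($matches['match']); // the whole comment, including both lines
```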

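With all of that in place (and BladeEchoInjection and BladeCommentPattern registered in BladeLanguage the same way as before), the only thing left to do is to add our language to the highlighter: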

$highlighter->addLanguage('blade', new BladeLanguage());

And we're done! Blade support with just a handful of patterns and injections!

I think that both the ability to extend other languages and language injections are really powerful for quickly building new languages. Of course, you're free to send pull requests with support for additional languages as well! Take a look at the package's tests to learn how to write tests for patterns and injections.
