Building a custom language in tempest/highlight
Yesterday, I wrote about the why of making a new syntax highlighter. Today I want to write about the how.
Let's explain how `tempest/highlight` works by implementing a new language. Blade is a good candidate. It looks something like this:

    @if(! empty($items))
        <div class="container">
            Items: {{ count($items) }}.
        </div>
    @endif
In order to build such a new language, you need to understand three concepts of how code is highlighted: patterns, injections, and languages.
# 1. Patterns
A pattern represents a part of the code that should be highlighted. A pattern can target a single keyword like `return` or `class`, or any other part of the code, such as a comment (`/* this is a comment */`) or an attribute (`#[Get(uri: '/')]`).
Each pattern is represented by a simple class that provides a regex pattern and a `TokenType`. The regex pattern is used to match relevant content to this specific pattern, while the `TokenType` is an enum value that determines how that specific pattern is colored.
Here's an example of a simple pattern to match the namespace of a PHP file:

    use Tempest\Highlight\IsPattern;
    use Tempest\Highlight\Pattern;
    use Tempest\Highlight\Tokens\TokenType;

    final readonly class NamespacePattern implements Pattern
    {
        use IsPattern;

        public function getPattern(): string
        {
            return 'namespace (?<match>[\w\\\\]+)';
        }

        public function getTokenType(): TokenType
        {
            return TokenType::TYPE;
        }
    }
Note that each pattern must include a regex capture group that's named `match`. The content that matched within this group will be highlighted.
For example, the regex `namespace (?<match>[\w\\\\]+)` matches every occurrence of `namespace` followed by a space and a name, but only the part within the named group `(?<match>…)` will actually be colored. In practice, that means only the namespace name matching `[\w\\\\]+` is colored.
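To make the role of the named group concrete, here is a minimal sketch in plain PHP. It does not use the package at all: `wrapNamedMatch()` is a hypothetical helper, and the square-bracket markers only stand in for real coloring. The library's internals may well work differently; the point is just that the text outside `(?<match>…)` is left untouched.

```php
<?php

// Hypothetical helper, not part of tempest/highlight: it wraps only the text
// captured by the named "match" group and leaves the rest of the input as-is.
function wrapNamedMatch(string $regex, string $input, string $label): string
{
    if (preg_match($regex, $input, $matches, PREG_OFFSET_CAPTURE) !== 1) {
        return $input; // nothing matched, so there is nothing to mark
    }

    // With PREG_OFFSET_CAPTURE every group is a [text, byte offset] pair.
    [$text, $offset] = $matches['match'];

    return substr_replace($input, "[{$label}:{$text}]", $offset, strlen($text));
}

// Same regex as NamespacePattern::getPattern(), with delimiters added.
$regex = '/namespace (?<match>[\w\\\\]+)/';

$result = wrapNamedMatch($regex, 'namespace App\Http\Controllers;', 'type');

echo $result, PHP_EOL;
```

The `namespace` keyword and the trailing semicolon are part of the overall match or its surroundings, but only the namespace name ends up inside the marker.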
Yes, you'll need some basic knowledge of regex. Head over to https://regexr.com/ if you need help, or take a look at the existing patterns in this repository.
In summary:
- Pattern classes provide a regex pattern that matches parts of code.
- Those regexes should contain a group named `match`, written as `(?<match>…)`. This group represents the code that will actually be highlighted.
- Finally, a pattern provides a `TokenType`, which is used to determine the highlight style for the specific match.
# 2. Injections
Once you've understood patterns, the next step is to understand injections. Injections are used to highlight different languages within one code block. For example: HTML could contain CSS, which should be styled properly as well.
An injection will tell the highlighter that it should treat a block of code as a different language. For example:

    <div>
        <x-slot name="styles">
            <style>
                body {
                    background-color: red;
                }
            </style>
        </x-slot>
    </div>
Everything within `<style></style>` tags should be treated as CSS. That's done by injection classes:

    use Tempest\Highlight\Highlighter;
    use Tempest\Highlight\Injection;
    use Tempest\Highlight\IsInjection;

    final readonly class CssInjection implements Injection
    {
        use IsInjection;

        public function getPattern(): string
        {
            return '<style>(?<match>(.|\n)*)<\/style>';
        }

        public function parseContent(string $content, Highlighter $highlighter): string
        {
            return $highlighter->parse($content, 'css');
        }
    }
Just like patterns, an injection must provide a pattern. This pattern, for example, will match anything between style tags: `<style>(?<match>(.|\n)*)<\/style>`.
The second step in providing an injection is to parse the matched content into another language. That's what the `parseContent()` method is for. In this case, we'll get all code between the style tags that was matched with the named `(?<match>…)` group, and parse that content as CSS instead of whatever language we're currently dealing with.
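The delegation step can be sketched without the package. In the snippet below, `$parseAsCss` is a hypothetical stand-in for `$highlighter->parse($content, 'css')` that only tags its input so the hand-off is visible. This is an illustration of the idea under those assumptions, not the library's implementation.

```php
<?php

// Hypothetical stand-in for a CSS highlighter: it only marks its input.
$parseAsCss = fn (string $css): string => '[css]' . $css . '[/css]';

$html = '<div><style>body { color: red; }</style></div>';

// Same regex as CssInjection::getPattern(), with delimiters added.
$regex = '/<style>(?<match>(.|\n)*)<\/style>/';

$result = preg_replace_callback(
    $regex,
    // Keep the surrounding tags and hand only the named group to the other "language".
    fn (array $m): string => '<style>' . $parseAsCss($m['match']) . '</style>',
    $html,
);

echo $result, PHP_EOL;
```

Only the content between the tags is passed to the CSS stand-in; the `<div>` and `<style>` tags stay with the outer language.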
In summary:
- Injections provide a regex that matches a block of code written in one language while it's embedded in another.
- Just like patterns, injection regexes should contain a group named `match`, written as `(?<match>…)`.
- Finally, an injection will use the highlighter to parse its matched content as another language.
# 3. Languages
The last concept to understand: languages are classes that bring patterns and injections together. Take a look at the `HtmlLanguage`, for example:

    class HtmlLanguage extends BaseLanguage
    {
        public function getInjections(): array
        {
            return [
                ...parent::getInjections(),
                new PhpInjection(),
                new PhpShortEchoInjection(),
                new CssInjection(),
                new CssAttributeInjection(),
            ];
        }

        public function getPatterns(): array
        {
            return [
                ...parent::getPatterns(),
                new OpenTagPattern(),
                new CloseTagPattern(),
                new TagAttributePattern(),
                new HtmlCommentPattern(),
            ];
        }
    }
This `HtmlLanguage` class specifies the following things:
- PHP can be injected within HTML, both with the short echo tag `<?=` and the longer `<?php` tags.
- CSS can be injected as well. JavaScript support is still a work in progress.
- There are a bunch of patterns to highlight HTML tags properly.
On top of that, it extends from `BaseLanguage`. This is a language class that adds a bunch of cross-language injections, such as blurs and highlights. Your language doesn't need to extend from `BaseLanguage` and could implement `Language` directly if you want to.
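The `...parent::getInjections()` spread is what lets each language build on the one it extends. Here is a toy model of that mechanism, with plain strings in place of injection objects. The `Toy*` classes and the string labels are hypothetical and exist only for this illustration.

```php
<?php

// Toy stand-ins, not the library's classes: strings replace injection objects.
class ToyBaseLanguage
{
    /** @return string[] */
    public function getInjections(): array
    {
        return ['blur', 'highlight'];
    }
}

class ToyHtmlLanguage extends ToyBaseLanguage
{
    public function getInjections(): array
    {
        // Everything from the parent first, then this language's own additions.
        return [...parent::getInjections(), 'php', 'css'];
    }
}

class ToyChildLanguage extends ToyHtmlLanguage
{
    public function getInjections(): array
    {
        return [...parent::getInjections(), 'child-specific'];
    }
}

$injections = (new ToyChildLanguage())->getInjections();

print_r($injections);
```

Each level only lists what it adds, and the child ends up with the full set from every ancestor.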
With these three concepts in place, let's bring everything together to explain how you can add your own languages.
# Adding custom languages
So we're adding Blade support. We could create a new language class and start from scratch, but it's probably easier to extend an existing language, and `HtmlLanguage` is the best fit. Let's create a new `BladeLanguage` class that extends `HtmlLanguage`:

    class BladeLanguage extends HtmlLanguage
    {
        public function getInjections(): array
        {
            return [
                ...parent::getInjections(),
            ];
        }

        public function getPatterns(): array
        {
            return [
                ...parent::getPatterns(),
            ];
        }
    }
With this class in place, we can start adding our own patterns and injections. Let's start with a pattern that matches all Blade keywords, which are always prefixed with the `@` sign:

    final readonly class BladeKeywordPattern implements Pattern
    {
        use IsPattern;

        public function getPattern(): string
        {
            return '(?<match>\@[\w]+)\b';
        }

        public function getTokenType(): TokenType
        {
            return TokenType::KEYWORD;
        }
    }
And register it in our `BladeLanguage` class:

    public function getPatterns(): array
    {
        return [
            ...parent::getPatterns(),
            new BladeKeywordPattern(),
        ];
    }
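As a quick sanity check of the keyword regex, it can be run against a small Blade snippet with plain `preg_match_all()`. The sample template below is made up for this check, and the snippet does not involve the package itself.

```php
<?php

// Made-up Blade sample for this check (nowdoc, so "$" is not interpolated).
$blade = <<<'BLADE'
@if(! empty($items))
    <div class="container">
        Items: {{ count($items) }}.
    </div>
@endif
BLADE;

// Same regex as BladeKeywordPattern::getPattern(), with delimiters added.
preg_match_all('/(?<match>\@[\w]+)\b/', $blade, $matches);

$keywords = $matches['match'];

print_r($keywords);
```

Only the `@`-prefixed words are captured; the parentheses and their contents are left for the injections.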
Next, there are a couple of places within Blade where you can write PHP code: within the `@php` keyword, as well as within keyword brackets: `@if (count(…))`. Let's write two injections for that:

    final readonly class BladePhpInjection implements Injection
    {
        use IsInjection;

        public function getPattern(): string
        {
            return '\@php(?<match>(.|\n)*?)\@endphp';
        }

        public function parseContent(string $content, Highlighter $highlighter): string
        {
            return $highlighter->parse($content, 'php');
        }
    }

    final readonly class BladeKeywordInjection implements Injection
    {
        use IsInjection;

        public function getPattern(): string
        {
            return '(\@[\w]+)\s?\((?<match>.*)\)';
        }

        public function parseContent(string $content, Highlighter $highlighter): string
        {
            return $highlighter->parse($content, 'php');
        }
    }
Let's add these to our `BladeLanguage` class as well:

    public function getInjections(): array
    {
        return [
            ...parent::getInjections(),
            new BladePhpInjection(),
            new BladeKeywordInjection(),
        ];
    }
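To see which part of the input each of these two injections would hand over as PHP, the regexes can be tried in isolation with `preg_match()`. The inputs are made-up samples, and this only exercises the regexes, not the highlighter.

```php
<?php

// Regexes copied from BladePhpInjection and BladeKeywordInjection, with delimiters added.
$phpBlock = '/\@php(?<match>(.|\n)*?)\@endphp/';
$keywordArgs = '/(\@[\w]+)\s?\((?<match>.*)\)/';

// A multi-line @php ... @endphp block.
preg_match($phpBlock, "@php\n    \$total = count(\$items);\n@endphp", $block);

// A keyword followed by bracketed PHP.
preg_match($keywordArgs, '@if (count($items) > 0)', $args);

var_dump($block['match']);
var_dump($args['match']);
```

In the second case, the greedy `.*` runs up to the last closing bracket on the line, which is why the nested `count(…)` call stays intact inside the captured group.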
Next, you can write `{{ … }}` and `{!! … !!}` to echo output. Whatever is between these brackets is also considered PHP, so, one more injection:

    final readonly class BladeEchoInjection implements Injection
    {
        use IsInjection;

        public function getPattern(): string
        {
            return '({{|{!!)(?<match>.*)(}}|!!})';
        }

        public function parseContent(string $content, Highlighter $highlighter): string
        {
            return $highlighter->parse($content, 'php');
        }
    }
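The echo regex can be checked the same way. The sketch below uses made-up inputs and plain `preg_match()`. The third input is there to show how the greedy `.*` behaves when a single line contains two echo statements.

```php
<?php

// Same regex as BladeEchoInjection::getPattern(), with delimiters added.
$echo = '/({{|{!!)(?<match>.*)(}}|!!})/';

preg_match($echo, 'Items: {{ count($items) }}.', $escaped);
preg_match($echo, '{!! $html !!}', $raw);

// Two echoes on one line: the greedy ".*" spans from the first opener to the last closer.
preg_match($echo, '{{ $a }} and {{ $b }}', $two);

var_dump($escaped['match'], $raw['match'], $two['match']);
```

For templates that put several echoes on one line, a lazy quantifier (`.*?`) would be one way to capture each echo separately. Whether that matters depends on the code you want to highlight.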
And, finally, you can write Blade comments like so: `{{-- --}}`. This can be a simple pattern:

    final readonly class BladeCommentPattern implements Pattern
    {
        use IsPattern;

        public function getPattern(): string
        {
            return '(?<match>\{\{\-\-(.|\n)*?\-\-\}\})';
        }

        public function getTokenType(): TokenType
        {
            return TokenType::COMMENT;
        }
    }
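Because the comment regex uses a lazy `(.|\n)*?`, it should match each comment on its own, including comments that span several lines. Here is a small check with a made-up input, again using plain PHP rather than the package:

```php
<?php

// Made-up input: one single-line comment and one comment spanning two lines.
$source = "{{-- first --}} <p>text</p> {{-- second\nline --}}";

// Same regex as BladeCommentPattern::getPattern(), with delimiters added.
preg_match_all('/(?<match>\{\{\-\-(.|\n)*?\-\-\}\})/', $source, $matches);

$comments = $matches['match'];

print_r($comments);
```

The markup between the two comments is not captured, since the lazy quantifier stops at the first closing `--}}`.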
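With all of that in place, and with `BladeEchoInjection` and `BladeCommentPattern` registered in `BladeLanguage` the same way as before, the only thing left to do is to add our language to the highlighter:

    $highlighter->addLanguage('blade', new BladeLanguage());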
And we're done! Blade support with just a handful of patterns and injections!
I think the ability to extend existing languages, combined with language injections, makes it really quick to build new languages. Of course, you're free to send pull requests with support for additional languages as well! Take a look at the package's tests to learn how to write tests for patterns and injections.