
HTML 5 support in PHP 8.4

Even though HTML 5 has been around for over 16 years, PHP never had proper support for it. PHP does have \DOMDocument, which in theory should support HTML 4, but it isn't really HTML 4 compliant anymore.

So, yeah, classic PHP, right 😅? Well, we can laugh all we want, but let's take a moment to highlight new features that, albeit late, fix these quirks: PHP 8.4 is adding an HTML 5 compliant parser! In this post I'll go through the highlights of this new parser; the RFC has all the details if you want to read along.

# Backwards compatible

One of the core requirements for this new parser is that it should be fully backwards compatible. That's why internals have chosen to make a completely new class, within a new namespace, to house the new HTML 5 parser. The old \DOMDocument class is left alone. The new, HTML 5 compliant implementation is \Dom\HTMLDocument, which extends the abstract \Dom\Document class.

If you want to use PHP's new HTML 5 parser, that's the one you need:

// HTML 5 compliant
$dom = \Dom\HTMLDocument::createFromString($html); 

While the old version is still available as usual:

// HTML 4-ish support
$oldDom = new \DOMDocument(); 
$oldDom->loadHTML($html);
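To see the two side by side, here's a minimal sketch that feeds the same markup to both parsers. The markup and variable names are made up for illustration, and it assumes PHP 8.4 with the DOM extension enabled. LIBXML_NOERROR and libxml_use_internal_errors() are only there to keep parser warnings out of the output.

```php
<?php
// Illustrative markup (not from the RFC), using an HTML 5 element.
$html = '<!DOCTYPE html><html><body><section>Hello HTML 5</section></body></html>';

// New, HTML 5 compliant parser.
$dom = \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);

// Old parser: collect libxml errors internally instead of emitting warnings.
libxml_use_internal_errors(true);
$oldDom = new \DOMDocument();
$oldDom->loadHTML($html);
libxml_clear_errors();

// Both expose a familiar DOM API, but they are different classes.
echo $dom::class, PHP_EOL;    // Dom\HTMLDocument
echo $oldDom::class, PHP_EOL; // DOMDocument

$section = $dom->getElementsByTagName('section')->item(0);
echo $section->textContent, PHP_EOL; // Hello HTML 5
```

Because the old class is untouched, existing code keeps working as-is, and you can switch to the new parser one call site at a time.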

# Constructing DOMs

One key difference you'll spot immediately is that the new implementation relies on static constructors instead of calling methods on the newly created object afterward. The new HTMLDocument class has three named constructors available:

HTMLDocument::createEmpty();
HTMLDocument::createFromFile($path);
HTMLDocument::createFromString($html);

These are their full signatures:

public static function createEmpty(string $encoding = "UTF-8"): HTMLDocument;
public static function createFromFile(string $path, int $options = 0, ?string $override_encoding = null): HTMLDocument;
public static function createFromString(string $source, int $options = 0, ?string $override_encoding = null): HTMLDocument;
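Here's a rough sketch of all three named constructors in action, assuming PHP 8.4. The markup is invented for the example, and a temporary file stands in for a real HTML file on disk.

```php
<?php
use Dom\HTMLDocument;

// 1. Start from an empty document and build it up by hand.
$empty = HTMLDocument::createEmpty();
$paragraph = $empty->createElement('p');
$paragraph->textContent = 'Built from scratch';
$empty->appendChild($paragraph);

// 2. Parse a string that's already in memory.
$fromString = HTMLDocument::createFromString(
    '<!DOCTYPE html><html><body><p>From a string</p></body></html>',
    LIBXML_NOERROR,
);

// 3. Parse a file on disk (a temporary file, purely for illustration).
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents($path, '<!DOCTYPE html><html><body><p>From a file</p></body></html>');
$fromFile = HTMLDocument::createFromFile($path, LIBXML_NOERROR);
unlink($path);

// Print the first paragraph of each document.
foreach ([$empty, $fromString, $fromFile] as $document) {
    echo $document->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
}
```

Whichever constructor you pick, you end up with the same kind of object, so the rest of your code doesn't need to care where the document came from.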

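The $options parameter is a bitmask of parser flags, such as LIBXML_NOERROR to suppress parse-error warnings or LIBXML_HTML_NOIMPLIED to skip the implied html, head, and body elements; the RFC has the full list.

The $override_encoding parameter overrides the implicit encoding detection routines defined by the HTML parser spec. This can be useful when you download the document yourself and already know its encoding.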
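As a small sketch of the last two parameters together: the bytes below are a hypothetical stand-in for a page you downloaded yourself and know to be windows-1252 encoded. The arguments are passed positionally, and the example assumes PHP 8.4.

```php
<?php
use Dom\HTMLDocument;

// "café" encoded as windows-1252: the é is the single byte 0xE9.
// (Hypothetical input, standing in for a manually downloaded page.)
$bytes = "<!DOCTYPE html><html><body><p>caf\xE9</p></body></html>";

// Suppress parse-error warnings, and tell the parser which encoding to use
// instead of letting it detect one.
$dom = HTMLDocument::createFromString($bytes, LIBXML_NOERROR, 'windows-1252');

// Strings read back from the DOM are UTF-8.
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

If you leave the third argument out, the parser falls back to its own detection, which is usually what you want for documents that declare their encoding properly.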

# DOM Objects

Note that the new implementation also gives you different node classes. For example, instead of \DOMNode, you'll get \Dom\Node; instead of \DOMElement, you'll get \Dom\Element, etc. The RFC originally aimed to keep these objects the same between the old and new implementation, but there turned out to be too many differences. You can read all about them in the RFC.
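In practice this mostly matters for type hints. Here's a minimal sketch, assuming PHP 8.4; describe() is a hypothetical helper, not part of the DOM extension.

```php
<?php
// Hypothetical helper, typed against the new node classes.
function describe(\Dom\Element $element): string
{
    return sprintf('<%s> is a %s', $element->localName, $element::class);
}

$dom = \Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><html><body><p>Hi</p></body></html>',
    LIBXML_NOERROR,
);
$p = $dom->getElementsByTagName('p')->item(0);

echo describe($p), PHP_EOL;
var_dump($p instanceof \Dom\Node);   // bool(true)
var_dump($p instanceof \DOMElement); // bool(false)
```

So if you have functions typed against \DOMElement or \DOMNode, they won't accept nodes from the new parser until you update their signatures.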


Albeit a bit late, I think this is a very nice addition to PHP. I definitely have some use cases for it! What are your thoughts? You can leave them in the comments down below!