Parsing HTML · Jun 14, 03:49 PM by Dylan Doxey

Did you ever think to yourself, "I wish I could split this HTML document into an array of tokens with descriptive keys"?

Well, it's occurred to me. So here's what I came up with.

package Dox::Parser;

use strict;
use warnings;
{
    use Carp;
    use File::Slurp qw( slurp );
}

my (%TYPE_REGEX_FOR,@TYPES);
{
    use Readonly;

    # HTML identifiers may have : or - such as xml:lang or http-equiv.
    # No recognition of mixed case HTML identifiers such as <Body> or <Title>.
    my $ident_re  = qr{ (?: [a-z:-]+ | [A-Z:-]+ ) }xms;

    # Quoted strings may contain backslash-escaped quotes
    my $q_str_re  = qr{ ' (?: [\\]['] | [^'] )* ' }xms;
    my $qq_str_re = qr{ " (?: [\\]["] | [^"] )* " }xms;

    Readonly %TYPE_REGEX_FOR => (
        terminal_tag   => qr{ ( < \s* / \s* $ident_re [^>]* > ) }xms,
        begin_tag      => qr{ ( < \s* $ident_re ) }xms,
        end_tag        => qr{ ( /? > ) }xms,
        template_code  => qr{ ( \[% \s* .+? \s* %\] ) }xms,
        open_comment   => qr{ ( <!-- ) }xms,
        close_comment  => qr{ ( --> ) }xms,
        open_doctype   => qr{ ( <!DOCTYPE ) \s }xms,
        attribute      => qr{ ( $ident_re \s*=\s* (?: $q_str_re | $qq_str_re ) ) [\s/>] }xms,
        html_word      => qr{ ( $ident_re ) \s }xms,
        quoted_string  => qr{ ( $q_str_re | $qq_str_re ) }xms,
        whitespace     => qr{ ( \s+ ) }xms,
        content        => qr{ ( [^<]+? ) (?: \[ [%] | [<] ) }xms,
    );

    # The priority of types in evaluating
    # the leading characters of the HTML string.
    Readonly @TYPES => qw(
        open_doctype
        open_comment
        close_comment
        whitespace
        terminal_tag
        begin_tag
        end_tag
        attribute
        html_word
        quoted_string
        template_code
        content
    );
}

sub new {
    my ($class,$filename) = @_;

    croak "can't find $filename\n"
        if !stat $filename;

    # Memory usage concerns? Sorry. :(
    my $html = slurp( $filename );

    my $self = bless {
        html       => $html,
        last_token => [],
    }, $class;

    return $self;
}

sub next_token {
    my $self = shift;

    my $html = $self->{html};

    for my $type (@TYPES) {

        my $regex = $TYPE_REGEX_FOR{$type};

        if ( $html =~ m/\A $regex /xms ) {

            my $token = $1;

            $self->{html} = substr $html, length $token;

            push @{ $self->{last_token} }, $token;

            return { type => $type, token => $token, };
        }
    }
    return;
}

sub push_back {
    my ($self,$token) = @_;

    $self->{html} = $token . $self->{html};

    return length $token;
}

1;
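
The priority list in @TYPES is the heart of the module: at each step, the first pattern that matches at the very start of the remaining text wins, and that match is chopped off the front. As a rough sketch of that idea only (not the module itself), here is a cut-down version that works on a string instead of a file. The rule table is a reduced subset of the patterns above, and the tokenize() function and the sample input are made up for illustration. It uses only core Perl, so Readonly and File::Slurp aren't needed.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical, reduced rule table: [ type => regex ] pairs in priority order.
# The patterns are copied from Dox::Parser with the identifier simplified.
my @RULES = (
    [ whitespace   => qr{ ( \s+ ) }xms ],
    [ terminal_tag => qr{ ( < \s* / \s* [a-z:-]+ [^>]* > ) }xms ],
    [ begin_tag    => qr{ ( < \s* [a-z:-]+ ) }xms ],
    [ end_tag      => qr{ ( /? > ) }xms ],
    [ content      => qr{ ( [^<]+? ) (?: \[ [%] | [<] ) }xms ],
);

# Hypothetical helper: split a string into { type, token } hashes.
sub tokenize {
    my ($html) = @_;

    my @tokens;

    TOKEN:
    while ( length $html ) {

        for my $rule (@RULES) {

            my ( $type, $regex ) = @{$rule};

            # Only a match anchored at the start of the string counts.
            if ( $html =~ m/\A $regex /xms ) {

                my $token = $1;

                push @tokens, { type => $type, token => $token };

                # Drop the matched token from the front of the text.
                $html = substr $html, length $token;

                next TOKEN;
            }
        }

        # No rule matched: stop instead of looping forever.
        last TOKEN;
    }

    return @tokens;
}

my @tokens = tokenize('<p>Hi</p>');

printf "%s: {%s}\n", $_->{type}, $_->{token} for @tokens;
```

For the sample input this gives four tokens: a begin_tag, an end_tag, a content token, and a terminal_tag. Joining the token strings back together reproduces the input, which is the same property the test program below relies on.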

Here's a little program, which I like to call parser_tester.pl, that demonstrates this baby in action.

#!/usr/bin/perl

use strict;
use warnings;
{
    use lib qw( . );
    use Dox::Parser;
    use File::Slurp qw( write_file );
    use Term::ANSIColor qw( :constants );
}

my %filename = (
    before => 'document.html',
    after  => 'parsed_document.html',
);

my $document_text = "";
my $parser = Dox::Parser->new( $filename{before} );

while ( my $token_rh = $parser->next_token() ) {

    print "{" . $token_rh->{token} . "}";
    print GREEN, "(" . $token_rh->{type} . ")", RESET;
    print "\n";

    $document_text .= $token_rh->{token};
}

write_file( $filename{after}, $document_text );

print RED, BOLD, "\ncreated $filename{after}\n\n", RESET;

1;

When you run parser_tester.pl, it prints each parsed HTML token in curly brackets, followed by the token's type in green (thanks to Term::ANSIColor). The program assumes your sample HTML is in document.html, and it writes the concatenated tokens to parsed_document.html. The two files ought to be identical, which indicates the parser accounted for every character and didn't forget anything. If the parser reaches text that none of the patterns match, next_token returns nothing, the loop ends early, and parsed_document.html comes out shorter than the original.
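
If you'd rather not eyeball the two files, the comparison can be automated. Here's a minimal sketch using the core File::Compare module. The files_identical() helper is a made-up name, and the file names are simply the ones parser_tester.pl uses, so adjust them if yours differ.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use File::Compare qw( compare );

# Hypothetical helper: true when both files have identical contents.
sub files_identical {
    my ( $before, $after ) = @_;

    # compare() returns 0 if equal, 1 if different, -1 on error.
    my $result = compare( $before, $after );

    die "could not compare $before and $after: $!\n"
        if $result < 0;

    return $result == 0;
}

# File names assumed to match the ones in parser_tester.pl.
my ( $before, $after ) = ( 'document.html', 'parsed_document.html' );

if ( -e $before && -e $after ) {

    my $message
        = files_identical( $before, $after )
        ? "round trip OK\n"
        : "round trip FAILED: some text was dropped or altered\n";

    print $message;
}
```

Run it from the same directory after parser_tester.pl has finished. A failure usually means some stretch of the document matched none of the token patterns.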


Coming next: Dox::FSA -- a Finite State Automaton module that makes sense of the token stream so you can make correct and useful modifications to the HTML document.
