Parsing HTML · Jun 14, 03:49 PM by Dylan Doxey
Did you ever think to yourself, "I wish I could split this HTML document up into an array of tokens with descriptive keys"?
Well, it's occurred to me. So here's what I came up with.
package Dox::Parser;
use strict;
use warnings;
{
    use Carp;
    use File::Slurp qw( slurp );
}
my (%TYPE_REGEX_FOR,@TYPES);
{
    use Readonly;

    # HTML identifiers may have : or - such as xml:lang or http-equiv,
    # and digits such as h1.
    # No recognition of mixed case HTML identifiers such as <Body> or <Title>.
    my $ident_re = qr{ (?: [a-z0-9:-]+ | [A-Z0-9:-]+ ) }xms;

    # Quoted strings may contain backslash-escaped quotes.
    my $q_str_re  = qr{ ' (?: [\\]['] | [^'] )* ' }xms;
    my $qq_str_re = qr{ " (?: [\\]["] | [^"] )* " }xms;

    Readonly %TYPE_REGEX_FOR => (
        terminal_tag  => qr{ ( < \s* / \s* $ident_re [^>]* > ) }xms,
        begin_tag     => qr{ ( < \s* $ident_re ) }xms,
        end_tag       => qr{ ( /? > ) }xms,
        template_code => qr{ ( \[% \s* .+? \s* %\] ) }xms,
        open_comment  => qr{ ( <!-- ) }xms,
        close_comment => qr{ ( --> ) }xms,
        open_doctype  => qr{ ( <!DOCTYPE ) \s }xms,
        attribute     => qr{ ( $ident_re \s*=\s* (?: $q_str_re | $qq_str_re ) ) [\s/>] }xms,
        html_word     => qr{ ( $ident_re ) \s }xms,
        quoted_string => qr{ ( $q_str_re | $qq_str_re ) }xms,
        whitespace    => qr{ ( \s+ ) }xms,
        # Content runs up to the next tag, the next template block,
        # or the end of the input.
        content       => qr{ ( [^<]+? ) (?: \[ [%] | [<] | \z ) }xms,
    );

    # The priority of types in evaluating
    # the leading characters of the HTML string.
    Readonly @TYPES => qw(
        open_doctype
        open_comment
        close_comment
        whitespace
        terminal_tag
        begin_tag
        end_tag
        attribute
        html_word
        quoted_string
        template_code
        content
    );
}
sub new {
    my ($class,$filename) = @_;

    croak "can't find $filename\n"
        if !stat $filename;

    # Memory usage concerns? Sorry. :(
    my $html = slurp( $filename );

    my $self = bless {
        html       => $html,
        last_token => [],
    }, $class;

    return $self;
}
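# Returns the next token as { type => ..., token => ... }, or nothing when
# the input is used up or no pattern matches the remaining text.
sub next_token {
    my $self = shift;
    my $html = $self->{html};

    # Try each type in priority order; the first pattern that matches
    # at the start of the remaining HTML wins.
    for my $type (@TYPES) {
        my $regex = $TYPE_REGEX_FOR{$type};
        if ( $html =~ m/\A $regex /xms ) {
            my $token = $1;

            # Consume the token from the front of the buffer.
            $self->{html} = substr $html, length $token;
            push @{ $self->{last_token} }, $token;
            return { type => $type, token => $token, };
        }
    }

    return;
}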
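# Puts token text (the string, not the hash ref) back at the front of
# the buffer so the next call to next_token sees it again.
sub push_back {
    my ($self,$token) = @_;
    $self->{html} = $token . $self->{html};
    return length $token;
}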
1;
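For a feel of how the first-match-wins loop in next_token behaves, here's a minimal standalone sketch. It is not Dox::Parser itself: the @rules table, the tokenize function, and the cut-down patterns are made up for illustration, and it works on an in-memory string instead of a file.

```perl
use strict;
use warnings;

# Hypothetical, simplified token table: [ type, regex ] pairs in priority order.
my @rules = (
    [ whitespace   => qr{ ( \s+ ) }xms ],
    [ terminal_tag => qr{ ( < / [a-z]+ > ) }xms ],
    [ begin_tag    => qr{ ( < [a-z]+ ) }xms ],
    [ end_tag      => qr{ ( /? > ) }xms ],
    [ attribute    => qr{ ( [a-z]+ = " [^"]* " ) }xms ],
    [ content      => qr{ ( [^<]+ ) }xms ],
);

# Split $text into tokens by repeatedly matching at the start of the string.
sub tokenize {
    my ($text) = @_;
    my @tokens;

    TOKEN:
    while ( length $text ) {
        for my $rule (@rules) {
            my ( $type, $regex ) = @{$rule};
            if ( $text =~ m/\A $regex /xms ) {
                my $token = $1;
                push @tokens, { type => $type, token => $token };
                $text = substr $text, length $token;    # consume the token
                next TOKEN;
            }
        }
        last TOKEN;    # nothing matched; stop instead of looping forever
    }

    return @tokens;
}

my $html   = '<p class="x">hi</p>';
my @tokens = tokenize($html);
printf "%s: {%s}\n", $_->{type}, $_->{token} for @tokens;
```

For this input the sketch should yield six tokens in this order: begin_tag, whitespace, attribute, end_tag, content, terminal_tag. Every token is cut from the front of the string exactly as matched, so joining the tokens back together reproduces the input.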
Here's a little program, which I like to call parser_tester.pl, that demonstrates this baby in action.
#!/usr/bin/perl
use strict;
use warnings;
{
    use lib qw( . );
    use Dox::Parser;
    use File::Slurp qw( write_file );
    use Term::ANSIColor qw( :constants );
}
my %filename = (
    before => 'document.html',
    after  => 'parsed_document.html',
);
my $document_text = "";
my $parser = Dox::Parser->new( $filename{before} );
while ( my $token_rh = $parser->next_token() ) {
    print "{" . $token_rh->{token} . "}";
    print GREEN, "(" . $token_rh->{type} . ")", RESET;
    print "\n";
    $document_text .= $token_rh->{token};
}
write_file( $filename{after}, $document_text );
print RED, BOLD, "\ncreated $filename{after}\n\n", RESET;
1;
When you run parser_tester.pl you'll get a listing of the parsed HTML tokens, each in curly brackets, followed by the token type in green (thanks to Term::ANSIColor). The program assumes you've got your sample HTML in document.html, and it writes the concatenated tokens to parsed_document.html. The two files ought to be identical, which shows the parser accounted for every character of the input. If they differ, the parser hit text that none of its patterns match: next_token returns nothing at that point, the loop ends, and the rest of the document is missing from parsed_document.html.
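If you'd rather not eyeball the two files, the comparison can be scripted. Here's a small sketch using the core File::Compare module; the files_identical helper and the messages it prints are my own invention for illustration, not part of Dox::Parser.

```perl
use strict;
use warnings;
use File::Compare qw( compare );

# Hypothetical helper: true when both files have identical contents.
# compare() returns 0 for equal files, 1 for different files, -1 on error.
sub files_identical {
    my ( $before, $after ) = @_;
    my $result = compare( $before, $after );
    die "cannot compare $before and $after: $!\n" if $result < 0;
    return $result == 0;
}

# Only run the check when both files from parser_tester.pl are present.
if ( -e 'document.html' && -e 'parsed_document.html' ) {
    print files_identical( 'document.html', 'parsed_document.html' )
        ? "round trip OK\n"
        : "round trip FAILED: some input was not tokenized\n";
}
```

Running `diff document.html parsed_document.html` from the shell does the same job and also shows where the output stops.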
Coming next: Dox::FSA -- a Finite State Automaton module which can be applied to make sense of the token stream so you can do correct and useful modifications to the HTML document.

