
Parsing HTML · 20 days ago by Dylan Doxey

Did you ever think to yourself, "I wish I could split this HTML document up into an array of tokens with descriptive keys"?

Well, it's occurred to me. So here's what I came up with.

package Dox::Parser;

use strict;
use warnings;
{
    use Carp;
    use File::Slurp qw( slurp );
}

my (%TYPE_REGEX_FOR,@TYPES);
{
    use Readonly;

    # HTML identifiers may have : or - such as xml:lang or http-equiv,
    # and digits after the first letter such as h1.
    # No recognition of mixed case HTML identifiers such as <Body> or <Title>.
    my $ident_re  = qr{ (?: [a-z][a-z0-9:-]* | [A-Z][A-Z0-9:-]* ) }xms;

    # Quoted strings may contain backslash escaped quotes
    my $q_str_re  = qr{ ' (?: [\\]['] | [^'] )* ' }xms;
    my $qq_str_re = qr{ " (?: [\\]["] | [^"] )* " }xms;

    Readonly %TYPE_REGEX_FOR => (
        terminal_tag   => qr{ ( < \s* / \s* $ident_re [^>]* > ) }xms,
        begin_tag      => qr{ ( < \s* $ident_re ) }xms,
        end_tag        => qr{ ( /? > ) }xms,
        template_code  => qr{ ( \[% \s* .+? \s* %\] ) }xms,
        open_comment   => qr{ ( <!-- ) }xms,
        close_comment  => qr{ ( --> ) }xms,
        open_doctype   => qr{ ( <!DOCTYPE ) \s }xms,
        attribute      => qr{ ( $ident_re \s*=\s* (?: $q_str_re | $qq_str_re ) ) [\s/>] }xms,
        html_word      => qr{ ( $ident_re ) \s }xms,
        quoted_string  => qr{ ( $q_str_re | $qq_str_re ) }xms,
        whitespace     => qr{ ( \s+ ) }xms,
        content        => qr{ ( [^<]+? ) (?: \[ [%] | [<] ) }xms,
    );

    # The priority of types in evaluating
    # the leading characters of the HTML string.
    Readonly @TYPES => qw(
        open_doctype
        open_comment
        close_comment
        whitespace
        terminal_tag
        begin_tag
        end_tag
        attribute
        html_word
        quoted_string
        template_code
        content
    );
}

sub new {
    my ($class,$filename) = @_;

    croak "can't find $filename\n"
        if !stat $filename;

    # Memory usage concerns? Sorry. :(
    my $html = slurp( $filename );

    my $self = bless {
        html       => $html,
        last_token => [],
    }, $class;

    return $self;
}

sub next_token {
    my $self = shift;

    my $html = $self->{html};

    for my $type (@TYPES) {

        my $regex = $TYPE_REGEX_FOR{$type};

        if ( $html =~ m/\A $regex /xms ) {

            my $token = $1;

            $self->{html} = substr $html, length $token;

            push @{ $self->{last_token} }, $token;

            return { type => $type, token => $token, };
        }
    }
    return;
}

sub push_back {
    my ($self,$token) = @_;

    $self->{html} = $token . $self->{html};

    return length $token;
}

1;

Here's a little program, which I like to call parser_tester.pl, which demonstrates this baby in action.

#!/usr/bin/perl

use strict;
use warnings;
{
    use lib qw( . );
    use Dox::Parser;
    use File::Slurp qw( write_file );
    use Term::ANSIColor qw( :constants );
}

my %filename = (
    before => 'document.html',
    after  => 'parsed_document.html',
);

my $document_text = "";
my $parser = Dox::Parser->new( $filename{before} );

while ( my $token_rh = $parser->next_token() ) {

    print "{" . $token_rh->{token} . "}";
    print GREEN, "(" . $token_rh->{type} . ")", RESET;
    print "\n";

    $document_text .= $token_rh->{token};
}

write_file( $filename{after}, $document_text );

print RED, BOLD, "\ncreated $filename{after}\n\n", RESET;

1;

When you run parser_tester.pl you'll get an enumeration of the parsed HTML tokens, each in curly brackets, and the named token type in green (thanks to Term::ANSIColor). This program assumes you've got your sample HTML in document.html, and it will subsequently create parsed_document.html. The two files ought to be identical, which indicates the parser successfully identified all of the tokens and didn't forget anything.
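
If you don't feel like eyeballing the two files, the comparison can be automated. Here's a minimal sketch using the core File::Compare module. The round_trip_ok helper is a made-up name for illustration and is not part of Dox::Parser; the file names are the ones parser_tester.pl uses.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use File::Compare qw( compare );

# Hypothetical helper (not part of Dox::Parser): true when the two
# files are byte-for-byte identical.
sub round_trip_ok {
    my ($before, $after) = @_;

    # compare() returns 0 when the files are equal, 1 when they differ,
    # and -1 when something went wrong (e.g. a file can't be opened).
    my $result = compare( $before, $after );

    die "can't compare $before and $after: $!\n"
        if $result < 0;

    return $result == 0;
}

# Same file names parser_tester.pl reads and writes.
my ($before, $after) = ( 'document.html', 'parsed_document.html' );

if ( -e $before && -e $after ) {
    print round_trip_ok( $before, $after )
        ? "identical: no tokens were lost\n"
        : "different: the parser dropped or mangled something\n";
}
```

Run it in the same directory right after parser_tester.pl. If the files differ, the last token printed by parser_tester.pl is a good place to start looking.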


Coming next: Dox::FSA -- a Finite State Automaton module which can be applied to make sense of the token stream so you can do correct and useful modifications to the HTML document.

Knowing Your File System · 35 days ago by Dylan Doxey

For a quick assessment of your drive space distribution and usage, use the df command.

dylan@dev.doxey.org$: ~ df -h
Filesystem        Size  Used Avail Use% Mounted on
/dev/mapper/root   15G   14G  661M  96% /
varrun            2.0G   56K  2.0G   1% /var/run
varlock           2.0G     0  2.0G   0% /var/lock
udev              2.0G   40K  2.0G   1% /dev
devshm            2.0G     0  2.0G   0% /dev/shm
/dev/sda1         237M   24M  201M  11% /boot
/dev/mapper/home  440G  7.8G  410G   2% /home

The -h switch indicates human readable mode.
Gosh, looks like I ought to move some of my junk under /home.


For a more granular display of where the bulk of your stuff is, use the du command.

dylan@dev.doxey.org$: ~ du -h --max-depth=1
52K	./.subversion
7.6G	./rep
9.5M	./sandbox
45M	./.cpan
4.0K	./.gnupg
64K	./bin
128K	./.vim
7.7G	.

Again, the -h switch gives you the easier to read numeric values.
The --max-depth option lets you control the depth of the display. The default is unlimited depth.
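
If you'd rather see the biggest directories first, here's a minimal sketch in Perl. It assumes du is run with -k instead of -h, so each line is a size in kilobytes, a tab, and then the path. The sort_du_lines helper and the du_sorted.pl file name are made up for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: takes lines of `du -k` output ("SIZE<TAB>PATH")
# and returns [ size_kb, path ] pairs, largest first.
sub sort_du_lines {
    my @lines = @_;

    my @rows;
    for my $line (@lines) {
        chomp $line;

        # Skip anything that doesn't look like du output.
        next if $line !~ m/\A (\d+) \t (.+) \z/xms;

        push @rows, [ $1, $2 ];
    }

    my @sorted = sort { $b->[0] <=> $a->[0] } @rows;

    return @sorted;
}

# Only read STDIN when something is piped in.
if ( !-t STDIN ) {
    for my $row ( sort_du_lines(<STDIN>) ) {
        printf "%10d KB  %s\n", @{$row};
    }
}
```

Usage would look something like this:

du -k --max-depth=1 | perl du_sorted.pl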

My Other xorg.conf · 67 days ago by Dylan Doxey

This is the xorg.conf from my workstation at home.

# xorg.conf (X.Org X Window System server configuration file)
#
# You should use dexconf or another such tool for creating a "real" xorg.conf
# For example:
#   sudo dpkg-reconfigure -phigh xserver-xorg

Section "Module"
	Load        "glx"
	Load        "v4l"
EndSection

Section "Monitor"
	Identifier  "Primary Monitor"
	Vendorname  "BenQ"
	ModelName   "FP202W"
	HorizSync   31-81
	VertRefresh 56-76
	Option      "DPMS"
	UseModes    "BenQ Modes"
	Gamma       1.0
EndSection

Section "Monitor"
	Identifier  "Secondary Monitor"
	Vendorname  "BenQ"
	ModelName   "FP202W"
	HorizSync   31-81
	VertRefresh 56-76
	Option      "DPMS"
	UseModes    "BenQ Modes"
	Gamma       1.0
EndSection

Section "Screen"
	Identifier    "Primary Screen"
	Device        "nVidia GeForce"
	Monitor       "Primary Monitor"
	DefaultDepth	24
	SubSection "Display"
		Depth    24
		Modes    "1680x1050" "1600x1024" "1600x1000" "1400x1050" "1280x1024" "1440x900" "1280x960" "1366x768" "1280x800" "1152x864" "1280x768" "1024x768" "1280x600" "1024x600" "800x600" "768x576" "640x480"
	EndSubSection
EndSection

Section "Screen"
	Identifier    "Secondary Screen"
	Device        "nVidia GeForce"
	Monitor       "Secondary Monitor"
	Defaultdepth  24
	SubSection "Display"
		Depth    24
		Modes    "1680x1050" "1600x1024" "1600x1000" "1400x1050" "1280x1024" "1440x900" "1280x960" "1366x768" "1280x800" "1152x864" "1280x768" "1024x768" "1280x600" "1024x600" "800x600" "768x576" "640x480"
	EndSubSection
EndSection

Section "Device"
	Identifier "nVidia GeForce"
	Boardname  "nVidia GeForce 7 Series"
	BusId      "PCI:02:00:0"
	Screen     0
	Vendorname "NVIDIA"
	Option     "TwinView"            "true"
	Option     "MetaModes"           "1680x1050,1680x1050"
	Option     "HorizSync"           "DFP-0: 31-81;  DFP-1: 31-81"
	Option     "VertRefresh"         "DFP-0: 56-76;  DFP-1: 56-76"
	Option     "TwinViewOrientation" "DFP-1 LeftOf DFP-0"
	Option     "ConnectedMonitor"    "DFP-0,DFP-1"
	Driver     "nvidia"
	Option     "NoLogo"             "True"
EndSection

Section "ServerLayout"
	Identifier "Default Layout"
	Screen     0 "Primary Screen" 0 0
	Screen     1 "Secondary Screen" RightOf "Primary Screen"
EndSection

Section "ServerFlags"
	Option "DefaultServerLayout" "Default Layout"
	Option "Xinerama"            "false"
EndSection

Section "Modes"
	Identifier "BenQ Modes"
	Modeline	"1680x1050" 119.00 1680 1728 1760 1840 1050 1053 1059 1080
	Modeline	"1680x1050" 184.27 1680 1792 1976 2272 1050 1051 1054 1096
	Modeline	"1680x1050" 181.61 1680 1792 1976 2272 1050 1051 1054 1095
	Modeline	"1680x1050" 178.96 1680 1792 1976 2272 1050 1051 1054 1094
	Modeline	"1600x1024" 171.97 1600 1712 1888 2176 1024 1025 1028 1068
	Modeline	"1600x1024" 168.40 1600 1704 1880 2160 1024 1025 1028 1068
	Modeline	"1600x1024" 165.94 1600 1704 1880 2160 1024 1025 1028 1067
	Modeline	"1600x1000" 166.71 1600 1704 1880 2160 1000 1001 1004 1043
	Modeline	"1600x1000" 164.46 1600 1704 1880 2160 1000 1001 1004 1043
	Modeline	"1600x1000" 162.05 1600 1704 1880 2160 1000 1001 1004 1042
	Modeline	"1400x1050" 153.77 1400 1496 1648 1896 1050 1051 1054 1096
	Modeline	"1400x1050" 151.56 1400 1496 1648 1896 1050 1051 1054 1095
	Modeline	"1400x1050" 149.34 1400 1496 1648 1896 1050 1051 1054 1094
	Modeline	"1280x1024" 136.57 1280 1368 1504 1728 1024 1025 1028 1068
	Modeline	"1280x1024" 134.72 1280 1368 1504 1728 1024 1025 1028 1068
	Modeline	"1280x1024" 132.75 1280 1368 1504 1728 1024 1025 1028 1067
	Modeline	"1440x900" 134.52 1440 1536 1688 1936 900 901 904 939
	Modeline	"1440x900" 132.71 1440 1536 1688 1936 900 901 904 939
	Modeline	"1440x900" 130.75 1440 1536 1688 1936 900 901 904 938
	Modeline	"1280x960" 128.13 1280 1368 1504 1728 960 961 964 1002
	Modeline	"1280x960" 126.27 1280 1368 1504 1728 960 961 964 1001
	Modeline	"1280x960" 124.54 1280 1368 1504 1728 960 961 964 1001
	Modeline	"1366x768" 107.78 1368 1448 1592 1816 768 769 772 802
	Modeline	"1366x768" 106.19 1368 1448 1592 1816 768 769 772 801
	Modeline	"1366x768" 104.73 1368 1448 1592 1816 768 769 772 801
	Modeline	"1280x800" 105.78 1280 1360 1496 1712 800 801 804 835
	Modeline	"1280x800" 104.35 1280 1360 1496 1712 800 801 804 835
	Modeline	"1280x800" 102.80 1280 1360 1496 1712 800 801 804 834
	Modeline	"1152x864" 103.59 1152 1224 1352 1552 864 865 868 902
	Modeline	"1152x864" 102.08 1152 1224 1352 1552 864 865 868 901
	Modeline	"1152x864" 99.64 1152 1224 1344 1536 864 865 868 901
	Modeline	"1280x768" 101.60 1280 1360 1496 1712 768 769 772 802
	Modeline	"1280x768" 99.17 1280 1352 1488 1696 768 769 772 801
	Modeline	"1280x768" 97.81 1280 1352 1488 1696 768 769 772 801
	Modeline	"1024x768" 80.71 1024 1080 1192 1360 768 769 772 802
	Modeline	"1024x768" 79.52 1024 1080 1192 1360 768 769 772 801
	Modeline	"1024x768" 78.43 1024 1080 1192 1360 768 769 772 801
	Modeline	"1280x600" 77.82 1280 1344 1480 1680 600 601 604 626
	Modeline	"1280x600" 76.04 1280 1336 1472 1664 600 601 604 626
	Modeline	"1280x600" 75.00 1280 1336 1472 1664 600 601 604 626
	Modeline	"1024x600" 62.26 1024 1080 1184 1344 600 601 604 626
	Modeline	"1024x600" 61.42 1024 1080 1184 1344 600 601 604 626
	Modeline	"1024x600" 59.86 1024 1072 1176 1328 600 601 604 626
	Modeline	"800x600" 48.18 800 840 920 1040 600 601 604 626
	Modeline	"800x600" 47.53 800 840 920 1040 600 601 604 626
	Modeline	"800x600" 46.87 800 840 920 1040 600 601 604 626
	Modeline	"768x576" 44.83 768 808 888 1008 576 577 580 601
	Modeline	"768x576" 43.52 768 800 880 992 576 577 580 601
	Modeline	"768x576" 42.93 768 800 880 992 576 577 580 601
	Modeline	"640x480" 30.25 640 664 728 816 480 481 484 501
	Modeline	"640x480" 29.84 640 664 728 816 480 481 484 501
	Modeline	"640x480" 29.43 640 664 728 816 480 481 484 501
EndSection

Jaunty Jackalope Coming Soon · 72 days ago by Dylan Doxey

Countdown to Xubuntu (xubuntu.org) 9.04, by Pasi Lallinaho

I did a fresh install on my Acer TravelMate 4200.

My first impressions:

More to come as the impulses to write about it come.

My latest xorg.conf · 79 days ago by Dylan Doxey

I had to basically write this from scratch when I recently did a fresh Kubuntu install. Let's make sure that doesn't happen again.


#  xorg.conf

Section "Module"
        Load    "glx"
        Load    "kbd"
        Load    "mouse"
EndSection             

Section "InputDevice"
        Identifier      "The Keyboard"
        Driver          "kbd"
        Option          "CoreKeyboard"
        Option          "XkbRules" "xorg"
        Option          "XkbModel" "pc105"
        Option          "XkbLayout" "us"
EndSection

Section "InputDevice"
        Identifier      "The Mouse"
        Driver          "mouse"
        Option          "CorePointer"
EndSection

Section "Monitor"
        Identifier      "Left Monitor"
        VendorName      "DELL"
        ModelName       "DELL 1907FP"
        HorizSync       30.0 - 81.0
        VertRefresh     56.0 - 76.0
EndSection                                                                                                                                                                                                      

Section "Monitor"
        Identifier      "Right Monitor"
        VendorName      "DELL"         
        ModelName       "DELL 1907FP"  
        HorizSync       30.0 - 81.0    
        VertRefresh     56.0 - 76.0    
EndSection                             

Section "Screen"
        Identifier      "Right Screen"
        Monitor         "Right Monitor"
        Device          "Video Card A"
        Option          "TwinView" "True"
        DefaultDepth    24               
EndSection                               

Section "Screen"
        Identifier      "Left Screen"
        Monitor         "Left Monitor"
        Device          "Video Card B" 
        Option          "TwinView" "True"
EndSection                               

Section "Device"
        Identifier      "Video Card A"
        BusID           "PCI:1:0:0"
        Screen          0
        VendorName      "nVidia Corporation"
        BoardName       "GeForce 7300 LE"
        Driver  "nvidia"
        Option  "NoLogo"        "True"
EndSection

Section "Device"
        Identifier      "Video Card B"
        BusID           "PCI:1:0:0"
        Screen          1
        VendorName      "nVidia Corporation"
        BoardName       "GeForce 7300 LE"
        Driver  "nvidia"
        Option  "NoLogo"        "True"
EndSection

Section "ServerLayout"
        Identifier      "Default Layout"
        Screen          0 "Right Screen"
        Screen          1 "Left Screen" LeftOf "Right Screen"
        InputDevice     "The Keyboard"  "CoreKeyboard"
        InputDevice     "The Mouse"     "CorePointer"
EndSection

Section "ServerFlags"
        Option                  "Xinerama"      "0"
        DefaultServerLayout     "Default Layout"
EndSection

Charlie Rose · 108 days ago by Dylan Doxey

I've taken quite a liking to watching Charlie Rose.

A conversation with Reid Hoffman of LinkedIn
http://www.charlierose.com/view/interview/10128

A conversation with Marissa Mayer, V.P. of Search Product and User Experience, Google
http://www.charlierose.com/view/interview/10129

A conversation with entrepreneur and software engineer Marc Andreessen
http://www.charlierose.com/view/interview/10093

A conversation with Evan Williams, Co-founder of Twitter.com
http://www.charlierose.com/view/interview/10118

A conversation with Jen-Hsun Huang, CEO Nvidia
http://www.charlierose.com/view/interview/10060

A conversation with Chris DeWolfe And Tom Anderson, founders of Myspace.com
http://www.charlierose.com/view/interview/10054

A conversation with Arianna Huffington
http://www.charlierose.com/view/interview/9705

A conversation with Eric Schmidt, CEO of Google
http://www.charlierose.com/view/interview/10131

A conversation with Jeff Bezos, Amazon.com
http://www.charlierose.com/view/interview/10105

Like ZipRealty.com... almost? · 197 days ago by Dylan Doxey

I like ziprealty.com because it's not trying to impress you with all the whistles, bells and gadgets.

You can set some criteria, search, and look at results.

But I want more. Well, actually I want less. I want fewer things distracting me and cluttering up the page.

So I wrote this GreaseMonkey script to help out.

ziprealty_gmaps.user.js

Features:

ziprealty.com on greasemonkey

ZipRealty gives you enough information that you can nearly do your agent's job for him/her. And why not? I enjoy browsing the houses and composing my weekend visit lists.

Happy house shopping!

Extended Characters in Your JavaScript · 201 days ago by Dylan Doxey

So, there you are happily coding away on your web application for your Japanese audience. You think you've buttoned it all up, and you'll go ahead and give it a courtesy end user test before you launch it, just to be sure.

And there it is, the dreaded corrupt CJK characters in your JavaScript.

[Screenshot: corrupt Japanese text in JavaScript]

Surely you were going for something more like this.

[Screenshot: clear Japanese text in JavaScript]

Do not fret! There is a reliable solution.

Generally you might be inclined to do this:

    var characters = prompt( 'こんにちは、世界的', '' );

And why the heck not?

This solution is fine, provided there is no confusion about character encoding anywhere between your text editor and the web client's browser software.
This confusion could arise in a number of places. To mention a few:

If at any point in this chain of custody something makes an assumption about the encoding, then your wide characters may become corrupt.

The solution is to use the JavaScript Unicode escaped version of the characters that go beyond the ASCII range.

    var characters = prompt( '\u3053\u3093\u306b\u3061\u306f\u3001\u4e16\u754c\u7684', '' );

Sweet! Problem solved. Now we can all go back to our stations and carry on, edified by this new insight!

Well, not quite.

Who really knows the Unicode values of the CJK text they're working with? Surely, no one.

Yes, that's right, it's another opportunity to write some code.

#!/usr/bin/perl

use strict;
use Encode qw( decode_utf8 );

if ( !@ARGV ) {
    print "\nUsage:\n  $0 some string to encode\n\n";
    exit; 
}   

my $js_encoded = "";

my $string = decode_utf8( join ' ', @ARGV );

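for my $char (split //, $string) {

    my $code = ord $char;

    # JavaScript \u escapes take exactly four hex digits, so characters
    # beyond U+FFFF are written as a UTF-16 surrogate pair.
    if ( $code > 0xFFFF ) {
        $code -= 0x10000;
        $js_encoded .= sprintf '\u%04x\u%04x',
            0xD800 + ( $code >> 10 ), 0xDC00 + ( $code & 0x3FF );
        next;
    }

    $js_encoded .= sprintf '\u%04x', $code;
}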

print "\nJS Encoded:\n";
print "    $js_encoded\n";

And there you have it. Now all you need to do is run each snippet of CJK text you want to include in your JavaScript through this program.

dylan@dev.doxey.org$: ~ ./js_encode.pl こんにちは、世界的

JS Encoded:
    \u3053\u3093\u306b\u3061\u306f\u3001\u4e16\u754c\u7684

Happy computing!

Filtering Columns · 203 days ago by Dylan Doxey

I love grep, sed, and especially awk.

However, I find that I generally only use awk for one thing -- filtering by columns.

Suppose I want to see the login names and their home directories for the users on my system.

cat /etc/passwd | awk -F: '{print $1" "$6}'

Beautiful. Works perfect. Who could ask for more?

Me.

I really like quotation marks and curly brackets. I really do. They make code look awesome. They're also good exercise for my keyboarding skills. But they're a little awkward to type, over and over again.

Yes, it's a little tiring typing out all that awesome looking code. Sometimes I wish my command lining around could go just a little more efficiently.

So, how about we make code written in modules and scripts really awesome looking with sweet punctuation marks and stuff, and then the code we're slinging at the command prompt be a little more streamlined?

Here's what I have in mind as an alternative to the sweet awk program above.

cat /etc/passwd | cols -F: 1 6

Yep, I guess that completes the design phase of this project.


I think it should go a little something like this.

#!/usr/bin/perl

use strict;

my $usage = "\nUsage:\n  \$ $0 [--first|--last|n|...] [-Fx]\n\n"
    . "Where n is the 1 indexed column number and x "
    . "is some character (or Perl regex) to split on.\n"
    . "You may specify multiple columns.\n"
    . "If not specified, n is 1 and x is \\s.\n\n";

my $split_regex;
my @cols;

# Note, if an argument is 0, then the arg reading loop ends.
ARG:
while ( my $arg = shift @ARGV ) {
    
    if ( $arg =~ m/\A -F \s* (.+)? \z/msx ) {
        $arg = $1 ? $1 : shift @ARGV;
        $split_regex = qr/$arg/;
        next ARG;
    }
    
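    # Map the named columns onto 1 indexed values. The decrement below
    # turns 1 into index 0 (first) and 0 into index -1 (last).
    $arg =~ s/\A --first \z/1/msx;
    $arg =~ s/\A --last  \z/0/msx;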
    
    if ( $arg eq '--help' || $arg !~ m/\A -? \d+ \z/msx ) {
        print $usage;
        exit;
    }
    
    # reduce to zero index
    $arg--;
    
    push @cols, $arg;
}

# Default split regex
$split_regex = qr/\s/
    if !$split_regex;

# Default index
@cols = (0) 
    if !@cols;

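# Do the work, one line at a time
while ( my $line = <STDIN> ) {

    # Drop the newline so the last field doesn't carry it into the output.
    chomp $line;

    print '' . ( join ' ', ( split $split_regex, $line )[@cols] ) . "\n";
}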
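
The heart of the thing is the array slice, so here's that idea pulled out into a small sketch you can poke at on its own. The pick_columns helper is a made-up name for illustration; it takes 1 indexed column numbers, subtracts one from each, and leans on Perl's negative indices to count from the end. In this sketch, passing 0 therefore selects the last field.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper mirroring the slice in the script above:
# @cols holds 1 indexed column numbers, and 0 stands for the last column.
sub pick_columns {
    my ($line, $split_regex, @cols) = @_;

    chomp $line;

    my @fields = split $split_regex, $line;

    # 1 becomes index 0 (first field); 0 becomes index -1 (last field).
    my @picked = @fields[ map { $_ - 1 } @cols ];

    # Columns past the end of the line come back undefined; show them as empty.
    return join ' ', map { defined $_ ? $_ : '' } @picked;
}

my $passwd_line = "root:x:0:0:root:/root:/bin/bash\n";

print pick_columns( $passwd_line, qr/:/, 1, 6 ), "\n";    # root /root
print pick_columns( $passwd_line, qr/:/, 0 ), "\n";       # /bin/bash
```

Asking for a column that isn't there just gives you an empty string, which seems friendlier than a warning when the input is ragged.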

Use the force!

Sony IFX-125 Memory Stick slot and Board - A-8066-718-A · 214 days ago by Dylan Doxey

I just pulled this "Sony IFX-125 Memory Stick slot and Board - A-8066-718-A" out of my old Vaio SR33K. I never used it, and it just takes up space in the case.

[Photo: Sony IFX-125 Memory Stick slot and Board - A-8066-718-A]

I thought to myself, "Self, maybe someone really wants one of these."
