Parsing HTML · 20 days ago by Dylan Doxey
Did you ever think to yourself, "I wish I could split this HTML document up into an array of tokens with descriptive keys"?
Well, it's occurred to me. So here's what I came up with.
package Dox::Parser;
use strict;
use warnings;
{
    use Carp;
    use File::Slurp qw( slurp );
}
my ( %TYPE_REGEX_FOR, @TYPES );
{
    use Readonly;
    # HTML identifiers may have : or - such as xml:lang or http-equiv.
    # No recognition of mixed case HTML identifiers such as <Body> or <Title>.
    my $ident_re = qr{ (?: [a-z:-]+ | [A-Z:-]+ ) }xms;
    # Quoted strings may contain backslash escaped quotes
    my $q_str_re  = qr{ ' (?: [\\]['] | [^'] )* ' }xms;
    my $qq_str_re = qr{ " (?: [\\]["] | [^"] )* " }xms;
    Readonly %TYPE_REGEX_FOR => (
        terminal_tag  => qr{ ( < \s* / \s* $ident_re [^>]* > ) }xms,
        begin_tag     => qr{ ( < \s* $ident_re ) }xms,
        end_tag       => qr{ ( /? > ) }xms,
        template_code => qr{ ( \[% \s* .+? \s* %\] ) }xms,
        open_comment  => qr{ ( <!-- ) }xms,
        close_comment => qr{ ( --> ) }xms,
        open_doctype  => qr{ ( <!DOCTYPE ) \s }xms,
        attribute     => qr{ ( $ident_re \s*=\s* (?: $q_str_re | $qq_str_re ) ) [\s/>] }xms,
        html_word     => qr{ ( $ident_re ) \s }xms,
        quoted_string => qr{ ( $q_str_re | $qq_str_re ) }xms,
        whitespace    => qr{ ( \s+ ) }xms,
        # Content runs up to the next tag or template block, or to the
        # end of the input so that trailing text is not dropped.
        content       => qr{ ( [^<]+? ) (?: \[ [%] | [<] | \z ) }xms,
    );
    # The priority of types in evaluating
    # the leading characters of the HTML string.
    Readonly @TYPES => qw(
        open_doctype
        open_comment
        close_comment
        whitespace
        terminal_tag
        begin_tag
        end_tag
        attribute
        html_word
        quoted_string
        template_code
        content
    );
}
sub new {
    my ( $class, $filename ) = @_;
    croak "can't find $filename\n"
        if !stat $filename;
    # Memory usage concerns? Sorry. :(
    my $html = slurp( $filename );
    my $self = bless {
        html       => $html,
        last_token => [],
    }, $class;
    return $self;
}
sub next_token {
    my $self = shift;
    my $html = $self->{html};
    # Try each type in priority order against the front of the string.
    for my $type (@TYPES) {
        my $regex = $TYPE_REGEX_FOR{$type};
        if ( $html =~ m/\A $regex /xms ) {
            my $token = $1;
            $self->{html} = substr $html, length $token;
            push @{ $self->{last_token} }, $token;
            return { type => $type, token => $token, };
        }
    }
    return;
}
sub push_back {
    my ( $self, $token ) = @_;
    $self->{html} = $token . $self->{html};
    return length $token;
}
1;
Here's a little program I like to call parser_tester.pl, which demonstrates this baby in action.
#!/usr/bin/perl
use strict;
use warnings;
{
    use lib qw( . );
    use Dox::Parser;
    use File::Slurp qw( write_file );
    use Term::ANSIColor qw( :constants );
}
my %filename = (
    before => 'document.html',
    after  => 'parsed_document.html',
);
my $document_text = "";
my $parser = Dox::Parser->new( $filename{before} );
while ( my $token_rh = $parser->next_token() ) {
    print "{" . $token_rh->{token} . "}";
    print GREEN, "(" . $token_rh->{type} . ")", RESET;
    print "\n";
    $document_text .= $token_rh->{token};
}
write_file( $filename{after}, $document_text );
print RED, BOLD, "\ncreated $filename{after}\n\n", RESET;
1;
When you run parser_tester.pl you'll get an enumeration of the parsed HTML tokens, each in curly brackets, and the named token type in green (thanks to Term::ANSIColor). This program assumes you've got your sample HTML in document.html, and it will subsequently create parsed_document.html. The two files ought to be identical, which indicates the parser successfully identified all of the tokens and didn't forget anything.
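The round-trip idea is easy to try on a plain string before involving any files. What follows is only a sketch, not Dox::Parser itself: the three rules and the tokenize function are simplified stand-ins made up for illustration, but the technique is the same one the module uses, namely trying a prioritized list of regexes anchored at the front of the remaining text and chopping off whatever matched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified stand-in for the module's type table: [ type, regex ] pairs
# in priority order. These three patterns are illustrative assumptions.
my @rules = (
    [ whitespace => qr{ ( \s+ ) }xms ],
    [ tag        => qr{ ( < [^>]* > ) }xms ],
    [ content    => qr{ ( [^<]+ ) }xms ],
);

# Consume the string from the front, returning { type, token } hashes.
sub tokenize {
    my ($html) = @_;
    my @tokens;
    TOKEN:
    while ( length $html ) {
        for my $rule (@rules) {
            my ( $type, $regex ) = @{$rule};
            if ( $html =~ m/\A $regex /xms ) {
                my $token = $1;
                push @tokens, { type => $type, token => $token };
                $html = substr $html, length $token;
                next TOKEN;
            }
        }
        last TOKEN;    # nothing matched; stop rather than loop forever
    }
    return @tokens;
}

my $input   = qq{<p class="x">Hello <b>world</b></p>\n};
my @tokens  = tokenize($input);
my $rebuilt = join '', map { $_->{token} } @tokens;

printf "%-10s {%s}\n", $_->{type}, $_->{token} for @tokens;
print $rebuilt eq $input ? "round trip ok\n" : "round trip FAILED\n";
```

If the rebuilt string ever differs from the input, some stretch of text fell through every rule, which is exactly what comparing document.html with parsed_document.html is meant to catch.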
Coming next: Dox::FSA -- a Finite State Automaton module which can be applied to make sense of the token stream so you can do correct and useful modifications to the HTML document.

Knowing Your File System · 35 days ago by Dylan Doxey
For a quick assessment of your drive space distribution and usage, use the df command.
dylan@dev.doxey.org$: ~ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/root       15G   14G  661M  96% /
varrun                2.0G   56K  2.0G   1% /var/run
varlock               2.0G     0  2.0G   0% /var/lock
udev                  2.0G   40K  2.0G   1% /dev
devshm                2.0G     0  2.0G   0% /dev/shm
/dev/sda1             237M   24M  201M  11% /boot
/dev/mapper/home      440G  7.8G  410G   2% /home
The -h switch indicates human-readable mode.
Gosh, looks like I ought to move some of my junk under /home.
For a more granular display of where the bulk of your stuff is, use the du command.
dylan@dev.doxey.org$: ~ du -h --max-depth=1
52K     ./.subversion
7.6G    ./rep
9.5M    ./sandbox
45M     ./.cpan
4.0K    ./.gnupg
64K     ./bin
128K    ./.vim
7.7G    .
Again, the -h switch gives you the easier to read numeric values.
The --max-depth option lets you control the depth of the display. The default is unlimited depth.
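To see at a glance which directory is the biggest, the human-readable sizes can be turned back into numbers and sorted. Here is a rough sketch of that idea; to_bytes is a made-up helper name, and the sample lines are copied from the du -h output above (minus the "." total) rather than read from a live system.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# du -h uses binary multiples: K = 1024, M = 1024**2, and so on.
my %multiplier_for = ( K => 1024, M => 1024**2, G => 1024**3, T => 1024**4 );

# Convert a du -h style size such as "7.6G" or "52K" to a byte count.
# Returns 0 for anything that does not look like a size.
sub to_bytes {
    my ($size) = @_;
    my ( $number, $unit ) = $size =~ m/\A ( [\d.]+ ) ( [KMGT]? ) \z/xms
        or return 0;
    return $number * ( $unit ? $multiplier_for{$unit} : 1 );
}

# Sample lines taken from the du -h listing in this post.
my @lines = (
    "52K\t./.subversion",
    "7.6G\t./rep",
    "9.5M\t./sandbox",
    "45M\t./.cpan",
    "4.0K\t./.gnupg",
    "64K\t./bin",
    "128K\t./.vim",
);

# Pair each line with its size in bytes, sort largest first, unwrap.
my @sorted =
    map  { $_->[1] }
    sort { $b->[0] <=> $a->[0] }
    map  { [ to_bytes( ( split /\s+/, $_ )[0] ), $_ ] } @lines;

print "$_\n" for @sorted;
```

Swapping the hard-coded list for lines read from STDIN would let you pipe real du -h output through it.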

My Other xorg.conf · 67 days ago by Dylan Doxey
This is the xorg.conf from my workstation at home.
# xorg.conf (X.Org X Window System server configuration file)
#
# You should use dexconf or another such tool for creating a "real" xorg.conf
# For example:
# sudo dpkg-reconfigure -phigh xserver-xorg
Section "Module"
Load "glx"
Load "v4l"
EndSection
Section "Monitor"
Identifier "Primary Monitor"
Vendorname "BenQ"
ModelName "FP202W"
HorizSync 31-81
VertRefresh 56-76
Option "DPMS"
UseModes "BenQ Modes"
Gamma 1.0
EndSection
Section "Monitor"
Identifier "Secondary Monitor"
Vendorname "BenQ"
ModelName "FP202W"
HorizSync 31-81
VertRefresh 56-76
Option "DPMS"
UseModes "BenQ Modes"
Gamma 1.0
EndSection
Section "Screen"
Identifier "Primary Screen"
Device "nVidia GeForce"
Monitor "Primary Monitor"
DefaultDepth 24
SubSection "Display"
Depth 24
Modes "1680x1050" "1600x1024" "1600x1000" "1400x1050" "1280x1024" "1440x900" "1280x960" "1366x768" "1280x800" "1152x864" "1280x768" "1024x768" "1280x600" "1024x600" "800x600" "768x576" "640x480"
EndSubSection
EndSection
Section "Screen"
Identifier "Secondary Screen"
Device "nVidia GeForce"
Monitor "Secondary Monitor"
Defaultdepth 24
SubSection "Display"
Depth 24
Modes "1680x1050" "1600x1024" "1600x1000" "1400x1050" "1280x1024" "1440x900" "1280x960" "1366x768" "1280x800" "1152x864" "1280x768" "1024x768" "1280x600" "1024x600" "800x600" "768x576" "640x480"
EndSubSection
EndSection
Section "Device"
Identifier "nVidia GeForce"
Boardname "nVidia GeForce 7 Series"
BusId "PCI:02:00:0"
Screen 0
Vendorname "NVIDIA"
Option "TwinView" "true"
Option "MetaModes" "1680x1050,1680x1050"
Option "HorizSync" "DFP-0: 31-81; DFP-1: 31-81"
Option "VertRefresh" "DFP-0: 56-76; DFP-1: 56-76"
Option "TwinViewOrientation" "DFP-1 LeftOf DFP-0"
Option "ConnectedMonitor" "DFP-0,DFP-1"
Driver "nvidia"
Option "NoLogo" "True"
EndSection
Section "ServerLayout"
Identifier "Default Layout"
Screen 0 "Primary Screen" 0 0
Screen 1 "Secondary Screen" RightOf "Primary Screen"
EndSection
Section "ServerFlags"
Option "DefaultServerLayout" "Default Layout"
Option "Xinerama" "false"
EndSection
Section "Modes"
Identifier "BenQ Modes"
Modeline "1680x1050" 119.00 1680 1728 1760 1840 1050 1053 1059 1080
Modeline "1680x1050" 184.27 1680 1792 1976 2272 1050 1051 1054 1096
Modeline "1680x1050" 181.61 1680 1792 1976 2272 1050 1051 1054 1095
Modeline "1680x1050" 178.96 1680 1792 1976 2272 1050 1051 1054 1094
Modeline "1600x1024" 171.97 1600 1712 1888 2176 1024 1025 1028 1068
Modeline "1600x1024" 168.40 1600 1704 1880 2160 1024 1025 1028 1068
Modeline "1600x1024" 165.94 1600 1704 1880 2160 1024 1025 1028 1067
Modeline "1600x1000" 166.71 1600 1704 1880 2160 1000 1001 1004 1043
Modeline "1600x1000" 164.46 1600 1704 1880 2160 1000 1001 1004 1043
Modeline "1600x1000" 162.05 1600 1704 1880 2160 1000 1001 1004 1042
Modeline "1400x1050" 153.77 1400 1496 1648 1896 1050 1051 1054 1096
Modeline "1400x1050" 151.56 1400 1496 1648 1896 1050 1051 1054 1095
Modeline "1400x1050" 149.34 1400 1496 1648 1896 1050 1051 1054 1094
Modeline "1280x1024" 136.57 1280 1368 1504 1728 1024 1025 1028 1068
Modeline "1280x1024" 134.72 1280 1368 1504 1728 1024 1025 1028 1068
Modeline "1280x1024" 132.75 1280 1368 1504 1728 1024 1025 1028 1067
Modeline "1440x900" 134.52 1440 1536 1688 1936 900 901 904 939
Modeline "1440x900" 132.71 1440 1536 1688 1936 900 901 904 939
Modeline "1440x900" 130.75 1440 1536 1688 1936 900 901 904 938
Modeline "1280x960" 128.13 1280 1368 1504 1728 960 961 964 1002
Modeline "1280x960" 126.27 1280 1368 1504 1728 960 961 964 1001
Modeline "1280x960" 124.54 1280 1368 1504 1728 960 961 964 1001
Modeline "1366x768" 107.78 1368 1448 1592 1816 768 769 772 802
Modeline "1366x768" 106.19 1368 1448 1592 1816 768 769 772 801
Modeline "1366x768" 104.73 1368 1448 1592 1816 768 769 772 801
Modeline "1280x800" 105.78 1280 1360 1496 1712 800 801 804 835
Modeline "1280x800" 104.35 1280 1360 1496 1712 800 801 804 835
Modeline "1280x800" 102.80 1280 1360 1496 1712 800 801 804 834
Modeline "1152x864" 103.59 1152 1224 1352 1552 864 865 868 902
Modeline "1152x864" 102.08 1152 1224 1352 1552 864 865 868 901
Modeline "1152x864" 99.64 1152 1224 1344 1536 864 865 868 901
Modeline "1280x768" 101.60 1280 1360 1496 1712 768 769 772 802
Modeline "1280x768" 99.17 1280 1352 1488 1696 768 769 772 801
Modeline "1280x768" 97.81 1280 1352 1488 1696 768 769 772 801
Modeline "1024x768" 80.71 1024 1080 1192 1360 768 769 772 802
Modeline "1024x768" 79.52 1024 1080 1192 1360 768 769 772 801
Modeline "1024x768" 78.43 1024 1080 1192 1360 768 769 772 801
Modeline "1280x600" 77.82 1280 1344 1480 1680 600 601 604 626
Modeline "1280x600" 76.04 1280 1336 1472 1664 600 601 604 626
Modeline "1280x600" 75.00 1280 1336 1472 1664 600 601 604 626
Modeline "1024x600" 62.26 1024 1080 1184 1344 600 601 604 626
Modeline "1024x600" 61.42 1024 1080 1184 1344 600 601 604 626
Modeline "1024x600" 59.86 1024 1072 1176 1328 600 601 604 626
Modeline "800x600" 48.18 800 840 920 1040 600 601 604 626
Modeline "800x600" 47.53 800 840 920 1040 600 601 604 626
Modeline "800x600" 46.87 800 840 920 1040 600 601 604 626
Modeline "768x576" 44.83 768 808 888 1008 576 577 580 601
Modeline "768x576" 43.52 768 800 880 992 576 577 580 601
Modeline "768x576" 42.93 768 800 880 992 576 577 580 601
Modeline "640x480" 30.25 640 664 728 816 480 481 484 501
Modeline "640x480" 29.84 640 664 728 816 480 481 484 501
Modeline "640x480" 29.43 640 664 728 816 480 481 484 501
EndSection

Jaunty Jackalope Coming Soon · 72 days ago by Dylan Doxey
I did a fresh install on my Acer TravelMate 4200.
My first impressions:
- very attractive default login page
- wireless manager is more refined and convenient than ever
- installing restricted codecs for mplayer is pretty convenient
More to come as the impulses to write about it come.

My latest xorg.conf · 79 days ago by Dylan Doxey
I had to basically write this from scratch when I recently did a fresh Kubuntu install. Let's make sure that doesn't happen again.
# xorg.conf
Section "Module"
Load "glx"
Load "kbd"
Load "mouse"
EndSection
Section "InputDevice"
Identifier "The Keyboard"
Driver "kbd"
Option "CoreKeyboard"
Option "XkbRules" "xorg"
Option "XkbModel" "pc105"
Option "XkbLayout" "us"
EndSection
Section "InputDevice"
Identifier "The Mouse"
Driver "mouse"
Option "CorePointer"
EndSection
Section "Monitor"
Identifier "Left Monitor"
VendorName "DELL"
ModelName "DELL 1907FP"
HorizSync 30.0 - 81.0
VertRefresh 56.0 - 76.0
EndSection
Section "Monitor"
Identifier "Right Monitor"
VendorName "DELL"
ModelName "DELL 1907FP"
HorizSync 30.0 - 81.0
VertRefresh 56.0 - 76.0
EndSection
Section "Screen"
Identifier "Right Screen"
Monitor "Right Monitor"
Device "Video Card A"
Option "TwinView" "True"
DefaultDepth 24
EndSection
Section "Screen"
Identifier "Left Screen"
Monitor "Left Monitor"
Device "Video Card B"
Option "TwinView" "True"
EndSection
Section "Device"
Identifier "Video Card A"
BusID "PCI:1:0:0"
Screen 0
VendorName "nVidia Corporation"
BoardName "GeForce 7300 LE"
Driver "nvidia"
Option "NoLogo" "True"
EndSection
Section "Device"
Identifier "Video Card B"
BusID "PCI:1:0:0"
Screen 1
VendorName "nVidia Corporation"
BoardName "GeForce 7300 LE"
Driver "nvidia"
Option "NoLogo" "True"
EndSection
Section "ServerLayout"
Identifier "Default Layout"
Screen 0 "Right Screen"
Screen 1 "Left Screen" LeftOf "Right Screen"
InputDevice "The Keyboard" "CoreKeyboard"
InputDevice "The Mouse" "CorePointer"
EndSection
Section "ServerFlags"
Option "Xinerama" "0"
Option "DefaultServerLayout" "Default Layout"
EndSection

Charlie Rose · 108 days ago by Dylan Doxey
I've taken quite a liking to watching Charlie Rose.
A conversation with Reid Hoffman of LinkedIn
http://www.charlierose.com/view/interview/10128
A conversation with Marissa Mayer, V.P. of Search Product and User Experience, Google
http://www.charlierose.com/view/interview/10129
A conversation with entrepreneur and software engineer Marc Andreessen
http://www.charlierose.com/view/interview/10093
A conversation with Evan Williams, Co-founder of Twitter.com
http://www.charlierose.com/view/interview/10118
A conversation with Jen-Hsun Huang, CEO Nvidia
http://www.charlierose.com/view/interview/10060
A conversation with Chris DeWolfe and Tom Anderson, founders of Myspace.com
http://www.charlierose.com/view/interview/10054
A conversation with Arianna Huffington
http://www.charlierose.com/view/interview/9705
A conversation with Eric Schmidt, CEO of Google
http://www.charlierose.com/view/interview/10131
A conversation with Jeff Bezos, Amazon.com
http://www.charlierose.com/view/interview/10105

Like ZipRealty.com... almost? · 197 days ago by Dylan Doxey
I like ziprealty.com because it's not trying to impress you with all the whistles, bells and gadgets.
You can set some criteria, search, and look at results.
But I want more. Well, actually I want less. I want fewer things distracting me and cluttering up the page.
So I wrote this GreaseMonkey script to help out.
Features:
- Strips nearly all superfluous text and up-sell promos.
- House detail links open in another window/tab.
- House address headers link to Google Maps on detail view.
- My Homes view has a tidy list suitable for cut & paste into an email for your agent.
ZipRealty gives you enough information that you can nearly do your agent's job for him/her. And why not? I enjoy browsing the houses and composing my weekend visit lists.
Happy house shopping!

Extended Characters in Your JavaScript · 201 days ago by Dylan Doxey
So, there you are happily coding away on your web application for your Japanese audience. You think you've buttoned it all up, and you'll go ahead and give it a courtesy end user test before you launch it, just to be sure.
And there they are, the dreaded corrupt CJK characters in your JavaScript.
Surely you were going for something more like this.
Do not fret! There is a reliable solution.
Generally you might be inclined to do this:
var characters = prompt( 'こんにちは、世界的', '' );
And why the heck not?
This solution is fine, provided there is no confusion about character encoding anywhere between your text editor and the web client's browser software.
This confusion could arise in a number of places. To mention a few:
- your code editing software
- your file transfer software
- your server file system
- your web server
- the client web browser
If at any point in this chain of custody something makes an assumption about the encoding, then your wide characters may become corrupt.
The solution is to use the JavaScript Unicode escaped version of the characters that go beyond the ASCII range.
var characters = prompt( '\u3053\u3093\u306b\u3061\u306f\u3001\u4e16\u754c\u7684', '' );
Sweet! Problem solved. Now we can all go back to our stations and continue having been edified with this new insight!
Well, not quite.
Who really knows the Unicode values of the CJK text they're working with? Surely, no one.
Yes, that's right, it's another opportunity to write some code.
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode_utf8 );
if ( !@ARGV ) {
    print "\nUsage:\n $0 some string to encode\n\n";
    exit;
}
my $js_encoded = "";
# Command line arguments arrive as UTF-8 bytes; decode them to characters.
my $string = decode_utf8( join ' ', @ARGV );
for my $char ( split //, $string ) {
    # Four zero-padded hex digits per character
    my $unicode = sprintf '%04x', ord $char;
    $js_encoded .= '\u' . $unicode;
}
print "\nJS Encoded:\n";
print " $js_encoded\n";
And there you have it. Now all you need to do is run each snippet of CJK text you want to include in your JavaScript through this program.
dylan@dev.doxey.org$: ~ ./js_encode.pl こんにちは、世界的
JS Encoded:
\u3053\u3093\u306b\u3061\u306f\u3001\u4e16\u754c\u7684
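One caveat worth knowing: the \uXXXX form holds exactly four hex digits, so a character above U+FFFF (many emoji, some rarer CJK ideographs) has to be written as a UTF-16 surrogate pair. Below is a sketch of a variant that handles that case and also lets printable ASCII pass through untouched. The helper name js_escape is made up for this example, and it expects an already decoded character string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape a decoded (character) string for use
# inside a JavaScript string literal.
sub js_escape {
    my ($string) = @_;
    my $escaped = '';
    for my $char ( split //, $string ) {
        my $code = ord $char;
        if (   $code >= 0x20
            && $code <= 0x7e
            && $char ne '\\'
            && $char ne q{'}
            && $char ne q{"} )
        {
            $escaped .= $char;    # printable ASCII passes through
        }
        elsif ( $code > 0xFFFF ) {
            # Beyond the Basic Multilingual Plane: emit a surrogate pair.
            my $offset = $code - 0x10000;
            $escaped .= sprintf '\u%04x\u%04x',
                0xD800 + ( $offset >> 10 ),
                0xDC00 + ( $offset & 0x3FF );
        }
        else {
            $escaped .= sprintf '\u%04x', $code;
        }
    }
    return $escaped;
}

print js_escape("Hi \x{3053}\x{3093}"), "\n";
print js_escape("\x{1F600}"),           "\n";
```

Quotes and backslashes are escaped too, so the result can be dropped between either kind of quote mark.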
Happy computing!

Filtering Columns · 203 days ago by Dylan Doxey
I love grep, sed, and especially awk.
However, I find that I generally only use awk for one thing -- filtering by columns.
Suppose I want to see the login names and their home directories for the users on my system.
cat /etc/passwd | awk -F: '{print $1" "$6}'
Beautiful. Works perfect. Who could ask for more?
Me.
I really like quotation marks and curly brackets. I really do. They make code look awesome. They're also good exercise for my keyboarding skills. But they're a little awkward to type, over and over again.
Yes, it's a little tiring typing out all that awesome looking code. Sometimes I wish my command lining around could go just a little more efficiently.
So, how about we make code written in modules and scripts really awesome looking with sweet punctuation marks and stuff, and then the code we're slinging at the command prompt be a little more streamlined?
Here's what I have in mind as an alternative to the sweet awk program above.
cat /etc/passwd | cols -F: 1 6
Yep, I guess that completes the design phase of this project.
I think it should go a little something like this.
#!/usr/bin/perl
use strict;
my $usage = "\nUsage:\n \$ $0 [--first|--last|n|...] [-Fx]\n\n"
    . "Where n is the 1 indexed column number and x "
    . "is some character (or Perl regex) to split on.\n"
    . "You may specify multiple columns.\n"
    . "If not specified, n is 1 and x is \\s.\n\n";
my $split_regex;
my @cols;
# Note, if an argument is 0, then the arg reading loop ends.
ARG:
while ( my $arg = shift @ARGV ) {
    if ( $arg =~ m/\A -F \s* (.+)? \z/msx ) {
        $arg = defined $1 ? $1 : shift @ARGV;
        $split_regex = qr/$arg/;
        next ARG;
    }
    # --first is column 1; --last is index -1, counting from the end
    $arg = 1  if $arg eq '--first';
    $arg = -1 if $arg eq '--last';
    if ( $arg eq '--help' || $arg !~ m/\A -? \d+ \z/msx ) {
        print $usage;
        exit;
    }
    # reduce positive column numbers to zero index;
    # negative numbers already count back from the last column
    $arg-- if $arg > 0;
    push @cols, $arg;
}
# Default split regex
$split_regex = qr/\s/
    if !$split_regex;
# Default index
@cols = (0)
    if !@cols;
# Do the work, one line at a time
while ( my $line = <STDIN> ) {
    chomp $line;    # keep the newline out of the last column
    print '' . ( join ' ', ( split $split_regex, $line )[@cols] ) . "\n";
}
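The whole trick is a list slice over the result of split. As a small sketch of just that step, assume positive column numbers are 1-indexed and negative ones count back from the end; the to_indexes helper and the sample passwd line are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical /etc/passwd style line, used only for illustration.
my $line = 'dylan:x:1000:1000:Dylan Doxey:/home/dylan:/bin/bash';

# Turn 1-indexed column numbers into Perl list indexes.
# Negative numbers are left alone so that -1 still means "last column".
sub to_indexes {
    return map { $_ > 0 ? $_ - 1 : $_ } @_;
}

my @fields = split /:/, $line;

my $name_and_home = join ' ', @fields[ to_indexes( 1, 6 ) ];
my $shell         = join ' ', @fields[ to_indexes(-1) ];

print "$name_and_home\n";    # dylan /home/dylan
print "$shell\n";            # /bin/bash
```

That is what cols -F: 1 6 is meant to print for such a line.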
Use the force!

Sony IFX-125 Memory Stick slot and Board - A-8066-718-A · 214 days ago by Dylan Doxey
I just pulled this "Sony IFX-125 Memory Stick slot and Board - A-8066-718-A" out of my old Vaio SR33K. I never used it, and it just takes up space in the case.
I thought to myself, "Self, maybe someone really wants one of these."

