PerlMonks

Possible HTML::TokeParser::Simple Bug

by swiftone (Curate)

on Jan 28, 2003 at 17:54 UTC (#230667)
swiftone has asked for the wisdom of the Perl Monks concerning the following question:

Is this an HTML::TokeParser::Simple bug I should send along to Ovid, or does the problem exist between chair and keyboard?

Summary of problem: Some text tokens are getting split into two tokens. A sample test case follows:

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple;

my $html = q(
<option value="STAFE">STAFE - 900 - BEN PROG - Food Assistance</option>
<option value="STAM7">STAM7 - 900 - BEN PROG - Med Asst - Lynchbrg</option>
<option value="STAM8">STAM8 - 900 - BEN PROG - Med Asst - Marion</option>
<option value="STAM9">STAM9 - 900 - BEN PROG - Med Asst - Petrsbrg</option>
<option value="STAMA">STAMA - 900 - BEN PROG - Medical Assistance</option>
<option value="STATA">STATA - 900 - BEN PROG - Economic Assistance</option>
<option value="STATR">STATR - 900 - BEN PROG - Training Development</option>);

my $p = HTML::TokeParser::Simple->new(\$html);
while(my $token = $p->get_token){
    if($token->is_text){
        my $text = $token->return_text;
        next unless $text =~ /\S/;
        print "[$text]\n";
    }
}
produces as output:
[STAFE - 900 - BEN PROG - Food Assistance]
[STAM7 - 900 - BEN PROG - Med Asst - Lynchbrg]
[STAM8 - 900 - BEN PROG - Med Asst - Marion]
[STAM9 - 900 - BEN PROG - Med Asst - Petrsbrg]
[STAMA - 900 - BEN PROG - Medical Assistance]
[STATA - 900 - BEN PROG - Economic Assistance]
[STATR - 900 - BEN PROG - Training]
[ Development]

I have no idea why the "Development" ends up on its own line. This is the smallest sample from my data that gave these results -- adding more data "moves" the problem, but the problem still exists. Taking out the space between "Training" and "Development" in the data makes the new compound word the one that goes to its own line.

It's acting as if some buffer length is interfering, but it isn't just the token length (making a longer text token will change the position of the problem, but it doesn't necessarily hit the longest token -- in fact, in this sample set, it continues to hit the last text token somewhere.)

I've skimmed the docs for HTML::Parser (I've used 3.25 and 3.26), HTML::TokeParser (2.24), and HTML::TokeParser::Simple (1.4). I've tried it on different boxes (one aging SuSE box with 5.6.0, one Debian unstable with 5.8.0). Any ideas?

Re: Possible HTML::TokeParser::Simple Bug
by Thelonius (Curate) on Jan 28, 2003 at 18:10 UTC
    Sure, it's processing the string in 512-byte chunks. See this paragraph in the HTML::Parser documentation:
    $p->unbroken_text( $bool )
    By default, blocks of text are given to the text handler as soon as possible (but the parser makes sure to always break text at the boundary between whitespace and non-whitespace so single words and entities always can be decoded safely). This might create breaks that make it hard to do transformations on the text. When this attribute is enabled, blocks of text are always reported in one piece. This will delay the text event until the following (non-text) event has been recognized by the parser.
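To see where that chunk boundary lands in the sample from the question, here is a rough sketch that uses only core Perl. It rebuilds the sample string and cuts it at byte 512. The assumptions are that the string starts with a single newline right after `q(` and that each `<option>` sits on its own unindented line; the `@rows` list and the variable names are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rebuild the sample document from the question (assumed layout:
# a leading newline, then one unindented <option> element per line).
my @rows = (
    [ STAFE => 'Food Assistance' ],
    [ STAM7 => 'Med Asst - Lynchbrg' ],
    [ STAM8 => 'Med Asst - Marion' ],
    [ STAM9 => 'Med Asst - Petrsbrg' ],
    [ STAMA => 'Medical Assistance' ],
    [ STATA => 'Economic Assistance' ],
    [ STATR => 'Training Development' ],
);
my $html = "\n" . join "\n", map {
    my ($code, $label) = @$_;
    qq(<option value="$code">$code - 900 - BEN PROG - $label</option>);
} @rows;

my $chunk_size = 512;    # chunk size named in the reply above

# Split the document the way a 512-byte read would.
my $head = substr $html, 0, $chunk_size;
my $tail = substr $html, $chunk_size;

# Show the text on either side of the boundary.
print 'chunk 1 ends with:   [', substr($head, -20), "]\n";
print 'chunk 2 starts with: [', substr($tail, 0, 20), "]\n";
```

Under those assumptions the cut falls inside the word "Development". That fits the quoted paragraph: the parser can only break text at a whitespace boundary, so it reports everything up to "Training" and holds " Development" back for the next text event.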
Re: Possible HTML::TokeParser::Simple Bug
by Ovid (Archbishop) on Jan 28, 2003 at 19:40 UTC

    As Thelonius has pointed out, this is not a bug in HTML::TokeParser::Simple, but a feature of HTML::Parser. Converting your code to HTML::TokeParser demonstrates this (and shows why the Simple module is easier to use: once again, I had to look up the array indices :)

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    my $html = q(
    <option value="STAFE">STAFE - 900 - BEN PROG - Food Assistance</option>
    <option value="STAM7">STAM7 - 900 - BEN PROG - Med Asst - Lynchbrg</option>
    <option value="STAM8">STAM8 - 900 - BEN PROG - Med Asst - Marion</option>
    <option value="STAM9">STAM9 - 900 - BEN PROG - Med Asst - Petrsbrg</option>
    <option value="STAMA">STAMA - 900 - BEN PROG - Medical Assistance</option>
    <option value="STATA">STATA - 900 - BEN PROG - Economic Assistance</option>
    <option value="STATR">STATR - 900 - BEN PROG - Training Development</option>);

    my $p = HTML::TokeParser->new(\$html);
    while(my $token = $p->get_token){
        if($token->[0] eq 'T'){
            my $text = $token->[1];
            next unless $text =~ /\S/;
            print "[$text]\n";
        }
    }

    To be perfectly honest, though, I had not encountered this problem before. I think I'm going to think about the best way to make HTML::TokeParser::Simple DWIM. At the very least, I should update the docs to mention this (and correct a few tyops in the docs).

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)
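For anyone landing here with the same symptom, the `unbroken_text` attribute Thelonius quotes can be switched on for the token parser itself, since HTML::TokeParser inherits from HTML::Parser. What follows is a minimal sketch, not a tested recipe for every version. The input is made up for illustration (one long run of text that has to cross a 512-byte boundary), and `text_tokens` is a hypothetical helper. The same call should work on an HTML::TokeParser::Simple object, assuming it still subclasses HTML::TokeParser.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Illustrative input (not from the thread): one text run of 999 bytes,
# long enough to span a 512-byte chunk boundary.
my $words = join ' ', ('word') x 200;
my $html  = "<p>$words</p>";

# Collect the text tokens, optionally enabling unbroken_text first.
sub text_tokens {
    my ($doc, $unbroken) = @_;
    my $p = HTML::TokeParser->new(\$doc) or die "Cannot create parser";
    $p->unbroken_text(1) if $unbroken;
    my @texts;
    while (my $token = $p->get_token) {
        push @texts, $token->[1] if $token->[0] eq 'T';
    }
    return @texts;
}

my @default  = text_tokens($html, 0);
my @unbroken = text_tokens($html, 1);

printf "default:  %d text token(s)\n", scalar @default;
printf "unbroken: %d text token(s)\n", scalar @unbroken;
```

With `unbroken_text` enabled the whole run should come back as a single text token. With the default setting the number of pieces depends on how the parser is fed, so the safe habit is to either set the attribute or join adjacent text tokens yourself.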
