PerlMonks
Possible HTML::TokeParser::Simple Bug
by swiftone (Curate)
on Jan 28, 2003 at 17:54 UTC ( #230667=perlquestion )
swiftone has asked for the
wisdom of the Perl Monks concerning the following question:
Is this a HTML::TokeParser::Simple bug I should send along to Ovid, or does the problem exist between chair and keyboard?
Summary of problem: Some text tokens are getting split into two tokens. Sample test case below produces as output:

I have no idea why the "Development" ends up on its own line. This is the smallest sample from my data that gave these results -- adding more data "moves" the problem, but the problem still exists. Taking out the space between "Training" and "Development" in the data makes the new compound word the one that goes to its own line.

It's acting as if some buffer length is interfering, but it isn't just the token length: making a text token longer changes the position of the problem, but the split doesn't necessarily hit the longest token. In fact, in this sample set, it continues to hit the last text token somewhere.

I've skimmed the docs for HTML::Parser (I've used 3.25 and 3.26), HTML::TokeParser (2.24), and HTML::TokeParser::Simple (1.4). I've tried it on different boxes (one aging SuSE box with 5.6.0, one Debian unstable with 5.8.0).

Any ideas?