http://www.perlmonks.org?node_id=202482

Updated: corrected some broken tags and added some where needed. The quoted paras are there to differentiate extracts from the docs from my own words.

I recently responded to this post. Despite the title, I read what the poster was trying to achieve and decided that he wasn't trying to "parse HTML"--in fact he didn't give a fig for the HTML-- he was simply trying to 'extract' some data that was embedded amongst some other data that just happened to *BE* HTML. That seemed like a perfectly reasonable thing to do. In fact, given the Monk's quip regularly displayed at the top of this site--Practical Extraction and Reporting Language--I would consider it a bread & butter Perl application.

As I am still trying to get to grips with regexes, it seemed like an interesting challenge, so I cut & pasted the poster's code, added a print statement and ran it.

#! perl -sw
use strict;
use LWP::Simple;

my $html = get("http://pvpgnservers.ath.cx");
print ">>>\n$html\n<<<\n";

I was greeted with a screenful of HTML that took about 3 minutes to stop. Hmm. Ok. I ran it again, redirecting the output to a file, and then opened it in my editor. A touch short of 1930 lines of quite nicely structured, fairly clean HTML. A quick search found the relevant block:

<tr>
  <td><font size=1><a href="bnetd://217.172.178.113/">217.172.178.113</a></font></td>
  <td><a target="_blank" href="http://www.pure-dream.com"><font size=1>Pure-Dream</font></a></td>
  <td><font size=1>Europe</font></td>
  <td align=right><font size=1>0d 00:40</font></td>
  <td><font size=1><a href="mailto:webmaster@pure-dream.com">DreamDiver</a></font></td>
  <td><font size=1>PvPGN&nbsp;BnetD Mod 1.1.6 Linux</font></td>
  <td align=right><font size=1>42</font></td>
  <td align=right><font size=1>9</font></td>
</tr>
So, using my editor: I shifted all the lines left; replaced \n with \s+\n; replaced " " with \s; and replaced all the 'content' bits with ([^<]+?). I did the last bit semi-manually, i.e. I typed the replacement string, cut it to the buffer, highlighted the 8 bits one at a time, and pasted. Then I wrapped an m//sx around the whole thing, assigning the captures to an array, and modified the print statement to print the array. I commented out the call to LWP::Simple and added a slurp from the file I had created, to speed the testing. This took maybe 3-4 minutes.

I ran it and got

C:\test>202414 202414.htm
Possible unintended interpolation of @pure in string at C:\test\202414.pl line 9.
Global symbol "@pure" requires explicit package name at C:\test\202414.pl line 9.
Execution of C:\test\202414.pl aborted due to compilation errors.

C:\test>

So I escaped the @ and ran it again and was rewarded with what I wanted. Switched the code back to using the LWP and tried it. It worked.

The code I ended up with was

#! perl -sw
use strict;
use LWP::Simple;

my $html = get("http://pvpgnservers.ath.cx");
#my $html = do{ local $/; <>; };

# captures: address, name, location, uptime, contact, server version, users, games
my @stuff = $html =~ m!
<tr>\s+
<td><font\ssize=1><a\shref="bnetd://217.172.178.113/">([^<]+?)</a></font></td>\s+
<td><a\starget="_blank"\shref="http://www.pure-dream.com"><font\ssize=1>([^<]+?)</font></a></td>\s+
<td><font\ssize=1>([^<]+?)</font></td>\s+
<td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
<td><font\ssize=1><a\shref="mailto:webmaster\@pure-dream.com">([^<]+?)</a></font></td>\s+
<td><font\ssize=1>([^<]+)</font></td>\s+
<td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
<td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
</tr>\s+
<tr>
!sx;

print "@stuff\n";

Total time to build and test: under 10 minutes--nearer 7, as I recall, but I'll be conservative.

Okay. Simple question, simple answer, I posted. I then saw someone else had posted a "party line" answer, so I added a note explaining that I thought this was a Practical Extraction question rather than a parsing-HTML question, and left it at that. Within 10 minutes the post had gathered 2 downvotes. At that point I withdrew the post (a controversial decision, I know), posted it on my scratchpad, and /msg'd the questioner with the location. And resolved to write this.

Prior to withdrawal, I also added an update asking the downvoter(s) to explain their decision. No response, but more on that later.

To see the saga of me trying to do this "the right way", please read on.

Please note: This is not about loss of XP. I'm burdened with what is almost an embarrassment of riches where XP is concerned, and whilst I disagree with the usual "XP and a quid* will buy you a coffee at McDonald's" quote--inasmuch as XP usually brings a smile to my face whereas Mac's coffee never does--I see it as a fun inducement to posting and membership, some measure of other monks' approval for your efforts, and little more.

In the absence of any other explanation, I assumed the only reason for downvoting a piece of simple, working code was that it went against the party line by not using an HTML::X module, so I thought I would have a go at doing the same thing the "proper" way.

The first thing to do was to decide which of the HTML modules to use. AS 5.6.1, which I'm currently using, comes with a host of these as standard, so I thought I would start with one of those--but which one?

AsSubs
functions that construct a HTML syntax tree.

No good, this calls for de-construction if anything

Element
Class for objects that represent HTML elements -

10 minutes' reading to confirm it's for construction.

 - traverse
discussion of HTML::Element's traverse method

Just more about the above

Entities
Encode or decode strings with HTML entities

Nope.

Filter
Filter HTML text through the parser

This module is deprecated. HTML::Parser now provides the functionality of HTML::Filter.

Skip ahead to HTML::Parser

Form
It's not a form.
HeadParser
Not interested in the HEAD :^)
LinkExtor
Most of the info isn't links.
Parse
Deprecated.
Parser
HTML parser class ... SYNOPSIS: use HTML::Parser ();

Useful!

Objects of the HTML::Parser class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked.

Hmmm. But I'm not interested in the markup, only the content.

If event driven parsing does not feel right for your application, you might want to use HTML::PullParser

Read on, Gunga Din; maybe it's just what you need... Ten minutes later: nope. To use this I need a way to determine, when I'm called back, that the content handed to me is the content I'm after; but that content sits in one block of 20 or so nearly identical blocks, each containing the same sequence of nested tag groups, and each group holding one of the bits of content I want. That means devising a state machine for the whole 128k, 1930-line web page, recording what I have seen so far, so that I know when I get called with the bits I want. And of course, if they add a group above mine, or move to using style sheets instead of font tags, or add or remove a field from the displayed content, I have to start over. So... I won't be doing it that way. They do, however, mention HTML::PullParser.

PullParser
Alternative HTML::Parser interface

You associate a file (or any IO::Handle object or string) with the parser at construction time and then repeatedly call $parser->get_token to obtain the tags and text found in the parsed document

Sounds possible. Pass in the string with the HTML, sit in a loop searching for ?something?, then keep pulling the next token, compare it with the next thing I'm looking for, and push the content to my array--roughly as sketched below.

Of course, it's still going to rely upon the web page not changing layout; and given the layout of the HTML, it's going to be impossible to determine programmatically whether they have added or removed a field, as most of them don't have a handy tag by which they can be isolated.
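
Something along these lines is what I had in mind. This is a hypothetical sketch only--not code I actually wrote or ran--and note that it still pins the row of interest to the bnetd:// address, much as my regex did:

#! perl -sw
use strict;
use HTML::PullParser;

my $html = do{ local $/; <>; };        # slurp a saved copy of the page

my $p = HTML::PullParser->new(
    doc   => \$html,
    start => 'event, tagname, attr',   # tokens look like [ 'start', $tag, \%attr ]
    end   => 'event, tagname',         # [ 'end', $tag ]
    text  => 'event, dtext',           # [ 'text', $decoded_text ]
) or die "Can't create parser: $!";

my( @fields, $in_row );
while( my $token = $p->get_token ) {
    my( $event, @info ) = @$token;
    if( $event eq 'start' and $info[0] eq 'a'
        and ( $info[1]{href} || '' ) =~ m!^bnetd://217\.172\.178\.113/! ) {
        $in_row = 1;                   # this is the row I'm after
    }
    elsif( $event eq 'end' and $info[0] eq 'tr' and $in_row ) {
        last;                          # end of that row; done
    }
    elsif( $event eq 'text' and $in_row ) {
        push @fields, $info[0] if $info[0] =~ /\S/;   # keep the non-blank bits
    }
}
print "@fields\n";

It at least does away with matching the font tags, but it still knows a good deal about this particular page, which is the point of the previous paragraph.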

Tagset
Basically a utility module for HTML::Element or HTML::TreeBuilder
TokeParser
Alternative HTML::Parser interface

An alternative interface, but no easier to use from what I could see after 10 minutes of reading.

Tree
Seems very powerful, but after an hour of reading the docs I didn't see anything that would make it especially easy to locate one nearly identical row of a table among the 20+ other rows of data. I could count, of course, but it looks to me like the page itself is generated, and if one of the sites it examines is down it will simply omit its row from the table--at which point the counting method goes out the window.
 - AboutObjects
More of the same
 - AboutTrees
More of the same
 - Scanning
More of the same
TreeBuilder
More of the same

Well, here I am 3 hours later, and the only one that seemed like it might work for this without a huge learning curve and gobs of code is HTML::PullParser. So here goes; start simple.

#! perl -sw
use strict;
use LWP::Simple;
use HTML::PullParser;
use Data::Dumper;

#my $html = get("http://pvpgnservers.ath.cx");
my $html = do{ local $/; <>; };

my $p = HTML::PullParser->new( doc => \$html );

print Dumper($p);

Start by checking the syntax with perl -c. The code compiles clean. Run it against a local copy of the HTML and it produces

C:\test>202414-2 202414.htm
Info not collected for any events at C:\test\202414-2.pl line 23

C:\test>

I wonder what that means. Look for an error codes section: nothing. OK, look for an examples section... it has one...

EXAMPLES

The 'eg/hform' script shows how we might parse the form section of HTML::Documents using HTML::PullParser.

That's it? ... Yes! That is IT! Nothing.

Okay. I noticed earlier that it referred me back to the HTML::Parser docs, so let's see what those produce. They have a Diagnostics section, and there are quite a few error messages listed with explanations. But not

Info not collected for any events at C:\test\202414-2.pl line 23

Something else I just noticed: C:\test\202414-2.pl is my script all right, but it only has 13 lines!! Unlucky for me, I guess.

Sod this for a game of soldiers. Maybe, just maybe, if I needed to do this, and I needed to re-write the HTML, or I needed to utilise the structure of the HTML for some purpose, maybe it would be worth pursuing this, but I don't and it ain't. So there.

The monk regularly quips, "Be a heretic", so I will.

If I need to extract a small piece of information from a page of HTML, I'll use a regex. It took less than 10 minutes to do, it would take less still to re-write it if the page changes, and I had so far spent 3 hours looking at this and got nowhere. On that basis, the page would need to change format in a way that breaks my regex 18 times before this wasted 3 hours would be repaid. And even then there is, as far as I can see, no guarantee that those changes wouldn't break a working script using HTML::Parser, and if they did, it would be an awful lot harder to put right.

You know, if the information required by the OP was contained in an e-mail in a paragraph something like this:

The Pure-Dream server on the net at http://www.pure-dream.com/ (ip-address: 217.172.178.113) (or on bnetd at bnetd://217.172.178.113/) is a games server in Europe running PvPGN BnetD Mod 1.1.6 Linux. It currently lists 42 users playing 9 games and has been up for 0d 00:40. If you wish to contact the webmaster (DreamDiver) to get an account you may do so by sending mail to webmaster@pure-dream.com.

No one here would hesitate to recommend a regex to extract that information.
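
To make that concrete, here is the sort of throwaway regex I mean--a sketch of my own, matching nothing more than the wording of the paragraph above:

#! perl -w
use strict;

# A sketch only: pull the interesting bits out of the plain-English
# paragraph above (reproduced here so the snippet stands alone).
my $text = <<'EOT';
The Pure-Dream server on the net at http://www.pure-dream.com/
(ip-address: 217.172.178.113) (or on bnetd at bnetd://217.172.178.113/)
is a games server in Europe running PvPGN BnetD Mod 1.1.6 Linux. It
currently lists 42 users playing 9 games and has been up for 0d 00:40.
EOT

my( $name, $ip, $users, $games ) = $text =~ m{
    ^The\s+(\S+)\s+server           # server name
    .*? ip-address:\s* ([\d.]+)     # dotted-quad address
    .*? lists\s+ (\d+) \s+ users    # current user count
    \s+ playing\s+ (\d+) \s+ games  # current game count
}xs;

print "$name ($ip): $users users, $games games\n";

A handful of lines, and every bit as disposable as the prose it matches.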

So why, just because the surrounding dross happens to be HTML, do people get so insistent that "You can't do that with a regex, you gotta use a module"?

If and when I need to parse HTML--that is, when I need to determine and manipulate the markup itself--I will learn to use one of the above modules. However, if all I need is to extract a few pieces of information from the content of a page of HTML, I'll stick to a regex.

If any of you guys that are regular users of one of the HTML::x modules feel like showing me how this should be done, I'd love to see the code. To all those that haven't tried doing something similar to this using one of the HTML::x modules, please don't advocate their use to others until you have.

Re: Being a heretic and going against the party line.
by Chmrr (Vicar) on Oct 03, 2002 at 13:23 UTC

    I can certainly see your point, and personally I would never downvote anyone for giving an example which worked and which fit the requirements.

    Personally, I find that using the HTML::* modules makes the code cleaner, more compact, and easier to understand what's going on. For example, here's how I'd write the bit of code in question, using HTML::TreeBuilder (my personal favorite for doing such manipulations):

    #!/usr/bin/perl

    use warnings;
    use strict;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $html = get("http://pvpgnservers.ath.cx")
        or die "Getting html: $!";
    my $tree = HTML::TreeBuilder->new_from_content($html)
        or die "Building html tree: $!";

    $tree = $tree->look_down("_tag"=>"a", "href"=>"http://www.pure-dream.com")
                 ->look_up("_tag","tr");
    $tree->objectify_text();

    print join ' ', map {$_->attr("text")} $tree->look_down("_tag","~text");

    Whenever I see a huge regex, I must admit that my eyes generally glaze over slightly. Even though ones such as the one you gave are not actually all that complex, they tend to look rather intimidating.

    A personal anecdote on the use of HTML::* modules for parsing: a while back, I wrote a program which, given an ISBN number, would look up basic information from Amazon, such as title, author, possibly series, and so on. This was nearly a year ago. Just last week, someone asked me if I still had the code around. I dug around and ran it -- and, lo and behold, it spat back information. Even though Amazon had rearranged the webpage significantly over that time, the extractor still worked.

    I shan't just tell you to drink the kool-aid, but in general most of my solutions, if I have the choice at all, will use HTML::* modules. Why? I've placed my trust in them many a time, and they have yet to let me down. I will suggest that others do the same, but if they choose not to -- well, that is their choice, and they may well be right or wrong down the road. It's their kool-aid. ;>

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^`+*^e v^#$&V"+@( NO CARRIER'

Re: Being a heretic and going against the party line.
by blakem (Monsignor) on Oct 03, 2002 at 13:27 UTC
    Just for a point of reference, here's how I would have done it.... And I didn't have to look at the HTML source even once.
    #!/usr/bin/perl -wT

    use strict;
    use HTML::TableExtract;
    use LWP::Simple;

    my $te   = new HTML::TableExtract();
    my $html = get('http://pvpgnservers.ath.cx/') or die;
    $te->parse($html);

    for my $row ($te->rows) {
        print "@$row\n" if $row->[1] eq 'Pure-Dream';
    }

    -Blake

Re: Being a heretic and going against the party line.
by davorg (Chancellor) on Oct 03, 2002 at 13:32 UTC

    How well will your solution work when the layout of the page changes subtly? I'm not pretending that the solution I've given here is bullet proof, but it's a lot more flexible than yours is.

    The point is that HTML parsers understand HTML. It's easier to write a solution when you use the right tool for the job. If you look at my solution, the code is very easy to follow - find all the table rows in the HTML, then find one where the text starts with the required IP address, then extract all of the text from that row. I didn't need to go into the detail of the HTML in the same way that you did.

    Yes, it's possible to extract useful data from HTML using regular expressions (the most excellent book Perl & LWP is full of them) but that can only ever be a "use once", quick and dirty hack.

    Oh, and a final comment on your terminology. What we're all doing in this problem is parsing. Data extraction is parsing by any meaningful definition of the term.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Being a heretic and going against the party line.
by valdez (Monsignor) on Oct 03, 2002 at 12:36 UTC

    Please BrowserUk, don't erase your postings; leave what you wrote in its place, and other monks will upvote your nodes if needed. The problem is that I didn't understand what happened at node 202414 until I read this meditation. Of course I can't teach you anything, but if you aren't an XP addict, why erase a node?

    Btw, I upvoted both nodes, including the withdrawn one.

    Ciao, Valerio

Re: Being a heretic and going against the party line.
by LTjake (Prior) on Oct 03, 2002 at 12:37 UTC
    I believe that your answer was just fine due to the fact that the author of the question specifically mentioned that no HTML::X modules should be used. I've seen similar situations/requests on here before.

    However, I don't think you should've removed your post. Suck it up. If you think your answer is a good one, then leave it be.

    The standard recommendation has always been not to reinvent the wheel. In some situations, those wheels are not available -- TMTOWTDI.
Re: Being a heretic and going against the party line.
by hiseldl (Priest) on Oct 03, 2002 at 13:11 UTC

    Excellent writeup BrowserUk!

    I wince when I think of the 'party line' being posted without code to back it up, which is what I see a lot of. It's easy to say 'use module X,' and not so easy to back it up with real live working code. Every time I see a writeup suggesting that the poster 'use module X' and there is no code to back it up, I want to scream! And your post sums up very nicely why.

    So, show us your acumen with 'module X' and put some code in your writeup that shows how to solve the problem in 13 lines or less using that module; that would be tremendously more helpful than an empty plea to 'use module X.'

    --
    hiseldl
    What time is it? It's Camel Time!

Re: Being a heretic and going against the party line.
by McD (Chaplain) on Oct 03, 2002 at 13:25 UTC
    There was a Perl Journal article by Jon Orwant and Dan Gruhl on precisely this back in 1999.

    I've used this technique any number of times for quick hacks. As you correctly observed, there's no need to truck out HTML::* when you just want a snippet of data out of a web page.

    Of course, you need to grok regexen pretty well to wield this magic, and the list of caveats is as long as my arm, but my point is that for quick hacks, this is fine. Don't be ashamed of your heresy. :-)

    Peace,
    -McD

Re: Being a heretic and going against the party line.
by rdfield (Priest) on Oct 03, 2002 at 12:18 UTC
    Well said, BrowserUK. A good argument against "Cargo Cult noding" methinks.

    rdfield

Re: Being a heretic and going against the party line.
by talexb (Chancellor) on Oct 03, 2002 at 14:32 UTC

    This is a 'Me, too!' node, but I just want to reinforce what others have said.

    While I hate getting --'ed too, stick it out and leave the post as is -- it provides and will provide useful information to current and future readers. I've had a few ugly posts myself, to do with XOR encryption and more recently Tact and the Monastery, but I figure you have to colour outside the lines once in a while to remind yourself where the lines are.

    --t. alex
    but my friends call me T.