Updated: corrected some broken tags and added some where needed. The italic paragraphs are there to differentiate my words from extracts from the docs.
I recently responded to this post. Despite the title, I read what the poster was trying to achieve and decided that he wasn't trying to "parse HTML"--in fact he didn't give a fig for the HTML-- he was simply trying to 'extract' some data that was embedded amongst some other data that just happened to *BE* HTML. That seemed like a perfectly reasonable thing to do. In fact, given the Monk's quip regularly displayed at the top of this site--Practical Extraction and Reporting Language--I would consider it a bread & butter Perl application.
As I am still trying to get to grips with regexes, it seemed like an interesting challenge, so I cut & pasted the poster's code, added a print statement and ran it.
I was greeted with a screenful of HTML that took about 3 minutes to stop. Hmm. Ok. I ran it again, redirecting the output to a file, and then opened it in my editor. A touch short of 1930 lines of quite nicely structured, fairly clean HTML. A quick search for the relevant block found this
So, using my editor: I shifted all the lines left; replaced \n with \s+\n; replaced " " with \s; and replaced all the 'content' bits with ([^<]+?). I did the last bit semi-manually, i.e. I typed the replacement string, cut it to the buffer, highlighted the 8 bits one at a time and pasted. Then I wrapped an m//sx around the whole thing, assigning the captures to an array, and modified the print statement to print the array. I commented out the call to LWP::Simple and added a slurp from the file I had created to speed up the testing. This took maybe 3-4 minutes.
I ran it and got
So I escaped the @ and ran it again and was rewarded with what I wanted. I switched the code back to using LWP and tried it. It worked.
The code I ended up with was
Total time to build and test: under 10 minutes, nearer 7 as I recall, but I'll be conservative.
Okay. Simple question, simple answer, I posted. I then saw someone else had posted a "party line" answer, so I added a note identifying that I thought this was a Practical Extraction rather than a parsing HTML question and left it at that. Within 10 minutes the post had gathered 2 down votes. At that point I withdrew the post (controversial decision, I know), posted it on my scratchpad, and /msg'd the questioner with the location. And resolved to write this.
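A minimal sketch of the approach described above, and not the code from the post: the HTML fragment below is invented for illustration (only the values 42, 9 and DreamDiver echo the example paragraph later in this post), and the address admin@example.com is a placeholder. It shows the three editor substitutions (\s+ for line breaks, \s for literal spaces, ([^<]+?) for each piece of content) and the escaped @.

```perl
use strict;
use warnings;

# Invented stand-in for the relevant block of the saved page.
# The markup and the mail address are hypothetical.
my $html = <<'HTML';
<html><body>
    <table>
        <tr>
            <td class="label">Users</td>
            <td>42</td>
        </tr>
        <tr>
            <td class="label">Games</td>
            <td>9</td>
        </tr>
        <tr>
            <td class="label">Webmaster</td>
            <td><a href="mailto:admin@example.com">DreamDiver</a></td>
        </tr>
    </table>
</body></html>
HTML

# The block pasted in as a pattern. Under /x, whitespace in the pattern is
# ignored, so line breaks become \s+ and literal spaces become \s.
# Each piece of wanted content is replaced by a ([^<]+?) capture.
# The @ must be escaped: unescaped, Perl would try to interpolate an
# array named @example into the pattern.
my @fields = $html =~ m{
    <td\sclass="label">Users</td>      \s+
    <td>([^<]+?)</td>                  \s+
    </tr>                              \s+
    <tr>                               \s+
    <td\sclass="label">Games</td>      \s+
    <td>([^<]+?)</td>                  \s+
    </tr>                              \s+
    <tr>                               \s+
    <td\sclass="label">Webmaster</td>  \s+
    <td><a\shref="mailto:admin\@example\.com">([^<]+?)</a></td>
}sx;

print join(' | ', @fields), "\n";
```

Run against the invented fragment, this prints `42 | 9 | DreamDiver`. In list context the match returns the captures directly, which is why they can be assigned straight to an array.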
Prior to withdrawal, I also added an update asking the downvoter(s) to explain their decision. No response, but more on that later.
To see the saga of me trying to do this "the right way" please
Please note: This is not about loss of XP. I'm burdened with what is almost an embarrassment of riches where XP is concerned, and whilst I disagree with the usual "XP and a quid* will buy you a coffee at MacDonald's" quote, in as much as XP usually brings a smile to my face whereas Mac's coffee never does, I see it as a fun inducement to posting and membership, and some measure of other monks' approval for your efforts, and little more.
In the absence of any other explanation, I thought the only reason for down voting a piece of simple, working code was that it went against the party line by not using an HTML::X module, so I thought I would have a go at doing the same thing the "proper" way.
First thing to do was to decide which of the HTML modules to use. AS 5.6.1, which I'm currently using, comes with a host of these as standard, so I thought I would start with one of those--but which one?
Well, here I am 3 hours later and the only one that seemed like it might work for this, without a huge learning curve and gobs of code, is HTML::PullParser, so here goes. Start simple.
Start by checking the syntax with perl -c. The code compiles clean. Run this on a local copy of the HTML and it produces
I wonder what that means. Look for an error code section: Nothing. Ok, look for an examples section... it has one...
That's it? ... Yes! That is IT! Nothing.
Okay, I noticed that it referred me back to the HTML::Parser docs earlier, see what that produces. It has a Diagnostics section, and there are quite a few error messages listed with explanations. But not
Info not collected for any events at C:\test\202414-2.pl line 23
Something else I just noticed. C:\test\202414-2.pl is my script all right, but it only has 13 lines!! Unlucky for me, I guess.
Sod this for a game of soldiers. Maybe, just maybe, if I needed to do this, and I needed to re-write the HTML, or I needed to utilise the structure of the HTML for some purpose, maybe it would be worth pursuing this, but I don't and it ain't. So there.
The monk regularly quips, "Be a heretic", so I will.
If I need to extract a small piece of information from a page of HTML, I'll use a regex. It took less than 10 minutes to do, it would take less still to re-write it if the page changes, and I have so far spent 3 hours looking at this and got nowhere. On that basis, the page would need to change format in a way that breaks my regex 18 times (18 x 10 minutes = 3 hours) before this wasted 3 hours would be repaid. And even then, there is, as far as I can see, no guarantee that those changes wouldn't break a working script using HTML::Parser, and if they did, it would be an awful lot harder to put right.
You know, if the information required by the OP was contained in an e-mail in a paragraph something like this:
The Pure-Dream server on the net at http://www.pure-dream.com/ (ip-address: 126.96.36.199) (or on bnetd at bnetd://188.8.131.52/) is a games server in Europe running PvPGN BnetD Mod 1.1.6 Linux. It currently lists 42 users playing 9 games and has been up for 0d 00:40. If you wish to contact the webmaster (DreamDiver) to get an account you may do so by sending mail to email@example.com.
No one here would hesitate to recommend a regex to extract that information.
So why, just because the surrounding dross happens to be HTML, do people get so insistent that "You can't do that with a regex, you gotta use a module"?
If and when I need to parse HTML, that is, I need to determine and manipulate the markup itself, I will learn to use one of the above modules. However, if all I need is to extract a few pieces of information from the content of a page of HTML, I'll stick to a regex.
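As an illustration of that point, here is a rough sketch of such a regex. It assumes a trimmed copy of the paragraph above and that the wording around the numbers stays fixed. The variable names are my own choice for the example.

```perl
use strict;
use warnings;

# Trimmed copy of the example paragraph; the elided middle is not needed
# for the fields extracted here.
my $mail = q{The Pure-Dream server ... is a games server in Europe running PvPGN BnetD Mod 1.1.6 Linux. It currently lists 42 users playing 9 games and has been up for 0d 00:40.};

# Anchor on the fixed wording and capture only the parts that vary.
my ($users, $games, $uptime) = $mail =~ m{
    lists   \s+ (\d+) \s+ users \s+
    playing \s+ (\d+) \s+ games \s+
    and \s+ has \s+ been \s+ up \s+ for \s+
    (\d+d \s+ \d\d:\d\d)
}x;

print "users=$users games=$games uptime=$uptime\n";
```

This prints `users=42 games=9 uptime=0d 00:40`. The pattern is the same in spirit as one written against a block of HTML: fixed context around the data, captures for the data itself.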
If any of you guys that are regular users of one of the HTML::x modules feel like showing me how this should be done, I'd love to see the code. To all those that haven't tried doing something similar to this using one of the HTML::x modules, please don't advocate their use to others until you have.