Updated: corrected some broken tags and added some where needed. The italic paras are to differentiate my words from extracts from the docs.

I recently responded to this post. Despite the title, I read what the poster was trying to achieve and decided that he wasn't trying to "parse HTML"--in fact he didn't give a fig for the HTML-- he was simply trying to 'extract' some data that was embedded amongst some other data that just happened to *BE* HTML. That seemed like a perfectly reasonable thing to do. In fact, given the Monk's quip regularly displayed at the top of this site--Practical Extraction and Reporting Language--I would consider it a bread & butter Perl application.

As I am still trying to get to grips with regexes, it seemed like an interesting challenge, so I cut & pasted the poster's code, added a print statement and ran it.

#! perl -sw
use strict;

use LWP::Simple;
my $html = get("http://pvpgnservers.ath.cx");
print ">>>\n$html\n<<<\n";

I was greeted with a screenful of HTML that took about 3 minutes to stop. Hmm. Ok. I ran it again, redirecting the output to a file, and then opened it in my editor. A touch short of 1930 lines of quite nicely structured, fairly clean HTML. A quick search found the relevant block:

  <tr>
    <td><font size=1><a href="bnetd://217.172.178.113/">217.172.178.113</a></font></td>
    <td><a target="_blank" href="http://www.pure-dream.com"><font size=1>Pure-Dream</font></a></td>
    <td><font size=1>Europe</font></td>
    <td align=right><font size=1>0d 00:40</font></td>
    <td><font size=1><a href="mailto:webmaster@pure-dream.com">DreamDiver</a></font></td>
    <td><font size=1>PvPGN&nbsp;BnetD Mod 1.1.6 Linux</font></td>
    <td align=right><font size=1>42</font></td>
    <td align=right><font size=1>9</font></td>
  </tr>

So, using my editor: I shifted all the lines left; replaced \n with \s+\n; replaced " " with \s; and replaced all the 'content' bits with ([^<]+?). I did the last bit semi-manually, i.e. I typed the replacement string, cut it to the buffer, highlighted the 8 bits one at a time and pasted. Then I wrapped a m//sx around the whole thing, assigning the captures to an array, and modified the print statement to print the array. I commented out the call to LWP::Simple and added a slurp from the file I had created to speed up the testing. This took maybe 3-4 minutes.

I ran it and got

C:\test>202414 202414.htm
Possible unintended interpolation of @pure in string at C:\test\202414.pl line 9.
Global symbol "@pure" requires explicit package name at C:\test\202414.pl line 9.
Execution of C:\test\202414.pl aborted due to compilation errors.

C:\test>

So I escaped the @ and ran it again and was rewarded with what I wanted. Switched the code back to using the LWP and tried it. It worked.
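For anyone who hits the same message, here is a minimal sketch of two ways round it, using one line from the row above as sample data. The variable names ($line, $addr, $by_hand, $quoted) are mine, purely for illustration. The first form escapes the @ by hand, as I did; the second keeps the address in a single-quoted string, where nothing interpolates, and lets \Q...\E quote it inside the match.

```perl
use strict;
use warnings;

# One line from the table row shown earlier, used as sample input.
my $line = '<td><font size=1><a href="mailto:webmaster@pure-dream.com">DreamDiver</a></font></td>';

# Option 1: escape the @ by hand so "@pure" is not taken for an array.
my ($by_hand) = $line =~ m!<a\shref="mailto:webmaster\@pure-dream\.com">([^<]+?)</a>!x;

# Option 2: keep the literal in a single-quoted string (no interpolation
# happens there) and let \Q...\E escape the metacharacters in the match.
my $addr = 'mailto:webmaster@pure-dream.com';
my ($quoted) = $line =~ m!<a\shref="\Q$addr\E">([^<]+?)</a>!x;

print "$by_hand $quoted\n";
```

Either way the capture is the link text, and use strict stays happy.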

The code I ended up with was

#! perl -sw
use strict;

use LWP::Simple;
my $html = get("http://pvpgnservers.ath.cx");
#my $html = do{local $/; <>; };

# Each ([^<]+?) captures the text content of one cell of the row.
# The @ in the mailto address is escaped so it is not interpolated.
my @stuff = $html =~
m!
<tr>\s+
<td><font\ssize=1><a\shref="bnetd://217.172.178.113/">([^<]+?)</a></font></td>\s+
<td><a\starget="_blank"\shref="http://www.pure-dream.com"><font\ssize=1>([^<]+?)</font></a></td>\s+
<td><font\ssize=1>([^<]+?)</font></td>\s+
<td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
<td><font\ssize=1><a\shref="mailto:webmaster\@pure-dream.com">([^<]+?)</a></font></td>\s+
<td><font\ssize=1>([^<]+)</font></td>\s+
<td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
<td\salign=right><font\ssize=1>([^<]+?)</font></td>\s+
</tr>\s+
<tr>
!sx;

print "@stuff\n";

Total time to build and test: under 10 minutes. Nearer 7 as I recall, but I'll be conservative.

Okay. Simple question, simple answer, I posted. I then saw someone else had posted a "party line" answer, so I added a note identifying that I thought this was a Practical Extraction rather than a parsing HTML question and left it at that. Within 10 minutes the post had gathered 2 down votes. At that point I withdrew the post (controversial decision, I know), posted it on my scratchpad, and /msg'd the questioner with the location. And resolved to write this.

Prior to withdrawal, I also added an update asking the downvoter(s) to explain their decision. No response, but more on that later.

To see the saga of me trying to do this "the right way", please read on.

Please note: This is not about loss of XP. I'm burdened with what is almost an embarrassment of riches where XP is concerned, and whilst I disagree with the usual "XP and a quid* will buy you a coffee at McDonald's" quote, in as much as XP usually brings a smile to my face whereas Mac's coffee never does, I see it as a fun inducement to posting and membership, and some measure of other monks' approval for your efforts, and little more.

In the absence of any other explanation, I thought the only reason for down voting a piece of simple, working code was that it went against the party line by not using an HTML::X module, so I thought I would have a go at doing the same thing the "proper" way.

First thing to do was to decide which of the HTML modules to use. AS 5.6.1 which I'm currently using comes with a host of these as standard so I thought I would start with one of those--but which one?

AsSubs

functions that construct a HTML syntax tree.

No good, this calls for de-construction if anything.

Element

Class for objects that represent HTML elements -

10 minutes reading to confirm it's for construction.

- traverse

discussion of HTML::Element's traverse method

Just more about the above

Entities

Encode or decode strings with HTML entities

Nope.

Filter

Filter HTML text through the parser

This module is deprecated. HTML::Parser now provides the functionally of HTML::Filter

Skip ahead to HTML::Parser

Form

It's not a form.

HeadParser

Not interested in the HEAD :^)

LinkExtor

Most of the info isn't links.

Parse

Deprecated.

Parser

HTML parser class....SYNOPSIS use HTML::Parser (); Useful!

Objects of the HTML::Parser class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked.

Hmmm. But I'm not interested in the markup, only the content.

If event driven parsing does not feel right for your application, you might want to use HTML::PullParser

Read on gungadin maybe it's just what you need..... 10 minutes later. Nope. To use this I need to find a way to determine when I am called back with the content I'm after, but the content is in one block of 20 or so nearly identical blocks, each containing the same sequence of nested tag groups, each of which contains one bit of the content I am after. That means devising a state machine for the whole 128k, 1930 line web page to record what I have seen so I will know when I get called with the bits I want. And of course, if they add a group above me, or move to using style sheets instead of font tags or add or remove a field from the displayed content, I have to start over. So....I won't be doing it that way. They mention HTML::PullParser

PullParser

Alternative HTML::Parser interface

You associate a file (or any IO::Handle object or string) with the parser at construction time and then repeatedly call $parser->get_token to obtain the tags and text found in the parsed document

Sounds possible. Pass the string with the HTML, sit in a loop searching for ?something?, then keep pulling the next token, compare it with the next thing I'm looking for, and push the content to my array.

Of course, it's still going to rely upon the web page not changing layouts, and given the layout of the HTML, it's going to be impossible to determine programmatically if they have added or removed a field, as most of them don't have a handy tag by which they can be isolated.

Tagset

Basically a utility module for HTML::Element or HTML::TreeBuilder

TokeParser

Alternative HTML::Parser interface

An alternative interface, but no easier to use from what I could see after 10 minutes of reading.

Tree

Seems very powerful, but after an hour of reading the docs, I didn't see anything that would make it especially easy to pick one nearly identical row of a table out from the 20+ other rows of data. I could count, of course, but it looks to me like the page itself is generated, and if one of the sites it examines is down, it will simply omit its row from the table; then the counting method goes out the window.

- AboutObjects

More of the same

- AboutTrees

More of the same

- Scanning

More of the same

TreeBuilder

More of the same

Well, here I am 3 hours later, and the only one that seemed like it might work for this, without a huge learning curve and gobs of code, is HTML::PullParser. So here goes: start simple.

#! perl -sw
use strict;
use LWP::Simple;
use HTML::PullParser;
use Data::Dumper;

#my $html = get("http://pvpgnservers.ath.cx");
my $html = do{local $/; <>; };

my $p= HTML::PullParser->new( doc => \$html );

print Dumper($p);

Start by checking the syntax with perl -c. The code compiles clean. Run it on a local copy of the HTML and it produces

C:\test>202414-2 202414.htm
Info not collected for any events at C:\test\202414-2.pl line 23

C:\test>

I wonder what that means. Look for an error code section: nothing. Ok, look for an examples section... it has one...

EXAMPLES

The 'eg/hform' script shows how we might parse the form section of HTML::Documents using HTML::PullParser.

That's it? ... Yes! That is IT! Nothing.

Okay, I noticed that it referred me back to the HTML::Parser docs earlier, see what that produces. It has a Diagnostics section, and there are quite a few error messages listed with explanations. But not

Info not collected for any events at C:\test\202414-2.pl line 23

Something else I just noticed. C:\test\202414-2.pl is my script all right, but it only has 13 lines!! Unlucky for me I guess.
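As far as I can tell, that message is what the HTML::PullParser constructor says when it is not told which events to collect. What follows is only a sketch, on the assumption that naming at least one event (start, end, text and so on), each with an HTML::Parser style argspec string, is what it wants. The one-row $html fragment and the @tokens array are stand-ins of my own, not the real page.

```perl
use strict;
use warnings;
use HTML::PullParser;

# Stand-in for the downloaded page (hypothetical one-row fragment).
my $html = '<tr><td><font size=1>Europe</font></td></tr>';

# Name the events to collect; each value is an HTML::Parser argspec
# saying what goes into the token for that kind of event.
my $p = HTML::PullParser->new(
    doc   => \$html,
    start => 'event, tagname',
    end   => 'event, tagname',
    text  => 'event, dtext',
) or die "Cannot create parser";

my @tokens;
while ( my $token = $p->get_token ) {
    push @tokens, $token;                 # each token is an array reference
    print join( ' | ', @$token ), "\n";   # e.g. the event name, then its data
}
```

That only gets as far as a stream of tokens. Working out which of them belong to the row I want is still the state-machine problem described above.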

Sod this for a game of soldiers. Maybe, just maybe, if I needed to do this, and I needed to re-write the HTML, or I needed to utilise the structure of the HTML for some purpose, maybe it would be worth pursuing this, but I don't and it ain't. So there.

The monk regularly quips, "Be a heretic", so I will.

If I need to extract a small piece of information from a page of HTML, I'll use a regex. It took less than 10 minutes to do, it would take less still to re-write it if the page changes, and I have so far spent 3 hours looking at this and got nowhere. On that basis (180 minutes against 10 minutes a time), the page would need to change format in a way that breaks my regex 18 times before this wasted 3 hours would be repaid. And even then, there is, as far as I can see, no guarantee that those changes wouldn't break a working script using HTML::Parser, and if it did, it would be an awful lot harder to put right.

You know, if the information required by the OP was contained in an e-mail in a paragraph something like this:

The Pure-Dream server on the net at http://www.pure-dream.com/ (ip-address: 217.172.178.113) (or on bnetd at bnetd://217.172.178.113/) is a games server in Europe running PvPGN BnetD Mod 1.1.6 Linux. It currently lists 42 users playing 9 games and has been up for 0d 00:40. If you wish to contact the webmaster (DreamDiver) to get an account you may do so by sending mail to webmaster@pure-dream.com.

No one here would hesitate in recommending a regex to extract that information.

So why, just because the surrounding dross happens to be HTML, do people get so insistent that "You can't do that with a regex, you gotta use a module"?

If, as and when, I need to parse HTML, that is, I need to determine and manipulate the markup itself, I will learn to use one of the above modules. However, if all I need is to extract a few pieces of information from the content of a page of HTML, I'll stick to a regex.
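A regex need not be welded to the exact markup either. The following is a looser sketch, not the code I posted: it picks the row by something it contains (the bnetd address) rather than by position, then strips whatever tags happen to wrap each cell. The two-row sample and the first server's details are made up for illustration, and it assumes rows are not nested inside other tables.

```perl
use strict;
use warnings;

# Hypothetical two-row sample in the same shape as the page's table.
my $html = <<'HTML';
<tr>
  <td><font size=1><a href="bnetd://10.0.0.1/">10.0.0.1</a></font></td>
  <td><font size=1>Asia</font></td>
  <td align=right><font size=1>7</font></td>
</tr>
<tr>
  <td><font size=1><a href="bnetd://217.172.178.113/">217.172.178.113</a></font></td>
  <td><font size=1>Europe</font></td>
  <td align=right><font size=1>42</font></td>
</tr>
HTML

# Split into rows, then pick the row by what it contains, not where it is.
my @rows  = $html =~ m!<tr>(.*?)</tr>!gs;
my ($row) = grep { m!bnetd://217\.172\.178\.113/! } @rows;

# Pull each cell and strip whatever tags happen to wrap its text.
my @stuff;
if ( defined $row ) {
    for my $cell ( $row =~ m!<td[^>]*>(.*?)</td>!gs ) {
        ( my $text = $cell ) =~ s/<[^>]+>//g;
        $text =~ s/&nbsp;/ /g;
        push @stuff, $text;
    }
}
print join( ' | ', @stuff ), "\n";
```

If they swap the font tags for style sheets, or a server above mine drops out of the table, this should carry on working. If they add or remove a column, I still have to look at which element of @stuff is which.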

If any of you guys that are regular users of one of the HTML::x modules feel like showing me how this should be done, I'd love to see the code. To all those that haven't tried doing something similar to this using one of the HTML::x modules, please don't advocate their use to others until you have.

In reply to Being a heretic and going against the party line. by BrowserUk
