http://www.perlmonks.org?node_id=53294

Jrchak has asked for the wisdom of the Perl Monks concerning the following question:

Hey now, just a little question. OK, I understand regular expressions and all, but what I am looking to do is parse HTML. Sounds simple enough. Here's an example of what I want to do. Say there is a static web page with some HTML that looks like this:
<p><b>Location:</b> Northern Africa, bordering the Mediterranean Sea, between Egypt and Tunisia
<p><b>Geographic coordinates:</b> 25 00 N, 17 00 E
<p><b>Map references:</b> Africa
<p><b>Area:</b>
<br><i>total:</i> 1,759,540 sq km
<br><i>land:</i> 1,759,540 sq km
<br><i>water:</i> 0 sq km
<p><b>Area - comparative:</b> slightly larger than Alaska
<p><b>Land boundaries:</b>
<br><i>total:</i> 4,383 km
<br><i>border countries:</i> Algeria 982 km, Chad 1,055 km, Egypt 1,150 km, Niger 354 km, Sudan 383 km, Tunisia 459 km
<p><b>Coastline:</b> 1,770 km
How would I parse this to enter it into a database, so that under Location would go the appropriate data, and so on? And how would I best retrieve the HTML, say from http://www.foobar.com/poopy/data.html, and then parse it into a database? Thank you if you can help out at all.

Replies are listed 'Best First'.
Re: So Simple, Yet no tutorial covers it
by chromatic (Archbishop) on Jan 21, 2001 at 09:32 UTC
    You would use HTML::Parser or HTML::TokeParser.

    Parsing HTML with a regex is a good way either to become the next Buddha or to lose your hair prematurely. It is very difficult to handle all but the most trivial HTML correctly.

    Retrieving the web page is also very simple, if you use LWP::Simple.

    Each of these modules comes with prodigious documentation, and your examples are simple enough that the given examples will work with only slight modifications.

      You would use HTML::Parser or HTML::TokeParser.

      I have to disagree with this bit of advice. Using a fully-fledged HTML parser, IMO, usually does not help in extracting structured data from an HTML page. Parsing and extracting the markup often adds an unneeded layer of complexity to the task, and offers little in the way of additional "resilience" to changes in page design/layout (one can reasonably argue that parsing markup results in more sensitivity to such changes).

      I agree that parsing html with regex is a bad idea, but it is not the html that one is generally interested in. The html says little or nothing about the structure of the data inside the document. If the document in question is script-generated, I suggest that you simply grab the data using whatever sensible regex you can come up with, and don't mess with parsing the html out of it.
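As a rough illustration of the "just grab the data" approach, here is a minimal sketch in core Perl. It assumes the page keeps the exact <p><b>Label:</b> value shape shown in the question; the $html fragment and the %field hash are made up for the example, and the pattern would need adjusting for any other layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A fragment in the same shape as the page quoted in the question.
my $html = <<'HTML';
<p><b>Location:</b>
Northern Africa, bordering the Mediterranean Sea
<p><b>Geographic coordinates:</b>
25 00 N, 17 00 E
<p><b>Coastline:</b>
1,770 km
HTML

# Capture each bold label and the text that follows it, stopping at the
# next <p> or at the end of the input. /s lets "." cross newlines.
my %field;
while ($html =~ m{<p><b>([^<]+):</b>\s*(.*?)\s*(?=<p>|\z)}gs) {
    my ($label, $value) = ($1, $2);
    $value =~ s/<[^>]+>//g;    # drop inline tags such as <br> and <i>
    $value =~ s/\s+/ /g;       # collapse runs of whitespace
    $field{$label} = $value;
}

print "$_ => $field{$_}\n" for sort keys %field;
```

The hash keys are the labels as they appear on the page, so a new field on the page shows up as a new key without any code change.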

        If the problem domain were no more complex than the original poster's example, I personally wouldn't use HTML::Parser either.

        It's always hard to gauge the knowledge level of a poster, and it's tempting to assume that people can walk the fine line between hand-rolling a simple, one-shot solution and painting themselves into a corner.

        I'd rather give a beginning programmer another tool to use than recommend a tricky use of an existing tool.

        Different approach, judgment call on the part of anyone who answers the question. (If more people used HTML markup to express semantic divisions of a document, parsing would certainly be more useful).

      thanks
      My own preference is to turn the HTML into valid XML and then parse the XML. There are a number of ways to do this. One way is to use CPAN modules to change the HTML into POD and the POD into XML. I do not know of a module which will change HTML into XML directly.

      Once I have XML I can then parse using XML::DOM. That way I can concentrate on learning how to use XML::DOM well by using it for all markup, both HTML and XML.

      I respect the work that went into making HTML::Parser. But the next version of HTML called XHTML is going to be XML compliant. All of the XML parsers will work on it, and I do not see why we would be needing a specialized HTML parser.

        I do not know of a module which will change HTML into XML directly

        Install XML::PYX and HTML::Parser, then do:

        pyxhtml myfile.html | pyxw > myfile.xhtml

        Et voila!

        Of course, if your original HTML is full of font tags and the like, it might be quite difficult to do anything even with the XML version.

        And, as merlyn would mention if he were around, development on HTML has stopped at the W3C; XHTML is the official successor, and it _is_ XML compliant (although nothing is quite that simple, and some XML tools, especially editors, might not quite like it).

        And now it is time for one of my pet peeves: as for using XML::DOM, quite frankly I don't know anybody who has used the DOM and who likes it. And that includes me and several other people who have written DOM tutorials! If you're looking at XML modules, XML::PYX is great for simple processing, XML::Simple would also work for something like the original format, of course my own XML::Twig works fine, and finally, if you want to use DOM-like objects, XML::XPath is probably much easier to use, besides being faster, more powerful and better supported. OK, I think that was the last time I'll bother you with my anti-DOM crusade! Actually I am not alone there: the people who wrote JDOM used to have a "Say NO to DOM" logo.

(jeffa) Re: So Simple, Yet no tutorial covers it
by jeffa (Bishop) on Jan 21, 2001 at 10:44 UTC
    Here is some code for you to study. It uses the advice given by chromatic - chalk it up to another boring Saturday night. :)

    Keep in mind that this is not an exact science; there is a bit of art involved, mainly in finding a way to extract the data into the fields you wish to store without having to write code that is unmaintainable. My solution uses a method that I am not particularly fond of, hard-coded array indexes, but it works for the example you gave. It will break if the layout of the web pages you are parsing changes from page to page.

    Also, I do not bother with any actual database code, since I do not know what your tables look like.

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::Parser;

    # get the content of the web page
    my $content = get("http://www.foobar.com/poopy/data.html");
    die "could not fetch the page\n" unless defined $content;

    # instantiate a new parser and let it crunch our data
    my @lines;
    my $parser = MyParser->new;
    $parser->parse($content);
    $parser->eof;    # tell the parser there is no more input

    # now the data is in @lines - there is more than one way
    # to do this - if you know that __EVERY__ web page will
    # have the __SAME__ layout, you can hardcode your indexes
    # for example, at this point @lines looks like this:
    #
    #  0 - Location:
    #  1 - Northern Africa, bordering the Mediterranean Sea, between Egypt and Tunisia
    #  2 - Geographic coordinates:
    #  3 - 25 00 N, 17 00 E
    #  4 - Map references:
    #  5 - Africa
    #  6 - Area:
    #  7 - total:
    #  8 - 1,759,540 sq km
    #  9 - land:
    # 10 - 1,759,540 sq km
    # 11 - water:
    # 12 - 0 sq km
    # 13 - Area - comparative:
    # 14 - slightly larger than Alaska
    # 15 - Land boundaries:
    # 16 - total:
    # 17 - 4,383 km
    # 18 - border countries:
    # 19 - Algeria 982 km, Chad 1,055 km, Egypt 1,150 km, Niger 354 km, Sudan 383 km, Tunisia 459 km
    # 20 - Coastline:
    # 21 - 1,770 km
    #
    # so now I can store my variables by accessing the proper index
    # in the array: (only the first 5 - you do the rest :)
    my ($location, $coords, $refs, $total_area, $land_area);
    $location   = $lines[1];
    $coords     = $lines[3];
    $refs       = $lines[5];
    $total_area = $lines[8];
    $land_area  = $lines[10];

    # not pretty - but it works for the example

    # insert into database - you will have to implement
    # your own subroutine for this
    insert_row($location, $coords, $refs, $total_area, $land_area);

    # placeholder so the script runs as posted - replace it with
    # your real database code
    sub insert_row {
        print join(' | ', @_), "\n";
    }

    ####################################################################
    # package MyParser - inheritance and event-driven programming
    # are the things to study if you want to understand how this works
    ####################################################################
    {
        package MyParser;
        use base qw(HTML::Parser);

        # override the text sub to simply store the
        # plain text of the content in a linear fashion
        sub text {
            my ($self, $origtext) = @_;

            # first, remove any leading and trailing white space
            $origtext =~ s/^\s+//;
            $origtext =~ s/\s+$//;

            # finally, only store the line if it's not empty
            push(@lines, $origtext) if $origtext =~ m/\w+/;
        }
    }
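One way to get away from the hard-coded indexes is to key each value by the label that precedes it. The sketch below is only that, a sketch: the @lines array is a shortened copy of the listing above, %field is a made-up name, and the rule that capitalized labels are headings while lowercase labels (total, land) are sub-fields is an assumption read off the sample page, not something the parser knows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened copy of the @lines array produced by the parser above.
my @lines = (
    'Location:',  'Northern Africa, bordering the Mediterranean Sea',
    'Area:',      'total:', '1,759,540 sq km', 'land:', '1,759,540 sq km',
    'Coastline:', '1,770 km',
);

my (%field, $section, $label);
for my $line (@lines) {
    if ($line =~ /^(.+):$/) {
        my $name = $1;
        if ($name =~ /^[A-Z]/ or not defined $section) {
            # Assumption: capitalized labels start a new top-level field.
            $section = $label = $name;
        }
        else {
            # Assumption: lowercase labels are sub-fields of the last heading.
            $label = "$section $name";
        }
    }
    elsif (defined $label) {
        # A non-label line is the value for the most recent label.
        $field{$label} = $line;
        undef $label;
    }
}

print "$_ => $field{$_}\n" for sort keys %field;
```

With this layout, $field{'Area total'} replaces $lines[8], and a page that adds or drops a field no longer shifts every index after it.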

    Jeff

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    F--F--F--F--F--F--F--F--
    (the triplet paradiddle)
    
      Exactly what I'm looking for. It might be sacrilege to say it here, but I didn't really care how I did this (VBScript, whatever). An initial search of the web led me to www.asptoday.com and an article, 'Parsing Web Pages From ASP Through HTTP' by Randall Kindig, which I would have had to pay $5 for, or subscribe to their site at $10 a month, which is kind of irritating. PerlMonks is an excellent site with lots of freely shared information, and it has saved me a significant amount of time on a couple of things so far.
Re: So Simple, Yet no tutorial covers it
by Jrchak (Initiate) on Jan 22, 2001 at 07:21 UTC
    OK, I wrote this code, just a simple regex thing to extract the "Location" data. Here it is:
    #!/usr/local/bin/perl -w
    use LWP::Simple;

    #program starts
    $totalAdress = "http://www.foobar.com/htmlschuff/data.html";
    $content = get($totalAdress) || die "doesn't seem to be a valid adress: $!";

    #Location
    $start = "<p><b>Location:</b>\n";
    $end = "<p>";
    if ($content =~ /\b$start\b(.*?)\b$end\b/) {
        $Location = $1;
        print $Location;
    }
    print "\ndone\n";
    And it won't work. Now, I'm a newbie, OK? So it's probably that the regex can't take variables, but I'm not sure. When I run this code against a real page that has the same data as the original example, this is all the output I get:
    $ ./thecode.pl
    done
    $
    Any ideas? I'm sure it's some newbie mistake, but it beats me.
      Change your if conditional line with the regex to this:
      if ($content =~ /$start\s*(.*?)\s*$end/) {
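      You really don't need the word boundaries, and you should always expect possible white space. \b only matches between a word character and a non-word character, so a \b in front of the < that begins $start fails unless a letter, digit, or underscore comes right before it.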
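To see the difference between the two patterns, here is a small self-contained sketch. The $content string is a made-up stand-in for the fetched page, $start is written without its trailing newline so that \s* can absorb it, and the \Q...\E quoting is my addition (it keeps characters in the variables from being read as regex syntax); none of that is from the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up stand-in for the page that get() would return.
my $content = "<html><body>\n<p><b>Location:</b>\nNorthern Africa\n"
            . "<p><b>Coastline:</b>\n1,770 km\n";

my $start = "<p><b>Location:</b>";
my $end   = "<p>";

# With \b: the "<" of $start follows a newline, so both sides of the
# boundary are non-word characters and \b cannot match there.
my $with_boundaries = $content =~ /\b\Q$start\E\b/ ? 1 : 0;

# Without \b: \s* soaks up the whitespace around the value, and /s lets
# "." cross newlines in case the value spans more than one line.
my ($location) = $content =~ /\Q$start\E\s*(.*?)\s*\Q$end\E/s;

print "with \\b: ", ($with_boundaries ? "match" : "no match"), "\n";
print "without \\b: ", (defined $location ? $location : "no match"), "\n";
```

Variables interpolate into a regex just fine, so the original suspicion about variables was not the problem; the boundaries were.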

      Remember, TIMTOWTDI

      Jeff

        And take out .*?; you probably want .+? instead, since this will grab at least one character for sure. Also, adding an else clause after the if would be a good way to notify the user if there is no match, or to provide a default location, like 'no location given'.
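Both suggestions can be combined in a few lines. This is a sketch only: extract_location is a hypothetical helper name, the test strings are made up, and the pattern is written without /s, so it assumes the value sits on a single line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the Location text, or a default when the
# field is missing or empty.
sub extract_location {
    my ($content) = @_;

    # .+? insists on at least one character between the label and the
    # next <p>, so an empty field falls through to the else branch.
    if ($content =~ m{<p><b>Location:</b>\s*(.+?)\s*<p>}) {
        return $1;
    }
    else {
        return 'no location given';
    }
}

print extract_location("<p><b>Location:</b>\nNorthern Africa\n<p>"), "\n";
print extract_location("<p><b>Location:</b><p>"), "\n";
print extract_location("<p><b>Coastline:</b> 1,770 km <p>"), "\n";
```

Returning a default keeps the calling code simple, while printing a warning in the else branch instead would make a missing field easier to notice.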