PerlMonks  

how to quickly parse 50000 html documents?

by brengo (Acolyte)
on Nov 25, 2010 at 19:16 UTC ( #873713=perlquestion )
brengo has asked for the wisdom of the Perl Monks concerning the following question:

Hey monks, I'd like to fill a database with values that I grab from 50000 html documents. There is no API available and I can't decide what method to use to parse the html structure.

Right now I have saved all the files locally (later on a direct access via web would be great) and they look like this:

... (the usual html, head, body tags, a table, some text)
<table width=75%><tr><td width=50%><table width=95%><tr><td width=45% valign=top>
<table width=100% cellspacing=0 cellpadding=0><tr bgcolor=#DFDFDF><td colspan=2 height=30><font size=4><center>tool1_name</center></font></td></tr>
<tr bgcolor=#999999><td width=70%> <b>heading_1</b> </td><td width=30%></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>drill diameter:</font></td> <td><font size=1>936</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>drill depth:</font></td> <td><font size=1>20</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>drill speed:</font></td> <td><font size=1>4</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>drill material:</font></td> <td><font size=1>506</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>height:</font></td> <td><font size=1>502</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>width:</font></td> <td><font size=1>6</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>angle:</font></td> <td><font size=1>2.76</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>cooling liquid:</font></td> <td><font size=1>14</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>manufactured in:</font></td> <td><font size=1>27</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>lane code:</font></td> <td><font size=1>76</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>quality test 1:</font></td> <td><font size=1>581 (11.4%)</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>quality procedure:</font></td> <td><font size=1>19,021</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>quality test 2:</font></td> <td><font size=1>843 (90.1%)</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>package worth:</font></td> <td><font size=1>$257,524</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>single unit worth:</font></td> <td><font size=1>$90,945</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>colour:</font></td> <td><font size=1>48</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>coating:</font></td> <td><font size=1>2,602</font></td></tr>
</table><br>
<table width=100% cellspacing=0 cellpadding=0><tr bgcolor=#999999><td width=70%> <b>sells</b> </td><td width=30%></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>sold this month:</font></td> <td><font size=1>118</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>sold in plant A:</font></td>
(...)

There are about 110 unique values in 12 tables that I have to grab. Each page always contains two sets of these values: first the values (110 values in 12 tables) of a reference drill, then the values that are interesting to me.

So how do I parse these files quickly, reading all these values (stripped of dollar signs, commas, percentages) as quickly as possible?

I guess I'd use File::Slurp to store a file in a scalar, then HTML::TableExtract (How do I get the second occurrence?)? Or should I use a regex (how do I get the second occurrence?)? Or a template (how?)?

I'd be very grateful for your ideas and I really would appreciate code snippets, as I am really new to Perl (replacing a bash script (yep) now...).

Thanks!

Re: how to quickly parse 50000 html documents?
by chrestomanci (Priest) on Nov 25, 2010 at 22:05 UTC

    There was a discussion thread about this a couple of weeks ago.

    Short answer:

    • Don't use bare regular expressions unless the html pages are very simple and very consistent, as you will drive yourself mad trying to catch all the corner cases, and you will end up writing a buggy HTML parser.
    • Instead go to CPAN and download an HTML parser that someone else has already written and debugged. HTML::TreeBuilder and HTML::TokeParser::Simple both come highly recommended.
    • You should use a GUI HTML tree inspector such as Firebug, or the inspect element tool in Google Chrome, to tell you where the elements you are looking for are in the HTML structure.

    PS: It won't be that quick. In my experience, HTML::TreeBuilder takes a significant fraction of a second to load an average HTML document, so multiply that by 50_000 files and you are looking at a day or so of processing time. Also you need to explicitly delete html trees once you are done with them, as the data structures generated by HTML::TreeBuilder will not be freed automatically when they go out of scope, and each will consume several megabytes of RAM.

      Thanks for all the links! Yes, I already looked a bit at HTML::TreeBuilder but as I understand zilch of it now I wanted to be sure that this is the right tool. The fact that the trees look powerful but with a huge overhead made me ask whether a regex would be faster. The web pages look consistent and a regex to give the number after the second occurrence of "drill width:" should do fine.
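
      For what it's worth, the "second occurrence" idea can be sketched in a few lines of Perl. This is only a sketch under assumptions: the helper name nth_value is made up for illustration, and the pattern assumes every label/value pair is laid out exactly like the rows in the sample above.

```perl
use strict;
use warnings;

# Return the value that follows the $n-th occurrence (1-based) of "$label:"
# in $html, or undef if the label occurs fewer than $n times.
# nth_value is a hypothetical helper, not part of any module.
sub nth_value {
    my ( $html, $label, $n ) = @_;
    my @found = $html =~ m{
        <font\ size=1> \Q$label\E : </font> </td> \s*
        <td> <font\ size=1> ([^<]+) </font>
    }gx;
    return $found[ $n - 1 ];
}
```

      With a whole page in $html, nth_value( $html, 'width', 2 ) would give the width of the second (wanted) drill rather than the reference one.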

      Using a combination of grep|sed|tr|, each looped over each of the variables and all the files, my crapcode takes about 28hrs right now so everything that makes it faster is welcome.

        I would agree that HTML::TreeBuilder looks daunting, but it is not that hard to use once you are used to it. Here is a snippet from a script I wrote recently that uses HTML::TreeBuilder to pull some data out of a table. (Feel free to copy it if you like.)

        use strict;
        use warnings;
        use HTML::TreeBuilder;

        sub parseResPage {
            my ( $rawHTML ) = @_;

            my $tree = HTML::TreeBuilder->new_from_content( $rawHTML );

            my @tables    = $tree->look_down('_tag', 'table');     # We want the second table
            my @tableRows = $tables[1]->look_down('_tag', 'tr');   # First row is headings, then the data

            my $headRow = shift @tableRows;
            my @headings;
            my $res_hash;

            my @cells = $headRow->look_down('_tag', 'td');
            push @headings, $_->as_text() foreach (@cells);

            foreach my $mainRow ( @tableRows ) {
                my @cells = $mainRow->look_down('_tag', 'td');
                my $iface = $cells[0]->as_text();
                for( my $i=0; $i<scalar@cells; $i++ ) {
                    $res_hash->{$iface}{ $headings[$i] } = $cells[$i]->as_text();
                }
            }

            # Explicitly free the memory consumed by the tree.
            $tree->delete();

            return $res_hash;
        }

        Tip: If you are not already familiar with the perl command line debugger then now is the time to learn. When I am working with HTML::TreeBuilder code, my usual approach is to write a script that just loads the tree and sets a break point afterwards, and then start running $tree->look_down() commands interactively until I find a combination that gives me what I am looking for. I then paste that back into my editor and use it in my script.

        I suspect that if you write a script that uses HTML::TreeBuilder then it will probably end up being slower than your simple grep based script. HTML::TreeBuilder is well optimised Perl written by some clever people, but it contains lots of code to handle malformed HTML and other corner cases, so it will be slower than a simple regular expression based script. Why are you so concerned about speed anyway? How much time have you spent on writing these scripts already?

Re: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
by BrowserUk (Pope) on Nov 25, 2010 at 22:43 UTC

    For counterpoint, I ran your snippet through the following one liner and got pretty much what you need:

    >perl -nle"m[<font size=1>([^<]+)</font></td></tr>] and print $1" junk.txt
    936
    20
    4
    506
    502
    6
    2.76
    14
    27
    76
    581 (11.4%)
    19,021
    843 (90.1%)
    $257,524
    $90,945
    48
    2,602
    118

    Your description says each page consists of two same-sized sets of these values, so just discard the first half. You don't want the dollar signs, commas or percentages, so post-process to remove them.

    People will tell you that this is fragile, and will break if the page is changed. But any solution will break if the pages change, and given how simple this is, it will probably be quicker to fix this than any solution that relies upon fuzzy parsing of a whole heap of stuff that you have no interest in whatsoever.

    Just as I don't bother reading the stories on the newspaper my fish&chips comes wrapped in before eating; I don't bother parsing a bunch of html I've no interest in. Ie. Don't parse; simply extract.

    It will certainly be a whole heap faster. Given your description of the size of the files, I estimate that the above should be able to process each page in about 1/20th of a second, giving you a total time of about 40 minutes instead of your current 28hrs.

    Update: I revise my estimate to just over 3 minutes based upon running this code:

    #! perl -nlw
    use strict;
    use Time::HiRes qw[ time ];

    BEGIN{ @ARGV = map glob, @ARGV }

    local $/;
    my $start = time;
    while( <> ) {
        my @vals;
        while( m[<font size=1>([^<]+)</font></td></tr>]g ) {
            my $val = $1;
            $val =~ tr[$,][]d;
            $val =~ s[^\s*([0-9.]+).+$][$1]e;
            push @vals, $val;
        }
        print "@vals[ @vals /2 .. $#vals ]";
    }
    print time-$start;

    Over 1000 copies of a mocked up file containing 10 copies of your snippet (5 as the reference; 5 as the wanted) in 4 seconds:

    C:\test>873713 junk*.txt
    ...
    93 2 4 50 50 6 2.7 1 2 7 581 1902 843 25752 9094 4 260 93 2 4...
    93 2 4 50 50 6 2.7 1 2 7 581 1902 843 25752 9094 4 260 93 2 4...
    93 2 4 50 50 6 2.7 1 2 7 581 1902 843 25752 9094 4 260 93 2 4...
    93 2 4 50 50 6 2.7 1 2 7 581 1902 843 25752 9094 4 260 93 2 4...
    4.07200002670288
    ^Z
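
    A note on the output above: it shows 93 where the sample has 936, and 1902 for 19,021. That looks like a side effect of the .+ in the substitution, which insists on at least one character after the number, so a bare number gives up its last digit. The trailing ^Z is most likely the -n switch wrapping the script in a second read loop that falls through to STDIN once the files are exhausted. A hedged variant without those two quirks might look like this; the sub name wanted_values is made up for illustration, and it assumes the page has already been slurped into a string:

```perl
use strict;
use warnings;

# Extract the cell values from one page and return only the second half
# (the first half belongs to the reference drill).
# wanted_values is a hypothetical helper name.
sub wanted_values {
    my ($html) = @_;
    my @vals;
    while ( $html =~ m[<font size=1>([^<]+)</font></td></tr>]g ) {
        my $val = $1;
        $val =~ tr/$,//d;                  # drop dollar signs and thousands separators
        $val =~ s/^\s*([0-9.]+).*$/$1/s;   # keep the leading number; .* may match nothing
        push @vals, $val;
    }
    return @vals[ @vals / 2 .. $#vals ];
}
```

    Called as my @vals = wanted_values($html); on the contents of one file, it returns the cleaned values of the second set only.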

    Even if the page layout changes, the 27 hrs 57 minutes you saved each time you need to do this, should cover the 5 minutes it will take to re-write it :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Wow. Just wow. Thank you for these lines and great ideas ("discard first half of matches" and use regex)!

      Just a small thing: the regex gives me the whole line instead of just the values back when running it (what did I miss?):

      $ perl -nle"m[<font size=1>([^<]+)</font></td></tr>] and print $1" junk.html
      <tr bgcolor=#DFDFDF><td><font size=1>drill diameter:</font></td> <td><font size=1>936</font></td></tr>
      <tr bgcolor=#CCCCCC><td><font size=1>drill depth:</font></td> <td><font size=1>20</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>drill speed:</font></td> <td><font size=1>4</font></td></tr>
      <tr bgcolor=#CCCCCC><td><font size=1>drill material:</font></td> <td><font size=1>506</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>height:</font></td> <td><font size=1>502</font></td></tr>
      <tr bgcolor=#CCCCCC><td><font size=1>width:</font></td> <td><font size=1>6</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>angle:</font></td> <td><font size=1>2.76</font></td></tr>
      <tr bgcolor=#CCCCCC><td><font size=1>cooling liquid:</font></td> <td><font size=1>14</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>manufactured in:</font></td> <td><font size=1>27</font></td></tr>
      <tr bgcolor=#CCCCCC><td><font size=1>lane code:</font></td> <td><font size=1>76</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>quality test 1:</font></td> <td><font size=1>581 (11.4%)</font></td></tr>
      <tr bgcolor=#CCCCCC><td><font size=1>quality procedure:</font></td> <td><font size=1>19,021</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>quality test 2:</font></td> <td><font size=1>843 (90.1%)</font></td></tr>
      <tr bgcolor=#CCCCCC><td><font size=1>package worth:</font></td> <td><font size=1>$257,524</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>single unit worth:</font></td> <td><font size=1>$90,945</font></td></tr>
      <tr bgcolor=#CCCCCC><td><font size=1>colour:</font></td> <td><font size=1>48</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>coating:</font></td> <td><font size=1>2,602</font></td></tr>
      <tr bgcolor=#DFDFDF><td><font size=1>sold this month:</font></td> <td><font size=1>118</font></td></tr>

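        Going by your prompt, you are running on some kind of *nix system, in which case you should swap the "s for 's. Inside double quotes the shell replaces $1 with its own (empty) positional parameter before perl ever sees it, so perl ends up running a bare print, which prints the whole matching line.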

      the example html is truly appaling, using outdated html attributes for presentation, and not providing any elements that specify document structure. it is just the conditions that these sorts of html's live in, that make it conducive to break frequently and unexpectedly...as the person owning the document uses frontpage (or worse) msword to generate html content, with outdated html attributes all over the place. as there's no real structural html elements, with just look/feel elements and data intermixed, changes to document by owner are usually very naive in terms of valid/sane html. for example you may start seeing at some stage several empty opening and closing font tags. still looks exactly same as before, but the naive generation of html using msword or older frontpage, now breaks the scraping code and you're back at square one trying to figure it out.
      whilst scraping (sometimes very bad) html is inevitable, the way you go about it can make some difference. basing a regex for html scraping on the value of a particular attribute is particularly bad, e.g. don't look for "font size="1">"....if you must base it on the font tag, just look for the tag and nearest closing brace, as an anchor.
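
      as a rough illustration of anchoring on the tag rather than on its attributes, a pattern along these lines accepts any attributes (or none) on the font tag. the sub name cell_values is hypothetical, and the sketch still assumes the value is the last thing in its table row:

```perl
use strict;
use warnings;

# Capture the text of every <font ...> element that closes a table row,
# whatever attributes the font tag carries. Call it in list context.
# cell_values is a hypothetical helper name.
sub cell_values {
    my ($html) = @_;
    return $html =~ m{<font\b[^>]*>([^<]+)</font>\s*</td>\s*</tr>}gi;
}
```

      a change from size=1 to size="1", or an added color attribute, would then no longer break the match, though a restructured table still would.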
      i've worked with a federated searching java based engine for some time, and it is exactly when the vendor wrote scraping code to match frequently changing html (e.g. html attributes) that often ended up breaking these scrapers. so instead of moving onto bigger and better things..you end up maintaining a whole lot of scrapers that break all the time, and you're at the pointy end of the "fix it now".
      in my opinion the example html is so bad as to be practically of no use, and you might as well use module or whatever to strip html altogether, and just base the scraping of the well defined terms that have a following colon.
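
      a minimal sketch of that idea, assuming each label/value pair sits in its own table row and that no attribute value contains a ">" (the tag stripping here is deliberately crude, and the sub name label_values is made up):

```perl
use strict;
use warnings;

# Turn each table row into one line of plain text, then read "label: value".
# Returns a list of [ label, value ] pairs; a hash would lose the repeated
# labels of the reference set. label_values is a hypothetical helper name.
sub label_values {
    my ($html) = @_;
    ( my $text = $html ) =~ s{</tr\s*>}{\n}gi;   # one row per line
    $text =~ s{<[^>]*>}{ }g;                      # crude tag stripping
    my @pairs;
    for my $line ( split /\n/, $text ) {
        next unless $line =~ /^\s*([^:]+?)\s*:\s*(\S.*?)\s*$/;
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}
```

      rows without a colon, such as the heading rows, are simply skipped.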
      btw when i mentioned "naive" html document owner, i don't mean to be nasty, just means they don't know or do any better for whatever reason. it's naive in terms of using html with regard to spec and current best practice.
      the hardest line to type correctly is: stty erase ^H
        in my opinion the example html is so bad as to be practically of no use, and you might as well use module or whatever to strip html altogether, and just base the scraping of the well defined terms that have a following colon.

        There is a simple maxim taught to me by my first boss in programming: don't do what won't benefit you.

        All we have to go on is that bad html snippet the OP posted. In all likelihood, all he has to go on is that html snippet grabbed from whatever website it came from. We could try to predict what might happen in the future and cater for it, but the highest probability is that whatever we guess will be wrong.

        The only sensible thing to do is work with what we know. And what we know for now is that the simple regex used works. If, in the future it changes, then the 5 minutes it took to construct the program above may be required to be repeated. If it then changes again, maybe there would be some pattern to the change that might suggest a better approach. But, it might never change; and any effort expended now to try and cater for unknown changes that might never happen would be entirely wasted.

        If these numbers were embedded in a plain text document, no one here would blink an eye about using regex. But add a few <> into the mix and suddenly many start trotting out cargo-cult wisdoms: "Don't parse HTML/XML/XHTML/whatever with regex"; completely missing that most of the time nobody wants to parse the html; just extract some small subset of text from a larger set of text. Ie. They want to do exactly what regex are designed to do.

        basing a regex for html scraping on the value of a particular attribute is particularly bad, e.g. don't look for "font size="1">"....if you must base it on the font tag, just look for the tag and nearest closing brace, as an anchor.

        I'll take your word for the quality or lack thereof of the html, because I neither know nor care. It's just text within text to me.

        For now, what I've suggested to the OP works. And it works 500 times more quickly than his existing solution. If he gets to use it once before the source changes, he can afford to spend 3 working days re-writing it and still have gained. And it took me less than 5 minutes to write this version and maybe 10 to test it; most of which was taken up generating 1000 test pages. If he gets to use it 10 times, he's saved himself enough time to take a month's vacation.
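
        For anyone who wants to check the "500 times" figure, the arithmetic can be redone from the numbers quoted in this thread (28 hours for the existing script, about 4.07 seconds for 1000 mock files). It assumes the real pages cost about as much per file as the mock ones, which is only a guess:

```perl
use strict;
use warnings;

my $files       = 50_000;
my $old_seconds = 28 * 3600;             # existing grep/sed run: about 28 hours
my $per_file    = 4.072 / 1000;          # measured: 1000 mock files in about 4.07 s
my $new_seconds = $files * $per_file;    # projected run time for all files
my $speedup     = $old_seconds / $new_seconds;

printf "projected: %.1f s (%.1f min), speedup about %.0fx\n",
    $new_seconds, $new_seconds / 60, $speedup;
# prints "projected: 203.6 s (3.4 min), speedup about 495x"
```

        So "just over 3 minutes" and "500 times" are both consistent with the measurement.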

        It's simple. It works. Job done. And if it requires change next week, or next month or next year, it is simple enough that it won't require deep knowledge of half a dozen co-dependent packages and APIs in order to fix it.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        appaling (sic), you say?

        Well, the nested tables are awkward and the use of various outdated or deprecated tags is unfortunate; the lack of quotes and the like can certainly be labeled "mistakes." But "appalling" is a pretty strong word. Perhaps "dated" or similar would be better.

        ...so bad as to be practically of no use.

        Even harsher (and IMO, excessive), particularly since what we know about the html fails to support any inference that OP bears any responsibility.

        There is, however, a valuable nugget that saves your post from a quick downvote -- the notion that future changes could break a regex solution. OTOH, any solution we can readily offer today would also be broken were the html converted to 100% compliant xml.

Re: how to quickly parse 50000 html documents?
by JavaFan (Canon) on Nov 25, 2010 at 23:11 UTC
    That seems like a pretty regular structure. If you know that all the documents look like that, you can extract the values with a handful of simple regular expressions.

    However, if the HTML documents can contain just about anything, including comments and attribute values that have content that looks like HTML, you'd need a full parser. You first have to parse your HTML, then parse the resulting structure, looking for a table that contains your data. This may be hard - the document could contain hundreds of tables, and you'll have to find the right one.

Re: how to quickly parse 50000 html documents?
by afoken (Parson) on Nov 26, 2010 at 10:48 UTC

    So how do I parse these files quickly, reading all these values (stripped of dollar signs, commas, percentages) as quickly as possible?

    I guess I'd use File::Slurp to store a file in a scalar, then HTML::TableExtract (How do I get the second occurrence?)? Or should I use a regex (how do I get the second occurrence?)? Or a template (how?)?

    Well, I'm tempted to answer "start by parsing one file, repeat that for the remaining 49,999 files".

    No, really. Start with one HTML file, write readable code, DON'T optimize AT ALL. Use whatever seems to be reasonable. Don't slurp files yourself if the parsing module has a function to read from a file. Check whether your code works with a second HTML file, and a third. Fix bugs. Still, DON'T optimize.

    svn commit (you may also use CVS, git, whatever. But make sure you can get back old versions of your code.)

    Now, install Devel::NYTProf, and run perl -d:NYTProf yourscript.pl file1.html followed by nytprofhtml. Open nytprof/index.html and find out which code takes the most time to run. Look at everything with a red background. Optimize that code, and only that code.

    Repeat until you find no more code to optimize.

    Repeat with several other HTML files.

    Be prepared to find modules (from CPAN) that are far from being optimized for speed. Try to switch to a different module if your script spends most of the time in a third-party module. Run NYTProf again after switching. Compare total time used before and after switching. Use whatever is faster. (For example, I learned during profiling that XML::LibXML was more than 10 times faster than XML::Twig with my problem and my data.)

    Repeat profiling with several files at once, find code that is called repeatedly without need to do so. Eliminate that code if it slows down processing.

    Note that HTML and XML are two different things that have very much in common. Perhaps XML::LibXML is able to parse your HTML documents (using the parse_html_file() method) well enough to be helpful, and faster than any pure Perl module could ever run. Check whether XML::LibXML can read your HTML documents at all, then compare the speed using NYTProf.

    <update>If you have a multi-processor machine, try to run several jobs in parallel. Have a managing process that keeps N (or 2N) worker processes working, where N is the number of CPU cores.</update>
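
    A very reduced sketch of the parallel idea, using only fork and waitpid from core Perl. Note that it is a simplification: it splits the file list into N fixed chunks up front instead of having a manager hand out work as workers become free, and the names partition and run_workers are invented for this example.

```perl
use strict;
use warnings;

# Split a list of file names into $n chunks, round-robin.
sub partition {
    my ( $n, @files ) = @_;
    my @chunks = map { [] } 1 .. $n;
    push @{ $chunks[ $_ % $n ] }, $files[$_] for 0 .. $#files;
    return @chunks;
}

# Fork one worker per non-empty chunk; each worker calls $job->($file)
# for every file in its chunk. Returns how many workers exited cleanly.
sub run_workers {
    my ( $n, $job, @files ) = @_;
    my @pids;
    for my $chunk ( partition( $n, @files ) ) {
        next unless @$chunk;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {            # child: do the work, then leave
            $job->($_) for @$chunk;
            exit 0;
        }
        push @pids, $pid;             # parent: remember the worker
    }
    my $ok = 0;
    for my $pid (@pids) {
        waitpid( $pid, 0 );
        $ok++ if $? == 0;
    }
    return $ok;
}
```

    A call such as run_workers( 4, \&process_file, glob 'pages/*.html' ) would then spread the files over four processes; process_file and the pages/ directory are placeholders for whatever per-file routine and location you end up with, and each worker has to write its own results somewhere (a file or the database), since nothing is passed back to the parent here.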

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Node Type: perlquestion [id://873713]
Approved by NiJo