PerlMonks  

Re^2: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)

by brengo (Acolyte)
on Nov 26, 2010 at 00:27 UTC (#873745)


in reply to Re: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
in thread how to quickly parse 50000 html documents?

Wow. Just wow. Thank you for these lines and great ideas ("discard first half of matches" and use regex)!

Just a small thing: when I run it, the regex gives me back the whole line instead of just the values (what did I miss?):

$ perl -nle"m[<font size=1>([^<]+)</font></td></tr>] and print $1" junk.html
<tr bgcolor=#DFDFDF><td><font size=1>drill diameter:</font></td> <td><font size=1>936</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>drill depth:</font></td> <td><font size=1>20</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>drill speed:</font></td> <td><font size=1>4</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>drill material:</font></td> <td><font size=1>506</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>height:</font></td> <td><font size=1>502</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>width:</font></td> <td><font size=1>6</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>angle:</font></td> <td><font size=1>2.76</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>cooling liquid:</font></td> <td><font size=1>14</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>manufactured in:</font></td> <td><font size=1>27</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>lane code:</font></td> <td><font size=1>76</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>quality test 1:</font></td> <td><font size=1>581 (11.4%)</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>quality procedure:</font></td> <td><font size=1>19,021</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>quality test 2:</font></td> <td><font size=1>843 (90.1%)</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>package worth:</font></td> <td><font size=1>$257,524</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>single unit worth:</font></td> <td><font size=1>$90,945</font></td></tr>
<tr bgcolor=#CCCCCC><td><font size=1>colour:</font></td> <td><font size=1>48</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>coating:</font></td> <td><font size=1>2,602</font></td></tr>
<tr bgcolor=#DFDFDF><td><font size=1>sold this month:</font></td> <td><font size=1>118</font></td></tr>


Re^3: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)
by BrowserUk (Pope) on Nov 26, 2010 at 00:54 UTC

    Going by your prompt, you are running on some kind of *nix system, in which case you should swap the double quotes (") around the one-liner for single quotes (').
