PerlMonks  

Re: how to quickly parse 50000 html documents? (Updated: 50,000 pages in 3 minutes!)

by BrowserUk (Pope)
on Nov 25, 2010 at 22:43 UTC (#873736)


  1. or download this
    >perl -nle"m[<font size=1>([^<]+)</font></td></tr>] and print $1" junk.txt
    936
    ...
    48
    2,602
    118
    
  2. or download this
    #! perl -nlw
    use strict;
    ...
    }
    
    print time-$start;
    
  3. or download this
    C:\test>873713 junk*.txt
    ...
    ...
    93 2 4 50 50 6 2.7 1 2 7 581 1902 843 25752 9094 4 260 93 2 4...
    4.07200002670288
    ^Z
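
For readers who want to try the pattern from the first snippet outside a one-liner, here is a minimal sketch of the same match as a standalone script. It assumes, as the one-liner does, that each value sits in a `<font size=1>` cell that closes a table row, and that at most one value per line is wanted. The helper name `extract_values` and the sample rows are made up for illustration; this is not the elided script from the second snippet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as the one-liner: capture the text of a size-1 <font>
# cell that is immediately followed by the end of the table row.
my $cell_re = qr{<font size=1>([^<]+)</font></td></tr>};

# Hypothetical helper (not from the original post): return the first
# captured value from each line that matches, skipping the rest.
sub extract_values {
    my @lines = @_;
    my @values;
    for my $line (@lines) {
        push @values, $1 if $line =~ $cell_re;
    }
    return @values;
}

# Made-up sample rows, for illustration only.
my @sample = (
    '<tr><td><font size=1>936</font></td></tr>',
    '<tr><td><b>no match here</b></td></tr>',
    '<tr><td><font size=1>2,602</font></td></tr>',
);

print "$_\n" for extract_values(@sample);
```

Because the pattern is anchored on literal markup rather than a parsed document tree, it only works while the pages keep exactly this tag layout; that trade-off is what makes a plain regex scan cheap enough to run over a large batch of files.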
    
