PerlMonks  

Re: how to quickly parse 50000 html documents?

by JavaFan (Canon)
on Nov 25, 2010 at 23:11 UTC (#873738)


in reply to how to quickly parse 50000 html documents?

That seems like a pretty regular structure. If you know that all the documents look like that, you can extract the values with a handful of simple regular expressions.
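As a rough illustration of the regex route, here is a minimal sketch in Python (the same pattern would carry over to Perl almost unchanged). The layout it assumes, two-cell rows with a label followed by a value, is hypothetical, since the actual documents from the question aren't reproduced here; `ROW`, `extract_fields` and the sample data are made-up names for the example.

```python
import re

# ASSUMPTION: each value sits in a two-cell row, label first, no attributes
# on the tags. Adjust the pattern to whatever the real documents look like.
ROW = re.compile(
    r"<tr>\s*<td>\s*(?P<label>[^<]+?)\s*</td>"
    r"\s*<td>\s*(?P<value>[^<]*?)\s*</td>\s*</tr>",
    re.IGNORECASE,
)

def extract_fields(html):
    """Return {label: value} for every row matching the assumed layout."""
    return {m.group("label"): m.group("value") for m in ROW.finditer(html)}

# Made-up sample document in the assumed layout.
sample = """
<table>
  <tr><td>Name</td><td>Widget</td></tr>
  <tr><td>Price</td><td> 9.99 </td></tr>
</table>
"""
fields = extract_fields(sample)
print(fields)
```

Compiling the pattern once and reusing it across all 50000 files keeps the per-document cost low. This only holds up as long as every document really does follow the same layout.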

However, if the HTML documents can contain just about anything, including comments and attribute values whose content looks like HTML, you'd need a full parser. You first have to parse the HTML, then walk the resulting structure, looking for the table that contains your data. This may be hard: the document could contain hundreds of tables, and you'll have to find the right one.
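For the parser route, a sketch along these lines may help. It uses Python's standard `html.parser` module rather than any particular Perl module, and it assumes the wanted table can be recognised by the text of its first row. That criterion, like the names `TableCollector` and `find_table`, is an assumption made for the example. Cells that themselves contain a nested table are not handled carefully.

```python
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, in the order they close
        self._stack = []   # tables that are currently open
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self._stack and self._stack[-1]:
                self._stack[-1][-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def find_table(html, required_headers):
    """Return the first table whose first row contains all required headers."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    wanted = set(required_headers)
    for table in parser.tables:
        if table and wanted <= set(table[0]):
            return table
    return None

# Made-up document: a commented-out table, a navigation table, an attribute
# value that looks like markup, and the table we actually want.
sample = """
<!-- <table><tr><td>Name</td><td>Price</td></tr></table> -->
<table><tr><td>Home</td><td>About</td></tr></table>
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td title="<td>fake</td>">Widget</td><td>9.99</td></tr>
</table>
"""
table = find_table(sample, ["Name", "Price"])
print(table)
```

The comment and the quoted attribute value are exactly the cases where a naive regex would go wrong. The parser skips both, and the header check picks the right table out of the several present.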

