
Re: how to quickly parse 50000 html documents?

by JavaFan (Canon)
on Nov 25, 2010 at 23:11 UTC (#873738)

in reply to how to quickly parse 50000 html documents?

That seems like a pretty regular structure. If you know that all the documents look like that, you can extract the values with a handful of simple regular expressions.
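To make the regex route concrete, here is a minimal sketch. It is written in Python rather than Perl, but the pattern carries over almost unchanged. The row layout (a label cell followed by a value cell) and the names `ROW_RE` and `extract_fields` are assumptions for illustration only; adjust the pattern to whatever your documents really contain.

```python
import re

# ASSUMPTION: every value of interest sits in a plain two-cell row,
# <tr><td>label</td><td>value</td></tr>, with no attributes or nested tags.
ROW_RE = re.compile(
    r"<tr>\s*<td>\s*(?P<label>[^<]+?)\s*</td>"
    r"\s*<td>\s*(?P<value>[^<]*?)\s*</td>\s*</tr>",
    re.IGNORECASE,
)


def extract_fields(html):
    """Return a dict mapping each label cell to its value cell."""
    return {m.group("label"): m.group("value") for m in ROW_RE.finditer(html)}


# Hypothetical sample document.
sample = (
    "<table><tr><td>Title</td><td>Example</td></tr>\n"
    "<tr><td> Year </td><td>2010</td></tr></table>"
)
fields = extract_fields(sample)
print(fields)
```

Compiling the pattern once and reusing it across all 50000 files keeps the per-document cost low, but this only holds up as long as the markup really is that regular.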

However, if the HTML documents can contain just about anything, including comments and attribute values whose content looks like HTML, you'd need a full parser. You first have to parse the HTML, then walk the resulting structure, looking for the table that contains your data. This may be hard: the document could contain hundreds of tables, and you'll have to find the right one.
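A rough sketch of the parser route, again in Python and using only the standard library's `html.parser`. The class `TableCollector`, the helper `find_table`, and the rule "the right table is the first one with a cell equal to a known label" are all assumptions for illustration; nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every table as a list of rows; each row is a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # text fragments of the currently open cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def find_table(html, required_label):
    """Return the first table with a cell equal to required_label, else None."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    for table in parser.tables:
        if any(required_label in row for row in table):
            return table
    return None


# Hypothetical document with a commented-out decoy, an unrelated table,
# and an attribute value that looks like markup.
sample = """
<!-- <table><tr><td>Title</td><td>decoy</td></tr></table> -->
<table><tr><td>Menu</td></tr></table>
<img alt="<td>Title</td>">
<table><tr><td>Title</td><td>Example</td></tr></table>
"""
table = find_table(sample, "Title")
print(table)
```

The comment and the attribute value are skipped because the parser tokenizes them as a comment and an attribute, not as tags. That is exactly the case where a handful of regexes would be fooled.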
