Re: how to quickly parse 50000 html documents?

by JavaFan (Canon)
on Nov 25, 2010 at 23:11 UTC (#873738)

in reply to how to quickly parse 50000 html documents?

That seems like a pretty regular structure. If you know that all the documents look like that, you can extract the values with a handful of simple regular expressions.
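As a rough illustration of the regex route, here is a minimal Python sketch. The markup it targets (two-cell table rows holding a label and a value) is only an assumption made for the example, since the actual documents are not shown in this reply; `ROW_RE` and `extract_pairs` are hypothetical names.

```python
import re

# Assumed layout (hypothetical): each record is a table row with exactly
# two cells, a label followed by a value, and no markup nested in the cells.
ROW_RE = re.compile(
    r"<tr>\s*<td>\s*([^<]+?)\s*</td>\s*<td>\s*([^<]*?)\s*</td>\s*</tr>",
    re.IGNORECASE,
)

def extract_pairs(html):
    """Return {label: value} for every row matching the assumed layout."""
    return {label: value for label, value in ROW_RE.findall(html)}

sample = """
<table>
<tr><td>Name</td><td>Widget</td></tr>
<tr><td>Price</td><td>9.99</td></tr>
</table>
"""
pairs = extract_pairs(sample)
print(pairs)
```

A precompiled pattern like this stays cheap across tens of thousands of files, but it silently skips any row that deviates from the assumed layout, so it is only safe when the documents really are that regular.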

However, if the HTML documents can contain just about anything, including comments and attribute values whose content looks like HTML, you'd need a full parser. You first have to parse the HTML, then walk the resulting structure, looking for the table that contains your data. This may be hard: the document could contain hundreds of tables, and you'll have to find the right one.
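The parse-then-search approach might look something like the sketch below, which uses Python's standard `html.parser` module. The class `TableCollector`, the helper `find_table`, and the rule for picking the "right" table (its first row contains a known header cell) are all assumptions made for illustration; real documents may need a different selection rule.

```python
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables seen so far, in document order
        self._stack = []   # currently open tables (simplistic nesting support)
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            table = []
            self.tables.append(table)
            self._stack.append(table)
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._stack and self._stack[-1]:
                self._stack[-1][-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "table" and self._stack:
            self._stack.pop()
            self._cell = None

def find_table(html, wanted_header):
    """Return the first table whose first row contains wanted_header, else None."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    for table in parser.tables:
        if table and wanted_header in table[0]:
            return table
    return None

sample = """
<!-- <table><tr><td>Price</td></tr></table> looks like HTML but is a comment -->
<table><tr><td>Menu</td><td>Home</td></tr></table>
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
</table>
"""
result = find_table(sample, "Price")
print(result)
```

The commented-out table in the sample is ignored because the parser treats it as a comment rather than markup, which is exactly the case a plain regex would get wrong.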
