
Re: how to quickly parse 50000 html documents?

by JavaFan (Canon)
on Nov 25, 2010 at 23:11 UTC (#873738)

in reply to how to quickly parse 50000 html documents?

That seems like a pretty regular structure. If you know that all the documents look like that, you can extract the values with a handful of simple regular expressions.
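As a rough illustration of the regex route, here is a minimal Python sketch. The markup in the question is not reproduced in this reply, so the two-column label/value table in `SAMPLE`, the `class` names, and the `extract_fields` helper are all made up for the example; adjust the pattern to whatever your documents really contain.

```python
import re

# Hypothetical sample markup: an assumption for illustration only,
# not the structure from the original question.
SAMPLE = """<table>
<tr><td class="label">Name</td><td class="value">Widget</td></tr>
<tr><td class="label">Price</td><td class="value">9.99</td></tr>
</table>"""

# One pattern per row: capture the label cell and the value cell.
# This only works while every document follows exactly this layout.
ROW_RE = re.compile(
    r'<td class="label">([^<]*)</td>\s*<td class="value">([^<]*)</td>'
)

def extract_fields(html):
    """Return a dict mapping each label to its value."""
    return {label.strip(): value.strip()
            for label, value in ROW_RE.findall(html)}

fields = extract_fields(SAMPLE)
print(fields)
```

Compiling the pattern once and reusing it across all 50000 files keeps the per-document cost low, which is the main attraction of this approach.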

However, if the HTML documents can contain just about anything, including comments and attribute values whose content looks like HTML, you need a full parser. You first have to parse the HTML, then walk the resulting structure, looking for the table that contains your data. This may be hard: a document could contain hundreds of tables, and you'll have to find the right one.
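For the parser route, here is a sketch using Python's standard `html.parser` module. The `SAMPLE` document, the `TableCollector` class, and the idea of recognising the right table by its header row are assumptions made for the example; your documents may need a different test for "the right one". Note that the comment and the `alt` attribute in the sample both contain text that looks like table markup, which a naive regex could easily match.

```python
from html.parser import HTMLParser

# Hypothetical document: a comment and an attribute value that both look
# like table markup, a decoy table, and the table we actually want.
SAMPLE = """<html><body>
<!-- <table><tr><td>Name</td><td>Fake</td></tr></table> -->
<table><tr><td>Menu</td><td>Home</td></tr></table>
<img alt="<td>Price</td>" src="x.png">
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Widget</td><td>9.99</td></tr>
</table>
</body></html>"""

class TableCollector(HTMLParser):
    """Collect every table as a list of rows, each row a list of cell texts.

    Simplification: a cell that itself contains a nested table loses its text.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._stack = []   # tables that are currently open
        self._cell = None  # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append([])
        elif tag == "tr" and self._stack:
            self._stack[-1].append([])
        elif tag in ("td", "th") and self._stack and self._stack[-1]:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._stack[-1][-1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "table" and self._stack:
            self.tables.append(self._stack.pop())

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def find_table(html, header):
    """Return the first table whose first row equals `header`, else None."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    for table in parser.tables:
        if table and table[0] == header:
            return table
    return None

rows = find_table(SAMPLE, ["Name", "Price"])
print(rows)
```

The parser hands comments and attribute values to separate callbacks, so only real table elements end up in `tables`. The cost is that every document is fully tokenised, which will be slower than a few regular expressions over 50000 files.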
