PerlMonks  

Re: Ignoring not well-formed (invalid token) errors

by Laurent_R (Canon)
on Jan 19, 2015 at 07:24 UTC ( [id://1113708]=note )


in reply to Ignoring not well-formed (invalid token) errors

If the error always shows the same pattern, maybe you could preprocess the file to remove the offending line(s). I know that the idea of preprocessing 13 GB is not very attractive, but sometimes you have to bite the bullet.
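
A minimal sketch of that kind of preprocessing pass, written in Python purely for illustration. The thread never shows the offending input, so the `OFFENDING` pattern (a bare `&` that does not start an entity reference) and the names `drop_offending_lines` and `preprocess` are assumptions, not something taken from the original question. The file is read line by line, so memory use stays flat regardless of file size, provided the file actually contains line breaks.

```python
import re
from typing import Iterable, Iterator, Pattern

# Hypothetical pattern: a bare "&" that is not the start of an entity such as
# "&amp;" or "&#38;". Replace it with whatever your parser actually trips on.
OFFENDING = re.compile(rb"&(?![A-Za-z#][A-Za-z0-9]*;)")


def drop_offending_lines(
    lines: Iterable[bytes], pattern: Pattern[bytes]
) -> Iterator[bytes]:
    """Yield only the lines that do not match the offending pattern."""
    for line in lines:
        if not pattern.search(line):
            yield line


def preprocess(src_path: str, dst_path: str,
               pattern: Pattern[bytes] = OFFENDING) -> int:
    """Stream src_path to dst_path, skipping offending lines.

    Returns the number of lines written. Binary mode is used so that
    invalid byte sequences in the input cannot raise decoding errors.
    """
    kept = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in drop_offending_lines(src, pattern):
            dst.write(line)
            kept += 1
    return kept
```

Calling `preprocess("dump.xml", "dump.clean.xml")` (both file names are placeholders) would write a filtered copy that can then be handed to the XML parser. Dropping whole lines is only safe if an element never spans the removed line; otherwise the cleaned file may still be malformed.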

Je suis Charlie.

Re^2: Ignoring not well-formed (invalid token) errors
by bitingduck (Chaplain) on Jan 19, 2015 at 08:15 UTC

    If you're looking for a simple pattern, it might be doable in a reasonable amount of time. There are extractors for the Open Directory Project and Wikipedia dumps, both of which are in the many-GB range, that process the data very quickly, even on relatively old machines. Some 10 years ago I was pulling all of the music content out of ODP in under a few minutes on a Mac laptop that was reasonably current at the time. I don't recall how long it took to pull all the music topics out of Wikipedia, but I think it was quite reasonable.
