Beefy Boxes and Bandwidth Generously Provided by pair Networks RobOMonk
Problems? Is your data what you think it is?
 
PerlMonks  

Re: Handling very big gz.Z files

by mbethke (Hermit)
on Feb 07, 2013 at 05:07 UTC ( #1017555=note: print w/ replies, xml ) Need Help??


in reply to Handling very big gz.Z files

Hi albascura,
your problem is in this line:

my $dom = Mojo::DOM->new->parse(scalar do { local $/; <STDIN> });

You're just exhausting your memory there---slurping the whole file that has a couple of gigabytes uncompressed is already likely to bring common desktops to their limit, and then building the DOM on that will fail on anything but the biggest irons.

To fix this, you'll have to say goodbye to the convenient all-in-memory DOM, but luckily XML::Twig allows for almost the same convenience with almost the low memory consumption of an XML stream parser like XML::SAX. Something like this should do it:

use XML::Twig; use IO::Handle; my $stdin = IO::Handle->new(); $stdin->fdopen(fileno(STDIN),"r") or die "fdopen STDIN: $!"; XML::Twig->new( twig_handlers => { 's' => \&process_sentence } )->safe_parse($stdin); sub process_section { my( $t, $elem) = @_; ... $elem->purge; # don't want to print the original text }
$t->text in the process_section callback returns the sentence text so it should be equivalent to $chunk in your for my $chunk ( $dom->find('s')->each ) { loop except that it doesn't include the start/end tags and drops any tags that might appear within the sentence.

I used to work with the BNC at university but almost always via SARA so I can't remember specifics about its markup. ISTR that they used some weird entities though so maybe you have to use the keep_encoding< code> option to <code>new()


Comment on Re: Handling very big gz.Z files
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1017555]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (8)
As of 2014-04-23 19:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (554 votes), past polls