Re: Handling very big gz.Z files

by mbethke (Hermit)
on Feb 07, 2013 at 05:07 UTC

in reply to Handling very big gz.Z files

Hi albascura,
your problem is in this line:

my $dom = Mojo::DOM->new->parse(scalar do { local $/; <STDIN> });

You're just exhausting your memory there. Slurping the whole file, which is a couple of gigabytes uncompressed, is already likely to bring a common desktop to its limit, and building the DOM on top of that will fail on anything but the biggest iron.

To fix this, you'll have to say goodbye to the convenient all-in-memory DOM, but luckily XML::Twig offers almost the same convenience with memory consumption nearly as low as that of an XML stream parser like XML::SAX. Something like this should do it:

use XML::Twig;
use IO::Handle;

my $stdin = IO::Handle->new();
$stdin->fdopen(fileno(STDIN), "r") or die "fdopen STDIN: $!";

XML::Twig->new(
    twig_handlers => { 's' => \&process_sentence }
)->safe_parse($stdin) or die "parse failed: $@";

# called once for every completely parsed <s> element
sub process_sentence {
    my( $t, $elem) = @_;
    ...
    $elem->purge; # free memory; unlike flush, this doesn't print the original text
}

$elem->text in the process_sentence callback returns the sentence text, so it should be equivalent to $chunk in your for my $chunk ( $dom->find('s')->each ) { loop, except that it doesn't include the start/end tags and drops any tags that might appear within the sentence.
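
To make the callback pattern above concrete, here is a minimal, self-contained sketch that runs on a tiny in-memory document instead of STDIN. The sample XML, the <w> tag, and the @sentences array are made up for illustration and are not taken from the BNC or from your script; it also calls purge on the twig object rather than on the element.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for the corpus; the real input would be the
# decompressed stream arriving on STDIN.
my $xml = <<'XML';
<text>
  <s n="1">The <w>first</w> sentence.</s>
  <s n="2">A second one.</s>
</text>
XML

my @sentences;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <s> element has been parsed.
        s => sub {
            my ($t, $elem) = @_;
            push @sentences, $elem->text;   # text only, inner tags dropped
            $t->purge;                      # discard what has been parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @sentences;
```

Because each sentence is copied out and the twig is purged inside the handler, memory use should stay roughly bounded by the largest single <s> element rather than by the size of the whole file.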

I used to work with the BNC at university but almost always via SARA so I can't remember specifics about its markup. ISTR that they used some weird entities though, so maybe you have to use the keep_encoding option to new().
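
If it turns out to be needed, the option is passed to the constructor next to the handlers. This is only a sketch on a made-up one-sentence document; whether keep_encoding actually helps with the BNC's entities is an assumption that has to be checked against the real data and the XML::Twig documentation.

```perl
use strict;
use warnings;
use XML::Twig;

my @sentences;

my $twig = XML::Twig->new(
    keep_encoding => 1,   # ask XML::Twig to keep the original encoding and entities
    twig_handlers => {
        s => sub {
            my ($t, $elem) = @_;
            push @sentences, $elem->text;
            $t->purge;
        },
    },
);

# Hypothetical sample input, not BNC markup.
$twig->parse('<text><s>Fish &amp; chips</s></text>');

print "$_\n" for @sentences;
```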
