PerlMonks  

Re: Handling very big gz.Z files

by mbethke (Hermit)
on Feb 07, 2013 at 05:07 UTC


in reply to Handling very big gz.Z files

Hi albascura,
your problem is in this line:

my $dom = Mojo::DOM->new->parse(scalar do { local $/; <STDIN> });

You're simply exhausting your memory there. Slurping the whole file, which is a couple of gigabytes uncompressed, is already likely to push a common desktop to its limit, and building the DOM on top of that will fail on anything but the biggest iron.

To fix this, you'll have to say goodbye to the convenient all-in-memory DOM, but luckily XML::Twig offers almost the same convenience with nearly the low memory consumption of an XML stream parser like XML::SAX. Something like this should do it:

use XML::Twig;
use IO::Handle;

# Wrap STDIN in an IO::Handle so the parser reads it as a stream.
my $stdin = IO::Handle->new();
$stdin->fdopen(fileno(STDIN), "r") or die "fdopen STDIN: $!";

XML::Twig->new(
    twig_handlers => { 's' => \&process_sentence },
)->safe_parse($stdin) or die "parse failed: $@";

# Called once for each complete <s> element.
sub process_sentence {
    my ($t, $elem) = @_;
    ...
    $t->purge;    # don't want to print the original text
}

$elem->text in the process_sentence callback returns the sentence text, so it should be equivalent to $chunk in your for my $chunk ( $dom->find('s')->each ) { loop, except that it doesn't include the start/end tags and drops any tags that might appear within the sentence.
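To see what the handler receives, here is a minimal self-contained sketch that parses a tiny document from a string instead of STDIN. The sample markup (the hi tag and the n attribute) is made up for illustration and is not taken from the BNC, and collecting the text into @sentences is just one possible thing to do in the callback.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input; real BNC markup will differ.
my $xml = '<doc><s n="1">The <hi>quick</hi> fox</s><s n="2">Done</s></doc>';

my @sentences;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <s> element has been parsed.
        s => sub {
            my ($t, $elem) = @_;
            push @sentences, $elem->text;    # text only, inner tags dropped
            $t->purge;                       # release what has been parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @sentences;
```

Because the twig is purged after every sentence, memory use should stay roughly proportional to the largest single s element rather than to the whole file.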

I used to work with the BNC at university, but almost always via SARA, so I can't remember specifics about its markup. I seem to recall that they used some weird entities, though, so you may have to pass the keep_encoding option to new().
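For reference, this sketch only shows where that option would go. According to the XML::Twig documentation, keep_encoding keeps the original encoding and the original entities in the strings instead of converting everything to UTF-8. The sample document is invented, and whether the option actually helps with the BNC's entities is untested here.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical one-sentence document containing a predefined XML entity.
my $xml = '<doc><s>fish &amp; chips</s></doc>';

my @seen;

XML::Twig->new(
    keep_encoding => 1,    # assumption: worth trying if entities cause trouble
    twig_handlers => {
        s => sub {
            my ($t, $elem) = @_;
            push @seen, $elem->text;
            $t->purge;
        },
    },
)->parse($xml);

print "$_\n" for @seen;
```

Running it once with and once without keep_encoding on a small sample of the real data is probably the quickest way to see whether it makes a difference.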
