Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Hi albascura,
your problem is in this line:

my $dom = Mojo::DOM->new->parse(scalar do { local $/; <STDIN> });

You're just exhausting your memory there---slurping the whole file that has a couple of gigabytes uncompressed is already likely to bring common desktops to their limit, and then building the DOM on that will fail on anything but the biggest irons.

To fix this, you'll have to say goodbye to the convenient all-in-memory DOM, but luckily XML::Twig allows for almost the same convenience with almost the low memory consumption of an XML stream parser like XML::SAX. Something like this should do it:

use XML::Twig; use IO::Handle; my $stdin = IO::Handle->new(); $stdin->fdopen(fileno(STDIN),"r") or die "fdopen STDIN: $!"; XML::Twig->new( twig_handlers => { 's' => \&process_sentence } )->safe_parse($stdin); sub process_section { my( $t, $elem) = @_; ... $elem->purge; # don't want to print the original text }
$t->text in the process_section callback returns the sentence text so it should be equivalent to $chunk in your for my $chunk ( $dom->find('s')->each ) { loop except that it doesn't include the start/end tags and drops any tags that might appear within the sentence.

I used to work with the BNC at university but almost always via SARA so I can't remember specifics about its markup. ISTR that they used some weird entities though so maybe you have to use the keep_encoding< code> option to <code>new()


In reply to Re: Handling very big gz.Z files by mbethke
in thread Handling very big gz.Z files by albascura

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others examining the Monastery: (14)
    As of 2014-07-22 18:21 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite superfluous repetitious redundant duplicative phrase is:









      Results (124 votes), past polls