
Hi albascura,
your problem is in this line:

my $dom = Mojo::DOM->new->parse(scalar do { local $/; <STDIN> });

You're just exhausting your memory there. Slurping the whole file, which is a couple of gigabytes uncompressed, is already likely to bring common desktops to their limit, and building the DOM on top of that will fail on anything but the biggest iron.

To fix this, you'll have to say goodbye to the convenient all-in-memory DOM, but luckily XML::Twig allows for almost the same convenience with almost the low memory consumption of an XML stream parser like XML::SAX. Something like this should do it:

use XML::Twig;
use IO::Handle;

# Wrap STDIN in an IO::Handle so the parser can read it as a stream
my $stdin = IO::Handle->new();
$stdin->fdopen(fileno(STDIN), "r") or die "fdopen STDIN: $!";

# Call process_sentence once for every complete <s> element
XML::Twig->new(
    twig_handlers => { 's' => \&process_sentence }
)->safe_parse($stdin) or die "parse failed: $@";

sub process_sentence {
    my ($t, $elem) = @_;
    ...
    $t->purge;   # don't want to print the original text
}

$elem->text in the process_sentence callback returns the sentence text, so it should be equivalent to $chunk in your for my $chunk ( $dom->find('s')->each ) { loop, except that it doesn't include the start/end tags and drops any tags that might appear within the sentence.
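To see what the handler receives without feeding it the real corpus, here is a minimal sketch that runs the same approach on a tiny in-memory document. The sample markup and the @sentences array are made up for illustration and are not taken from the BNC; only the s element name comes from your code.

```perl
use strict;
use warnings;
use XML::Twig;

# Tiny stand-in for the corpus; this markup is invented for the demo.
my $xml = <<'XML';
<doc>
  <s n="1">The <w>quick</w> fox.</s>
  <s n="2">Second sentence.</s>
</doc>
XML

my @sentences;   # hypothetical collector, just to inspect the results

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <s> element has been parsed
        s => sub {
            my ($t, $elem) = @_;
            push @sentences, $elem->text;   # text only, inner tags dropped
            $t->purge;                      # release everything parsed so far
        },
    },
);

# safe_parse traps parser errors and leaves the message in $@
$twig->safe_parse($xml) or die "parse failed: $@";

print "$_\n" for @sentences;
```

Because the handler purges after every sentence, memory use should stay roughly proportional to the largest single s element rather than to the whole file.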

I used to work with the BNC at university, but almost always via SARA, so I can't remember specifics about its markup. I seem to recall that they used some weird entities though, so you may have to pass the keep_encoding option to new().

In reply to Re: Handling very big gz.Z files by mbethke
in thread Handling very big gz.Z files by albascura
