Slurping a large (>65 gb) into buffer

by downer (Monk)
on Oct 01, 2007 at 14:38 UTC (#641903)

downer has asked for the wisdom of the Perl Monks concerning the following question:

This question concerns the processing of a large file. The file consists of many HTML pages, with a certain line separating the pages.

I would like to slurp a large amount of this data into memory. With the data in memory, I'd like to process all the complete pages available there. If only a partial page is left at the end, I would like to slurp more into the buffer, completing that page's information, and repeat the process.

Sounds easy, but my programming is so poor. I beg the Perl Monks for guidance!
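
Roughly what I have in mind is the sketch below (only a sketch; the separator line, chunk size, file name, and process_page() are placeholders I made up):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $sep      = "---PAGE SEPARATOR---\n";    # placeholder for the real separator line
    my $chunk_sz = 64 * 1024 * 1024;            # read 64 MB at a time (placeholder)
    my $buffer   = '';

    open my $fh, '<', 'pages.html' or die "open pages.html: $!";   # placeholder file name
    while ( read( $fh, my $chunk, $chunk_sz ) ) {
        $buffer .= $chunk;
        # process every complete page currently in the buffer ...
        while ( ( my $pos = index( $buffer, $sep ) ) >= 0 ) {
            my $page = substr( $buffer, 0, $pos );
            process_page( $page );                          # placeholder for the real work
            substr( $buffer, 0, $pos + length $sep ) = '';  # drop the page and its separator
        }
        # ... whatever is left in $buffer is a partial page, kept for the next read
    }
    process_page( $buffer ) if length $buffer;              # the final, possibly separator-less, page
    close $fh;

    sub process_page { my ($html) = @_; }                   # stub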

Replies are listed 'Best First'.
Re: Slurping a large (>65 gb) into buffer
by svenXY (Deacon) on Oct 01, 2007 at 14:58 UTC
    Hi,
    I'd use the $/ variable (a.k.a. $INPUT_RECORD_SEPARATOR), set it to the line that separates the HTML pages, and read one "page" at a time.
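    For example (only a sketch; the separator text and file name are placeholders):

    # sketch only -- adjust the separator and file name to match the real data
    $/ = "---this line is the separator---\n";
    open my $fh, '<', 'pages.html' or die "open pages.html: $!";
    while ( my $page = <$fh> ) {
        chomp $page;          # chomp removes $/, i.e. the separator line
        # process one complete HTML page in $page here
    }
    close $fh;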

    Regards,
    svenXY

    PS: who on earth puts 65 GB of HTML pages in one file?
      Not sure if this would be an acceptable solution, or rather the start of one, but if you have some process which reads all that HTML into a file (say, off the web), you could dump the pages using Storable and save some disk space, and perhaps some processing time when reading them back.
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Storable;
      use Data::Dumper;

      $/ = '---this line is the separator---';
      open my $html, '<', 'test.html' or die "unable to open test.html: $!\n";
      my %PAGE;
      my $pagecount = 0;
      while (<$html>) {
          s#\Q$/\E##;               # strip the separator line
          $pagecount++;
          $PAGE{$pagecount} = $_;
          #print "Data to process: \n$_\n";
      }
      store \%PAGE, 'Test.file';    # write all pages out once, after the loop
      my $pages = retrieve('Test.file');
      print Dumper $pages;
      Ted
      --
      "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
        --Ralph Waldo Emerson
Re: Slurping a large (>65 gb) into buffer
by kyle (Abbot) on Oct 01, 2007 at 15:01 UTC

    If the line separating the pages is always the same, you could set $/, aka $INPUT_RECORD_SEPARATOR (see perlvar) and read a page at a time pretty easily.

    $/ = "----- PAGE SEPARATAAAR -----\n"; while (<>) { chomp; # $_ now contains the HTML page }

    If the separator is something you'd have to match with a regex, you could read a line at a time and detect page boundaries.

    my $page = '';
    while (<>) {
        if ( /xxx PAGE BOUNDARY \d+ xxx/ ) {
            output( $page );
            $page = '';
            next;
        }
        $page .= $_;
    }
    output( $page ) if length $page;    # don't forget the final page
Re: Slurping a large (>65 gb) into buffer
by perrin (Chancellor) on Oct 01, 2007 at 15:18 UTC
Re: Slurping a large (>65 gb) into buffer
by vkon (Curate) on Oct 01, 2007 at 15:14 UTC
    Do you mean 65 GB, or is it actually 65 MB?

    65 GB is too large to be reasonable, and in that case I would advise some multi-pass approach to the task.
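    One shape such a multi-pass approach could take (this is only my guess at what is meant, and the separator and file name below are placeholders): a first pass that records where each page starts, and a second pass that seeks back and reads only the pages it wants.

    # two-pass sketch: index page offsets first, then revisit them
    use strict;
    use warnings;

    my $file = 'pages.html';                          # placeholder file name
    $/ = "---this line is the separator---\n";        # placeholder separator

    # pass 1: remember the byte offset where each page starts
    open my $fh, '<', $file or die "open $file: $!";
    my @offsets = (0);
    while (<$fh>) {
        push @offsets, tell $fh;                      # start of the *next* page
    }
    pop @offsets;                                     # the last tell() is end-of-file

    # pass 2: seek back to any page and read just that one record
    for my $off (@offsets) {
        seek $fh, $off, 0 or die "seek: $!";
        my $page = <$fh>;                             # reads up to the next separator
        chomp $page;
        # process $page here
    }
    close $fh;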

    But if it is actually 65 MB, then process it all at once.

Re: Slurping a large (>65 gb) into buffer
by Anonymous Monk on Oct 01, 2007 at 15:40 UTC
    65 Gigabytes? I'm sorry, but 1) is that even possible with common operating systems? and 2) if you've got files that large, have you considered that, except for content like movies, this may not be the most rational way to store them?

      I'm just answering the "is that even possible with common operating systems" part: yes, and it has been for some time. The question really isn't the operating system as a whole but whether the filesystem can handle really big files (>32 GB). IIRC, JFS, XFS and ZFS can all handle files orders of magnitude larger than 32 GB.

      Jason L. Froebe

      Blog, Tech Blog

Re: Slurping a large (>65 gb) into buffer
by aquarium (Curate) on Oct 01, 2007 at 15:03 UTC
    Pardon my ignorance... but why the need to slurp in clusters of HTML pages? Is that meant to increase efficiency somehow, or is there some other requirement for this multi-page read?
    the hardest line to type correctly is: stty erase ^H
      Yes, 65 GB. I downloaded this data from a respected source; now I am trying to incrementally get as much as I can into memory, process it, and get some more. I think setting the record separator will be useful.
        You didn't answer aquarium's question, which was: why do you plan to load more than a page at a time?
        Yes, 65 GB. I downloaded this data from a respected source
        /me .oO( Hugh Hefner )
