http://www.perlmonks.org?node_id=641910


in reply to Slurping a large (>65 gb) into buffer

Hi,
I'd set the $/ variable (a.k.a. $INPUT_RECORD_SEPARATOR) to the line that separates the html pages and read one "page" at a time:
#!/usr/bin/perl
use strict;
use warnings;

$/ = '---this line is the separator---';

while (<DATA>) {
    $_ =~ s#$/##;    # strip the separator line
    print "Data to process: \n$_\n";
}

__DATA__
<html>
<body>file 1</body>
</html>
---this line is the separator---
<html>
<body>file 2</body>
</html>
---this line is the separator---
<html>
<body>file 3</body>
</html>
---this line is the separator---
<html>
<body>file 4</body>
</html>
---this line is the separator---
<html>
<body>file 5</body>
</html>
---this line is the separator---
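The same idea applies to the real file instead of __DATA__. Here's a minimal sketch (the filename huge.html is just a placeholder, and \Q...\E guards the substitution in case a separator string ever contains regex metacharacters):

#!/usr/bin/perl
use strict;
use warnings;

$/ = '---this line is the separator---';

# 'huge.html' is a hypothetical name for the real 65 GB file
open my $fh, '<', 'huge.html' or die "unable to open huge.html: $!\n";
while (my $page = <$fh>) {
    $page =~ s#\Q$/\E##;    # \Q...\E quotes any metacharacters in the separator
    # only one page is ever held in memory at a time
    print "Data to process: \n$page\n";
}
close $fh;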

Regards,
svenXY

PS: who on earth puts 65 GB of HTML pages in one file?

Re^2: Slurping a large (>65 gb) into buffer
by tcf03 (Deacon) on Oct 01, 2007 at 16:26 UTC
    Not sure if this would be an acceptable solution (or rather the start of one), but if you have some process that reads all that html into a file, say, off the web, you could dump the pages using Storable and save some disk space, and perhaps some processing time when reading them back.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable;
    use Data::Dumper;

    $/ = '---this line is the separator---';
    open my $html, '<', 'test.html' or die "unable to open test.html: $!\n";

    my %PAGE;
    my $pagecount = 0;
    while (<$html>) {
        $_ =~ s#$/##;    # strip the separator line
        $pagecount++;
        $PAGE{$pagecount} = $_;
        #print "Data to process: \n$_\n";
    }
    store \%PAGE, "Test.file";    # write once, after all pages are collected

    my $pages = retrieve('Test.file');
    print Dumper $pages;
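    One caveat: with 65 GB of pages, %PAGE itself won't fit in memory. A minimal sketch of a variant that writes each page to its own Storable file instead (the page_N.sto naming is just an assumption for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable;

    $/ = '---this line is the separator---';
    open my $html, '<', 'test.html' or die "unable to open test.html: $!\n";

    my $pagecount = 0;
    while (<$html>) {
        chomp;    # chomp removes $/, i.e. the separator line
        $pagecount++;
        store \$_, "page_$pagecount.sto";    # one small file per page (hypothetical naming)
    }
    close $html;

    # later, retrieve a single page without touching the rest:
    my $page = retrieve('page_1.sto');
    print $$page;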
    Ted
    --
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson