Re: Slurping a large (>65 gb) into buffer
by svenXY (Deacon) on Oct 01, 2007 at 14:58 UTC
Hi,
I'd set the $/ variable (a.k.a. $INPUT_RECORD_SEPARATOR) to the line that separates the HTML pages and read one "page" at a time:
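Something like this, for example (a minimal sketch; the separator text and filename here are just placeholders for whatever your file actually uses):
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder separator -- use the literal line that divides your pages.
$/ = "---this line is the separator---\n";

open my $html, '<', 'pages.html' or die "unable to open pages.html: $!\n";
while ( my $page = <$html> ) {
    chomp $page;    # strip the trailing separator line
    # ... process one HTML page in $page ...
}
close $html;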
Regards,
svenXY
PS: who on earth puts 65GB of HTML pages in one file?
Not sure if this would be an acceptable solution, or rather the start of one, but if you have some process that reads all that HTML into a file (say, off the web), you could dump the pages using Storable and save some disk space, and perhaps some processing time when reading it back.
#!/usr/bin/perl
use strict;
use warnings;
use Storable;
use Data::Dumper;

$/ = '---this line is the separator---';    # read one "page" per record
open my $html, '<', 'test.html' or die "unable to open test.html: $!\n";
my %PAGE;
my $pagecount = 0;
while (<$html>) {
    chomp;                                  # strip the separator from the record
    $pagecount++;
    $PAGE{$pagecount} = $_;
    #print "Data to process: \n$_\n";
}
close $html;
store \%PAGE, 'Test.file';                  # write the hash once, after the loop
my $pages = retrieve('Test.file');
print Dumper $pages;
Ted
--
"That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
--Ralph Waldo Emerson
Re: Slurping a large (>65 gb) into buffer
by kyle (Abbot) on Oct 01, 2007 at 15:01 UTC
If the line separating the pages is always the same, you could set $/, aka $INPUT_RECORD_SEPARATOR (see perlvar), and read a page at a time pretty easily.
$/ = "----- PAGE SEPARATAAAR -----\n";
while (<>) {
chomp;
# $_ now contains the HTML page
}
If the separator is something you'd have to match with a regex, you could read a line at a time and detect page boundaries.
my $page = '';
while (<>) {
    if ( /xxx PAGE BOUNDARY \d+ xxx/ ) {
        output( $page );
        $page = '';
        next;
    }
    $page .= $_;
}
output( $page ) if length $page;    # don't forget any page after the final boundary
Re: Slurping a large (>65 gb) into buffer
by perrin (Chancellor) on Oct 01, 2007 at 15:18 UTC
Re: Slurping a large (>65 gb) into buffer
by vkon (Curate) on Oct 01, 2007 at 15:14 UTC
Do you mean 65GB, or is it actually 65MB?
65GB is too large to be reasonable, and I would advise some multi-pass approach to the task in this case.
But if it is actually 65MB, then process it all at once.
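For what a multi-pass approach might look like, here is a rough two-pass sketch (the separator line, filename, and process_page() are placeholders, not anything from the original post): the first pass only records where each page starts, and the second pass seeks back and handles one page at a time.
#!/usr/bin/perl
use strict;
use warnings;

my $sep = "---this line is the separator---\n";    # placeholder separator line

# Pass 1: record the byte offset at which each page starts.
open my $fh, '<', 'pages.html' or die "open: $!\n";
my @offsets = (0);
while (<$fh>) {
    push @offsets, tell($fh) if $_ eq $sep;
}

# Pass 2: seek to one page at a time and process it on its own.
$/ = $sep;
for my $start (@offsets) {
    seek $fh, $start, 0 or die "seek: $!\n";
    my $page = <$fh>;
    last unless defined $page;
    chomp $page;                  # strip the trailing separator
    # process_page($page);        # placeholder for the real per-page work
}
close $fh;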
Re: Slurping a large (>65 gb) into buffer
by Anonymous Monk on Oct 01, 2007 at 15:40 UTC
65 Gigabytes?
I'm sorry, but 1) is that even possible with common operating systems? and 2) if you've got files that large, have you considered that, except for content like movies, this may not be the most rational way to store them?
I'm just answering the "is that even possible with common operating systems" part: yes, and it has been for some time. The question really isn't about the operating system as a whole but rather "can the filesystem handle really big files (>32GB)?" IIRC, JFS, XFS, and ZFS can all handle files orders of magnitude larger than 32GB.
Re: Slurping a large (>65 gb) into buffer
by aquarium (Curate) on Oct 01, 2007 at 15:03 UTC
Yes, 65 GB. I downloaded this data from a respected source; now I am trying to incrementally get as much as I can into memory, process it, and get some more. I think setting the record separator will be useful.
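A rough sketch of that kind of incremental processing, building on the record-separator suggestions above (the separator text, filename, batch size, and process_batch() are all placeholders):
#!/usr/bin/perl
use strict;
use warnings;

my $batch_size = 1000;                      # tune to available memory
$/ = "---this line is the separator---\n";  # placeholder separator line

open my $fh, '<', 'pages.html' or die "open: $!\n";
my @batch;
while ( my $page = <$fh> ) {
    chomp $page;                  # strip the trailing separator
    push @batch, $page;
    if ( @batch >= $batch_size ) {
        process_batch(@batch);    # placeholder for the real processing
        @batch = ();
    }
}
process_batch(@batch) if @batch;  # handle the final partial batch
close $fh;

sub process_batch {
    my @pages = @_;
    # ... do whatever is needed with this chunk of pages ...
}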
You didn't answer aquarium's question, which was: why do you plan to load more than a page at a time?
Yes, 65 GB. I downloaded this data from a respected source
/me .oO( Hugh Hefner )