Re: Slurping a large (>65 gb) into buffer

by svenXY (Deacon)
on Oct 01, 2007 at 14:58 UTC ( [id://641910] )

in reply to Slurping a large (>65 gb) into buffer

I'd set $/ (a.k.a. $INPUT_RECORD_SEPARATOR) to the line that separates the HTML pages and read one "page" at a time:

#!/usr/bin/perl
use strict;
use warnings;

# Treat the separator line, not "\n", as the end of each record.
$/ = '---this line is the separator---';

while (<DATA>) {
    chomp;    # strip the trailing separator (chomp removes $/)
    print "Data to process: \n$_\n";
}

__DATA__
<html>
<body>file 1</body>
</html>
---this line is the separator---
<html>
<body>file 2</body>
</html>
---this line is the separator---
<html>
<body>file 3</body>
</html>
---this line is the separator---
<html>
<body>file 4</body>
</html>
---this line is the separator---
<html>
<body>file 5</body>
</html>
---this line is the separator---
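
For a real file rather than __DATA__, the same idea might be wrapped up roughly like this. It is only a sketch: for_each_page is a made-up helper name, and the shortened '---sep---' separator and the in-memory filehandle are stand-ins for the demo. The newline is included in the separator so that chomp removes the whole separator line, and $/ is localized so the change does not leak into the rest of the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): call $callback once per
# record read from $fh, with the trailing separator already removed.
sub for_each_page {
    my ($fh, $separator, $callback) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    my $count = 0;
    while (my $page = <$fh>) {
        chomp $page;                  # chomp removes a trailing $/
        next unless $page =~ /\S/;    # skip any whitespace-only leftover
        $callback->($page);
        $count++;
    }
    return $count;
}

# Demo on an in-memory filehandle; a real run would open the big file instead.
# The sample data and the '---sep---' separator are assumptions for the demo.
my $sample = "<html>file 1</html>\n---sep---\n<html>file 2</html>\n---sep---\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $n = for_each_page($fh, "---sep---\n", sub { print "Page: $_[0]" });
close $fh;
print "Processed $n pages\n";
```

With the real file you would open it with the three-argument form of open and pass that handle in. Only one page is held in memory at a time, which is the whole point with 65GB of input.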


PS: who on earth puts 65GB of HTML pages in one file?

Re^2: Slurping a large (>65 gb) into buffer
by tcf03 (Deacon) on Oct 01, 2007 at 16:26 UTC
    Not sure if this would be an acceptable solution, or rather the start of one, but if you have some process that reads all that HTML into a file (say, off the web), you could dump the pages using Storable instead and save some disk space, and perhaps some processing time when reading them back.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable;
    use Data::Dumper;

    $/ = '---this line is the separator---';

    open my $html, '<', 'test.html'
        or die "unable to open test.html: $!\n";

    my %PAGE;
    my $pagecount = 0;
    while (<$html>) {
        chomp;    # strip the separator line (chomp removes $/)
        $pagecount++;
        $PAGE{$pagecount} = $_;
        #print "Data to process: \n$_\n";
    }
    close $html;

    # Write the hash once, after the loop, rather than on every iteration.
    store \%PAGE, "Test.file";

    my $pages = retrieve('Test.file');
    print Dumper $pages;
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson
