PerlMonks  

Re: Slurping a large (>65 gb) into buffer

by svenXY (Deacon)
on Oct 01, 2007 at 14:58 UTC (#641910)


in reply to Slurping a large (>65 gb) into buffer

Hi,
I'd set $/ (a.k.a. $INPUT_RECORD_SEPARATOR under "use English") to the line that separates the HTML pages and read one "page" at a time:
#!/usr/bin/perl
use strict;
use warnings;

$/ = '---this line is the separator---';

while (<DATA>) {
    chomp;               # chomp removes a trailing $/, i.e. the separator line
    next unless /\S/;    # skip the empty record left after the last separator
    print "Data to process: \n$_\n";
}

__DATA__
<html>
<body>file 1</body>
</html>
---this line is the separator---
<html>
<body>file 2</body>
</html>
---this line is the separator---
<html>
<body>file 3</body>
</html>
---this line is the separator---
<html>
<body>file 4</body>
</html>
---this line is the separator---
<html>
<body>file 5</body>
</html>
---this line is the separator---
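
For a file too large to hold in memory, the same idea can be wrapped in a small routine that reads from a real file handle and hands each page to a callback. This is only a rough sketch: the name for_each_page, the callback interface, and taking the path from the command line are all assumptions made for illustration, not something from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed separator; replace with whatever line really divides the pages.
my $SEPARATOR = '---this line is the separator---';

# Hypothetical helper: calls $callback once per page and returns the page
# count. Only one page is held in memory at a time.
sub for_each_page {
    my ($path, $callback) = @_;
    open my $fh, '<', $path or die "unable to open $path: $!\n";
    local $/ = $SEPARATOR;    # restored automatically when the sub returns
    my $count = 0;
    while (my $page = <$fh>) {
        chomp $page;                  # removes the trailing separator, if any
        $page =~ s/\A\n//;            # drop the newline that followed the previous separator
        next unless $page =~ /\S/;    # skip empty records
        $callback->($page);
        $count++;
    }
    close $fh;
    return $count;
}

# Usage (assumption): the file name is given as the first argument.
if (@ARGV) {
    my $n = for_each_page($ARGV[0], sub { print length($_[0]), " bytes\n" });
    print "$n pages\n";
}
```

Using local keeps the changed $/ from leaking into other code that reads line by line.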

Regards,
svenXY

PS: --who on earth puts 65GB of HTML pages in one file?

Replies are listed 'Best First'.
Re^2: Slurping a large (>65 gb) into buffer
by tcf03 (Deacon) on Oct 01, 2007 at 16:26 UTC
    Not sure if this would be an acceptable solution, or rather the start of one, but if you have some process that reads all that HTML into a file (say, off the web), you could dump the pages using Storable and save some disk space, and perhaps some processing time when reading them back.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable;
    use Data::Dumper;

    $/ = '---this line is the separator---';

    open my $html, '<', 'test.html'
        or die "unable to open test.html: $!\n";

    my %PAGE;
    my $pagecount = 0;
    while (<$html>) {
        chomp;    # chomp removes a trailing $/, i.e. the separator line
        $pagecount++;
        $PAGE{$pagecount} = $_;
        #print "Data to process: \n$_\n";
    }
    close $html;

    # Write the hash once, after the loop, instead of on every iteration
    store \%PAGE, "Test.file";

    my $pages = retrieve('Test.file');
    print Dumper $pages;
    Ted
    --
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson
