Reading a VERY LARGE file with SINGLE line as content!

biswanath_c has asked for the wisdom of the Perl Monks concerning the following question:

Re: Reading a VERY LARGE file with SINGLE line as content! by BrowserUk (Patriarch) on Jul 17, 2009 at 22:22 UTC
How are you going to process it? I.e., can you process it in smaller chunks than the whole file? If so, read it in smaller chunks using `read` (or by setting (say) `$/ = \4192;`).

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP PCW
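A minimal sketch of the `$/ = \N` approach, assuming the processing can be done chunk by chunk. The sub name `count_char_in_chunks` and its arguments are made up for illustration. It counts a single character, so nothing can straddle a chunk boundary; multi-byte patterns would need extra care.

```perl
use strict;
use warnings;

# Count occurrences of one character in a file, reading it in fixed-size
# records instead of as one enormous "line".
# (Hypothetical helper; not code from the original question.)
sub count_char_in_chunks {
    my ($path, $char, $chunk_size) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    # A reference to an integer makes <$fh> return records of at most
    # that many bytes.
    local $/ = \$chunk_size;
    my $count = 0;
    while (my $chunk = <$fh>) {
        my $hits = () = $chunk =~ /\Q$char\E/g;   # number of matches in this chunk
        $count += $hits;
    }
    close $fh;
    return $count;
}
```

Memory use stays at roughly one chunk regardless of the file size.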
Re: Reading a VERY LARGE file with SINGLE line as content! by JavaFan (Canon) on Jul 17, 2009 at 22:21 UTC
Well, it depends on what you need from that file. If you need the entire line at once, you'll need the entire line. If you just want chunks of, say, 1024 characters, read the file in 1024-character chunks, either by setting `$/ = \1024` or by using `sysread`. If you only need a small section of the file, and you know where it is, use `(sys)seek` to get there, then read in the amount of data you need.
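For the "small section at a known position" case, something along these lines might do. It is a sketch only: `read_slice` is a hypothetical name, and it uses the buffered `seek`/`read` pair rather than `sysseek`/`sysread` (the two families should not be mixed on one handle).

```perl
use strict;
use warnings;

# Return up to $length bytes starting at byte offset $offset, without
# reading the rest of the file. (Hypothetical helper for illustration.)
sub read_slice {
    my ($path, $offset, $length) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;
    seek($fh, $offset, 0) or die "Cannot seek in $path: $!";   # 0 = SEEK_SET
    my $got = read($fh, my $buf, $length);
    die "Read failed on $path: $!" unless defined $got;
    close $fh;
    return $buf;   # may be shorter than $length near end of file
}
```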
Re^2: Reading a VERY LARGE file with SINGLE line as content! by Marshall (Canon) on Jul 18, 2009 at 08:29 UTC
Here is a snippet from some code that I wrote "many moons ago". This code will run not just a "little bit" faster than a Win command line...the performance difference is HUGE, even with just 8K buffer. If the search in the buffer can be run with say just two of these 8K buffers, it will be very, very fast.

This is a copy routine, but same principle works for reading large files. showfailed() is a tricky thing that is sort of like die() and warn() and has a GUI display context. For the purpose here, it doesn't even matter.

    ##################
    # Binary File Copy and Append
    #
    # bcopy($output, @input_files);
    #
    # first element is the output file path,
    # then input files: file1, file2...file n.

    sub bcopy()
    {
        (my $out, my @in_list)=@_;
        open (OUTBIN, ">", "$out") || showfailed ("unable to open $out");
        binmode(OUTBIN) || showfailed ("unable to set binmode $out");
        foreach my $infile (@in_list)
        {
            open(INBIN, "<", "$infile") || showfailed ("unable to open $infile");
            binmode(INBIN) || showfailed ("unable to set binmode $infile");
            while (read(INBIN, my $buff, 8 * 2**10))
            {
                print OUTBIN $buff;
            }
            close(INBIN) || showfailed("unable to close $infile");
            print "$infile appended to $out\n";
        }
        close(OUTBIN) || showfailed("unable to close $out");
    }   #end of bcopy
Re^3: Reading a VERY LARGE file with SINGLE line as content! by BrowserUk (Patriarch) on Jul 18, 2009 at 09:25 UTC
> This code will run not just a "little bit" faster than a Win command line...the performance difference is HUGE, even with just 8K buffer.

That's some strange code and a big claim. I thought I'd test the claim, and my first attempt to call bcopy got:

    Too many arguments for main::bcopy at C:\test\junk8.pl line 34, near " +@in )"

Once I removed the useless prototype:

    #! perl -sw
    use 5.010;
    use strict;

    sub bcopy {
        (my $out, my @in_list)=@_;
        open (OUTBIN, ">", "$out") || showfailed ("unable to open $out");
        binmode(OUTBIN) || showfailed ("unable to set binmode $out");
        foreach my $infile (@in_list) {
            open(INBIN, "<", "$infile") || showfailed ("unable to open $infile");
            binmode(INBIN) || showfailed ("unable to set binmode $infile");
            while (read(INBIN, my $buff, 8 * 2*10)) {
                print OUTBIN $buff;
            }
            close(INBIN) || showfailed("unable to close $infile");
            print "$infile appended to $out\n";
        }
        close(OUTBIN) || showfailed("unable to close $out");
    }

    my @in = glob shift;
    my $out = shift;
    say time;
    bcopy( $out, @in );
    say time;

and ran it on 10x 128meg files:

    [10:11:26.67} C:\test>junk8 .jnk bigjnk.out
    1247908624
    bugjunk1.jnk appended to bigjnk.out
    bugjunk10.jnk appended to bigjnk.out
    bugjunk2.jnk appended to bigjnk.out
    bugjunk3.jnk appended to bigjnk.out
    bugjunk4.jnk appended to bigjnk.out
    bugjunk5.jnk appended to bigjnk.out
    bugjunk6.jnk appended to bigjnk.out
    bugjunk7.jnk appended to bigjnk.out
    bugjunk8.jnk appended to bigjnk.out
    bugjunk9.jnk appended to bigjnk.out
    1247908656

32 seconds. Then again with xcopy:

    [10:22:58.79} C:\test>xcopy /Y *.jnk bigjnk.out
    Does bigjnk.out specify a file name
    or directory name on the target
    (F = file, D = directory)? f
    C:bugjunk1.jnk
    C:bugjunk10.jnk
    C:bugjunk2.jnk
    C:bugjunk3.jnk
    C:bugjunk4.jnk
    C:bugjunk5.jnk
    C:bugjunk6.jnk
    C:bugjunk7.jnk
    C:bugjunk8.jnk
    C:bugjunk9.jnk
    10 File(s) copied

    [10:23:06.51} C:\test>

Even with the time it took me to respond to the dumb prompt, it took just 8 seconds.
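One detail worth checking when comparing the two listings in this sub-thread: the `read` length is written `8 * 2**10` in the first and `8 * 2*10` in the second. The thread does not say whether the timed script really used the second form or whether it is only a slip in the post. The arithmetic itself is easy to confirm:

```perl
use strict;
use warnings;

# ** binds tighter than *, so this is 8 * 1024.
my $with_power = 8 * 2**10;

# Plain multiplication, evaluated left to right: (8 * 2) * 10.
my $with_times = 8 * 2*10;

print "8 * 2**10 = $with_power bytes\n";   # 8192
print "8 * 2*10  = $with_times bytes\n";   # 160
```

A 160-byte buffer means far more `read` and `print` calls per file than an 8K one, so the two variants are not directly comparable.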
Re^4: Reading a VERY LARGE file with SINGLE line as content! by Marshall (Canon) on Jul 18, 2009 at 10:01 UTC
Re: Reading a VERY LARGE file with SINGLE line as content! by linuxer (Curate) on Jul 17, 2009 at 22:28 UTC
I don't see a problem with your code. I tried it on my machine and it worked fine. It just takes some time to read the data from disk, almost 10 seconds... What did you do to investigate this issue? Did you make sure that the script really hangs at that point? Maybe the data is read correctly and the script hangs in a loop afterwards? Right now, it is just guessing what might be happening in the background... Can you provide more details? You could also try to use read or sysread to read chunk by chunk from the filehandle. I think it will take more time, but it would try to read the data in "one run"...
Re: Reading a VERY LARGE file with SINGLE line as content! by Perlbotics (Archbishop) on Jul 17, 2009 at 22:24 UTC
That depends on what you would like to do with the single line. Is it text? Is it binary information? If you can process that line piecewise, then you could use sysread, for example. Binary information is accessible by unpack after reading the data piecewise. (Updated after JavaFan's comment.) A sliding window that is big enough to hold more characters or bytes than the biggest chunk of information your program needs to process might come in handy too. HTH
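The sliding-window idea could look roughly like the following. This is a sketch under assumptions: the "biggest chunk of information" is taken to be a fixed search string, `find_in_stream` is a made-up name, and the handle is read with buffered `read` rather than `sysread`.

```perl
use strict;
use warnings;

# Return the byte offset of the first occurrence of $needle in the stream,
# or -1 if it is absent. At most one chunk plus a short tail is held in
# memory, so a match spanning two chunks is still found.
# (Hypothetical helper for illustration.)
sub find_in_stream {
    my ($fh, $needle, $chunk_size) = @_;
    my $overlap = length($needle) - 1;   # bytes carried over between chunks
    my $window  = '';
    my $base    = 0;                     # file offset of the first byte of $window
    while (read($fh, my $chunk, $chunk_size)) {
        $window .= $chunk;
        my $pos = index($window, $needle);
        return $base + $pos if $pos >= 0;
        # Drop everything except the tail that could start a spanning match.
        my $drop = length($window) - $overlap;
        if ($drop > 0) {
            substr($window, 0, $drop, '');
            $base += $drop;
        }
    }
    return -1;
}
```

Called as `find_in_stream($fh, 'XY', 8192)` on a handle opened with `binmode`, it reports where the string first occurs.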
Re^2: Reading a VERY LARGE file with SINGLE line as content! by JavaFan (Canon) on Jul 17, 2009 at 22:28 UTC
I'm not sure how unpack is going to help you. To be able to use unpack, you'd first have to read in the data. And that's where the OP has the problem. And if the line read in is already too large, think how the memory usage will be if each couple of bytes is turned into a different SV.
Re: Reading a VERY LARGE file with SINGLE line as content! by Bloodnok (Vicar) on Jul 18, 2009 at 15:46 UTC
As an alternative to reading fixed length data blocks, does the source data comprise records separated by char(s) other than `\n`? If so, you could try setting `$/` to this value (thus modifying the input record separator) and using Marshall's suggestion.

A user level that continues to overstate my experience :-))
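A sketch of the record-separator approach, assuming the data does contain some delimiter. The `|` in the usage line is purely a placeholder, and `for_each_record` is a made-up name.

```perl
use strict;
use warnings;

# Call $callback once per record, where records are delimited by
# $separator rather than "\n". Only one record is held in memory at a time.
# (Hypothetical helper for illustration.)
sub for_each_record {
    my ($fh, $separator, $callback) = @_;
    local $/ = $separator;    # input record separator for this scope only
    while (my $record = <$fh>) {
        chomp $record;        # chomp strips the current value of $/
        $callback->($record);
    }
    return;
}
```

Usage would be along the lines of `for_each_record($fh, '|', sub { print "$_[0]\n" });`.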
Re^2: Reading a VERY LARGE file with SINGLE line as content! by Marshall (Canon) on Jul 22, 2009 at 03:19 UTC
I think you're right here. I suspect the OP's app gets into a disk thrashing mode due to lack of physical memory. To muck around with a 124 MB file, I would figure that a Win XP system needs at least 1 GB of physical memory. If the system has just 512 MB of memory, it will just "auger into the ground". I am a fan of taskinfo http://www.iarsn.com/taskinfo.html I don't get any "kickback" from this. I have used this program to solve some complex Windows problems. I don't mind giving Igor's program a recommendation because it works well.
Re: Reading a VERY LARGE file with SINGLE line as content! by Marshall (Canon) on Jul 18, 2009 at 00:19 UTC
I like this:

    open(DAT, $data_file) || die("Could not open file!");
    print "file opened successfully!! \n";

Now I would suggest:

    while (<DAT>)
    {
        # ...do something....
        # ...
        print if /$pattern/;    # simple first attempt
    }

The above will work as it processes <DAT> line by line and prints. Now you want a subset of the <DAT> file and start working on how to do that.
Re^2: Reading a VERY LARGE file with SINGLE line as content! by JavaFan (Canon) on Jul 18, 2009 at 00:36 UTC
That's not really any different from what the OP is doing, is it? The problem is, as explained by the OP, that the file only contains a single line. Reading that single line is a problem. Reading that single line from the guard of a while statement isn't suddenly going to fix it.
Re^3: Reading a VERY LARGE file with SINGLE line as content! by Marshall (Canon) on Jul 18, 2009 at 01:14 UTC
This is a misunderstanding. The disk system (hardware, micro-code on disk hardware, I/F board (hardware and also micro-code)), in conjunction with the O/S driver and O/S, does a lot. I have a prototype IBM drive that after 11 years is failing. I lost a Seagate and a WD drive last year during a massive power failure. ALL of these sub-systems fail. I am not saying different. The question appeared to be "how do I simulate a failure"? I tried to help with that question. All hard drives will fail. It is not "if". It is just "when". I tried to help simulate "when".

Update: OP says that he has 120+ MB of data and no "\n" line breaks. I don't believe that because it is so far out of the norm that a question is reasonable.
Re^4: Reading a VERY LARGE file with SINGLE line as content! by Bloodnok (Vicar) on Jul 18, 2009 at 15:35 UTC

