PerlMonks  

Reading a VERY LARGE file with SINGLE line as content!

by biswanath_c (Beadle)
on Jul 17, 2009 at 22:08 UTC ( #781194=perlquestion )
biswanath_c has asked for the wisdom of the Perl Monks concerning the following question:


Hi

I have a requirement wherein I have to read a 124 MB file (it could be even larger), but the file has ONLY ONE line. So, when I try to do this:
open(DAT, $data_file) || die("Could not open file!");
print "file opened successfully!! \n";
$data=<DAT>;

The last line of this code :
 $data=<DAT>;
hangs. The CPU of my machine shoots up to 74%. It does NOT reach 100%, but the script seems to just hang!


Can anyone suggest how I can handle this problem? I know that reading the file line by line is the best way, but in my case the WHOLE LARGE file has only one line! How can I solve this problem?


Thanks and regards

Biswanath

Re: Reading a VERY LARGE file with SINGLE line as content!
by JavaFan (Canon) on Jul 17, 2009 at 22:21 UTC
    Well, it depends on what you need from that file. If you need the entire line at once, you'll need the entire line. If you just want chunks of, say, 1024 characters, read in the file in 1024-character chunks, either by setting $/ = \1024 or by using sysread. If you only need a small section of the file, and you know where it is, use (sys)seek to get there, then read in the amount of data you need.
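A minimal sketch of the two approaches just described, fixed-size records via $/ and a seek followed by a read. The sub names for_each_chunk and read_section are made up for illustration, and the path, chunk size, offset and length are whatever the caller supplies; nothing here comes from the OP's actual script.

```perl
use strict;
use warnings;

# Hypothetical helper: hand the file to $callback in records of at most
# $size bytes, by pointing $/ at a reference to an integer.
sub for_each_chunk {
    my ($path, $size, $callback) = @_;
    open(my $fh, '<', $path) or die "Could not open $path: $!";
    binmode($fh);
    local $/ = \$size;    # fixed-size records instead of lines
    while (defined(my $chunk = <$fh>)) {
        $callback->($chunk);
    }
    close($fh) or die "Could not close $path: $!";
    return;
}

# Hypothetical helper: return up to $length bytes starting at byte
# offset $offset, without reading the rest of the file.
sub read_section {
    my ($path, $offset, $length) = @_;
    open(my $fh, '<', $path) or die "Could not open $path: $!";
    binmode($fh);
    seek($fh, $offset, 0) or die "Could not seek in $path: $!";
    my $got = read($fh, my $buf, $length);
    die "Could not read from $path: $!" unless defined $got;
    close($fh) or die "Could not close $path: $!";
    return $buf;
}
```

Note that $/ stays localized while the callback runs, so a callback that reads lines from another handle would need to set $/ itself.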
      Here is a snippet from some code that I wrote "many moons ago". This code will run not just a "little bit" faster than a Win command line...the performance difference is HUGE, even with just 8K buffer. If the search in the buffer can be run with say just two of these 8K buffers, it will be very, very fast. This is a copy routine, but same principle works for reading large files. showfailed() is a tricky thing that is sort of like die() and warn() and has a GUI display context. For the purpose here, it doesn't even matter.
      ##################
      # Binary File Copy and Append
      #
      # bcopy($output, @input_files);
      #
      # first element is the output file path,
      #
      # then input files: file1, file2...file n.
      sub bcopy()
      {
          (my $out, my @in_list)=@_;
          open (OUTBIN, ">", "$out") || showfailed ("unable to open $out");
          binmode(OUTBIN) || showfailed ("unable to set binmode $out");

          foreach my $infile (@in_list)
          {
              open(INBIN, "<", "$infile")|| showfailed ("unable to open $infile");
              binmode(INBIN) || showfailed ("unable to set binmode $infile");
              while (read(INBIN, my $buff, 8 * 2**10))
              {
                  print OUTBIN $buff;
              }
              close(INBIN) || showfailed("unable to close $infile");
              print "$infile appended to $out\n";
          }
          close(OUTBIN) || showfailed("unable to close $out");
      } #end of bcopy
        This code will run not just a "little bit" faster than a Win command line...the performance difference is HUGE, even with just 8K buffer.

        That's some strange code and a big claim. I thought I'd test the claim, and my first attempt to call bcopy got:

        Too many arguments for main::bcopy at C:\test\junk8.pl line 34, near "@in )"

        Once I removed the useless prototype:

        #! perl -sw
        use 5.010;
        use strict;

        sub bcopy {
            (my $out, my @in_list)=@_;
            open (OUTBIN, ">", "$out") || showfailed ("unable to open $out");
            binmode(OUTBIN) || showfailed ("unable to set binmode $out");

            foreach my $infile (@in_list) {
                open(INBIN, "<", "$infile")|| showfailed ("unable to open $infile");
                binmode(INBIN) || showfailed ("unable to set binmode $infile");
                while (read(INBIN, my $buff, 8 * 2**10)) {
                    print OUTBIN $buff;
                }
                close(INBIN) || showfailed("unable to close $infile");
                print "$infile appended to $out\n";
            }
            close(OUTBIN) || showfailed("unable to close $out");
        }

        my @in = glob shift;
        my $out = shift;

        say time;
        bcopy( $out, @in );
        say time;

        and ran it on 10x 128meg files:

        [10:11:26.67} C:\test>junk8 *.jnk bigjnk.out
        1247908624
        bugjunk1.jnk appended to bigjnk.out
        bugjunk10.jnk appended to bigjnk.out
        bugjunk2.jnk appended to bigjnk.out
        bugjunk3.jnk appended to bigjnk.out
        bugjunk4.jnk appended to bigjnk.out
        bugjunk5.jnk appended to bigjnk.out
        bugjunk6.jnk appended to bigjnk.out
        bugjunk7.jnk appended to bigjnk.out
        bugjunk8.jnk appended to bigjnk.out
        bugjunk9.jnk appended to bigjnk.out
        1247908656

        32 seconds. Then again with xcopy:

        [10:22:58.79} C:\test>xcopy /Y *.jnk bigjnk.out
        Does bigjnk.out specify a file name
        or directory name on the target
        (F = file, D = directory)? f
        C:bugjunk1.jnk
        C:bugjunk10.jnk
        C:bugjunk2.jnk
        C:bugjunk3.jnk
        C:bugjunk4.jnk
        C:bugjunk5.jnk
        C:bugjunk6.jnk
        C:bugjunk7.jnk
        C:bugjunk8.jnk
        C:bugjunk9.jnk
        10 File(s) copied

        [10:23:06.51} C:\test>

        Even with the time it took me to respond to the dumb prompt, it took just 8 seconds.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Reading a VERY LARGE file with SINGLE line as content!
by BrowserUk (Pope) on Jul 17, 2009 at 22:22 UTC

    How are you going to process it? I.e., can you process it in smaller chunks than the whole file? If so, read it in smaller chunks using read (or by setting (say) $/ = \4192;).


Re: Reading a VERY LARGE file with SINGLE line as content!
by Perlbotics (Abbot) on Jul 17, 2009 at 22:24 UTC

    That depends on what you would like to do with the single line. Is it text? Is it binary information? If you can process that line piecewise, then you could use sysread, for example. Binary information is accessible by unpack after reading the data piecewise. (Updated after JavaFan's comment.)
    A sliding window that is big enough to hold more characters or bytes than the biggest chunk of information your program needs to process might come in handy too.

    HTH
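One way the sliding-window idea could look, as a sketch only: search for a literal string while reading in blocks, carrying over just enough of the previous block that a match straddling a block boundary is still found. The sub name find_all and its parameters are hypothetical, and a literal byte string stands in for "the biggest chunk of information" the program needs.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets of every occurrence of the
# literal string $needle in $path, reading $chunk_size bytes at a time.
sub find_all {
    my ($path, $needle, $chunk_size) = @_;
    die "needle must not be empty" unless length $needle;
    my $overlap = length($needle) - 1;   # bytes carried over between reads

    open(my $fh, '<', $path) or die "Could not open $path: $!";
    binmode($fh);

    my @hits;
    my $window = '';
    my $base   = 0;                      # file offset of $window's first byte
    while (1) {
        my $got = read($fh, my $buf, $chunk_size);
        die "Could not read from $path: $!" unless defined $got;
        last if $got == 0;
        $window .= $buf;

        my $pos = 0;
        while (($pos = index($window, $needle, $pos)) >= 0) {
            push @hits, $base + $pos;
            $pos++;                      # also allow overlapping matches
        }

        # Keep only the tail that could begin a match completed by the next read.
        if (length($window) > $overlap) {
            my $drop = length($window) - $overlap;
            substr($window, 0, $drop, '');
            $base += $drop;
        }
    }
    close($fh) or die "Could not close $path: $!";
    return @hits;
}
```

Because the retained tail is one byte shorter than the needle, a match reported in one pass cannot be reported again in the next.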

      I'm not sure how unpack is going to help you. To be able to use unpack, you'd first have to read in the data. And that's where the OP has the problem. And if the line read in is already too large, think what the memory usage will be if each couple of bytes is turned into a different SV.
Re: Reading a VERY LARGE file with SINGLE line as content!
by linuxer (Deacon) on Jul 17, 2009 at 22:28 UTC

    I don't see a problem with your code. I tried it on my machine and it worked fine. It just takes some time to read the data from disk, almost 10 seconds...

    What did you do to investigate this issue?

    Did you make sure, that the script really hangs at that point?

    Maybe the data is read correctly and the script hangs in a loop afterwards?

    Right now, it is just guessing what might be happening in the background...

    Can you provide more details?

    You could also try to use read or sysread to read chunk by chunk from the filehandle. I think it will take more time, but it wouldn't try to read the data in "one run"...
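To check whether the read itself is what stalls, a chunked sysread loop that reports a running byte count is one option. This is a sketch under assumptions: count_bytes_chunked and its optional progress callback are invented names, and the chunk size is the caller's choice.

```perl
use strict;
use warnings;

# Hypothetical helper: read $path with sysread in blocks of $chunk_size
# bytes and return the total number of bytes read. If $on_progress is
# given, it is called with the running total after every block.
sub count_bytes_chunked {
    my ($path, $chunk_size, $on_progress) = @_;
    open(my $fh, '<', $path) or die "Could not open $path: $!";
    binmode($fh);

    my $total = 0;
    while (1) {
        my $got = sysread($fh, my $buf, $chunk_size);
        die "sysread failed on $path: $!" unless defined $got;
        last if $got == 0;               # end of file
        $total += $got;
        $on_progress->($total) if $on_progress;
    }
    close($fh) or die "Could not close $path: $!";
    return $total;
}
```

If the total climbs steadily to the file size, the read is merely slow and any hang is more likely somewhere after it.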

Re: Reading a VERY LARGE file with SINGLE line as content!
by Marshall (Prior) on Jul 18, 2009 at 00:19 UTC
    I like this:
    open(DAT, $data_file) || die("Could not open file!");
    print "file opened successfully!! \n";
    Now I would suggest:
    while (<DAT>)
    {
        ...do something....
        ...
        print if /$pattern/; # simple first attempt
    }
    The above will work as it processes <DAT> line by line and prints.
    Now you want a subset of the <DAT> file and start working on how to do that.
      That's not really any different from what the OP is doing, is it? The problem is, as explained by the OP, that the file only contains a single line. Reading that single line is a problem. Reading that single line from the guard of a while statement isn't suddenly going to fix it.
        This is a misunderstanding. The disk system (hardware, micro-code on the disk hardware, I/F board (hardware and also micro-code)), in conjunction with the O/S driver and the O/S, does a lot. I have a prototype IBM drive that after 11 years is failing. I lost a Seagate and a WD drive last year during a massive power failure.

        ALL of these sub-systems fail. I am not saying different. The question appeared to be "how do I simulate a failure"? I tried to help with that question. All hard drives will fail. It is not "if". It is just "when". I tried to help simulate "when". Update:

        OP says that he has 120+ MB of data and no "\n" line breaks. I don't believe that because it is so far out of the norm that a question is reasonable.

Re: Reading a VERY LARGE file with SINGLE line as content!
by Bloodnok (Vicar) on Jul 18, 2009 at 15:46 UTC
    As an alternative to reading fixed length data blocks, does the source data comprise records separated by char(s) other than \n ?

    If so, you could try setting $/ to this value (thus modifying the input record separator) and using Marshall's suggestion.

    A user level that continues to overstate my experience :-))
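A sketch of the separator idea above: set $/ to whatever string delimits the records and keep the ordinary while (<...>) loop. The sub name for_each_record is invented for illustration, and the separator is an assumption, since the thread never says what (if anything) delimits the OP's data.

```perl
use strict;
use warnings;

# Hypothetical helper: call $callback once per record, where records are
# delimited by $separator (for example "|" or "\r") rather than "\n".
sub for_each_record {
    my ($path, $separator, $callback) = @_;
    open(my $fh, '<', $path) or die "Could not open $path: $!";
    local $/ = $separator;               # input record separator
    while (defined(my $record = <$fh>)) {
        chomp $record;                   # chomp strips the current $/
        $callback->($record);
    }
    close($fh) or die "Could not close $path: $!";
    return;
}
```

Only one record is held in memory at a time, so this helps only if individual records are reasonably small.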
      I think you're right here. I suspect the OP's app gets into a disk thrashing mode due to lack of physical memory. To muck around with a 124 MB file, I would figure that a Win XP system needs at least 1 GB of physical memory. If the system has just 512 MB of memory, it will just "auger into the ground".

      I am a fan of taskinfo http://www.iarsn.com/taskinfo.html

      I don't get any "kickback" from this. I have used this program to solve some complex Windows problems. I don't mind giving Igor's program a recommendation because it works well.
