 
PerlMonks  

Handling large files

by tsk1979 (Scribe)
on Apr 27, 2006 at 05:01 UTC ( [id://545950] )

tsk1979 has asked for the wisdom of the Perl Monks concerning the following question:

I was having a discussion in the CB about how to reduce overhead with very large files where the area of interest is just a small part of the file. I found solutions to some cases but not to all.

Case 1: the last 100 lines of the file (1,000,000 lines total) are of interest. Solution: use tail -100 on the file and process that. Also, if the end of the file is what interests you, there is File::ReadBackwards.

But the problem I could not find a solution to is this. I have a file which has 1,000,000 lines, but the area of interest starts only after a certain regexp is found in the file. The standard solution is:
open FH, "file";
while (<FH>) {
    last if /regexp/;
}
while (<FH>) {
    # do something
}
Now, in the above case, if I knew the exact line number at which the regexp occurs, I could do a tail ($length - $line_no) and process the file. But if I don't, the above solution will not work. I was also thinking that if I know the regexp occurs near the end of the file, I could use a ReadBackwards pass to move the file pointer to the regexp and then start parsing forward from there. For that I need to know whether there is a way to move the pointer to a certain line number. For example, if I want while (<FH>) {} to start parsing from line number 10000, how do I do that?
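One way to avoid re-scanning on every run is to scan once, remember the byte offset with tell, and seek straight there later. A minimal sketch, assuming a marker line matching /^START/ and a throwaway demo file (the file name, marker, and contents here are all invented for illustration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small demo file: two junk lines, a START marker, then the
# interesting lines. (Contents are made up for illustration.)
my ($out, $file) = tempfile();
print {$out} "junk 1\njunk 2\nSTART\nwanted 1\nwanted 2\n";
close $out;

# Pass 1: scan once for the marker and remember the byte offset.
open my $in, '<', $file or die "open: $!";
my $offset;
while (<$in>) {
    if (/^START/) {
        $offset = tell($in);   # byte position just AFTER the marker line
        last;
    }
}
close $in;

# Pass 2 (could be a later run, if $offset was saved somewhere):
# jump straight to the remembered position instead of re-scanning.
open $in, '<', $file or die "open: $!";
seek $in, $offset, 0;          # whence 0 = SEEK_SET (absolute offset)
my @after = <$in>;
close $in;
print @after;                  # wanted 1 / wanted 2
```

The second pass never touches the lines before the marker, which is the whole point: the cost of finding the regexp is paid once, not on every run.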

Replies are listed 'Best First'.
Re: Handling large files
by davido (Cardinal) on Apr 27, 2006 at 05:50 UTC

    The problem is that unless you read the file and count record terminators (usually '\n'), you can't possibly know how many bytes into the file a particular line is. ...unless each line is of uniform length, in which case simple math and seek is all it takes. But the fact is that there is no master index, by default, of a file telling Perl or any other program what offset within the file each line starts at.

    Imagine the scenario of someone telling you "Take the third right turn after the stop light." Then you get in your car, and decide, "I'll bet the third right past the stop light is in exactly 2.3 miles."... having never looked at a map and never driven the road before. If you blindly turn right at 2.3 miles, you're going to end up running into a house or something, because you cannot possibly know the exact mileage to that third right turn until you've driven to it at least once, and having done so, taken notice of the mileage.

    So there's the rub. If you want to find a particular point in a file, but you don't know exactly where that point is going to be, you're going to have to skim through the file until you find it. If you're lucky enough to have a situation where the file's future modifications are within your control, you should be able to at least document where that third right turn is found, and keep your "index" up to date if the position ever changes.

    This isn't a problem unique to Perl. It's not even a problem unique to computers. Right now, without looking at any table of contents, find me the first page of chapter three in the book To Kill a Mockingbird. You can't find it without physically skimming through the book.

    To answer the second part of your question: if you use File::ReadBackwards to find the position within the file where the regexp is located, you can use tell to ascertain where within the file you actually are. tell gives an absolute byte offset, unrelated to newlines or delimiters. That location can later be passed to seek to set your next read/write position within the file.


    Dave

Re: Handling large files
by TedPride (Priest) on Apr 27, 2006 at 06:11 UTC
    Actually, a much more efficient method for adding new records to the file would be an additional overflow file, which would take up to maybe 100 records and then get merged with the main file. This saves you having to move (on average) half of a 1,000,000 record file every time you want to insert something.

    And I'd personally just convert the files to some standard database format and work with them through the Perl database handling modules. Why make life more difficult for yourself? MySQL handles 1,000,000-record tables with ease, especially if you use indexes.

Re: Handling large files
by GrandFather (Saint) on Apr 27, 2006 at 05:09 UTC

    Unless the lines are fixed length or you cache line start indexes, you can't.

    If the file is growing then it may be worth keeping a second file that caches line start indexes every few hundred lines. Should be low overhead and easy to implement.
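A sketch of that cached-index idea, in core Perl. The interval ($EVERY) and the demo data are made up; a real index would be written to the second file the reply suggests, and reused across runs:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Record the byte offset of every $EVERY-th line during one forward
# pass, then use the cache to jump near any line without rescanning
# from the top. ($EVERY and the demo file are invented for the demo.)
my $EVERY = 100;

my ($out, $file) = tempfile();
print {$out} "line $_\n" for 1 .. 1000;
close $out;

# Pass 1: build the index. $index[$k] is the offset of line $k*$EVERY + 1.
my @index = (0);
open my $in, '<', $file or die "open: $!";
while (<$in>) {
    push @index, tell($in) if $. % $EVERY == 0;
}
close $in;

# Jump to line 250: seek to the cached offset for line 201, skip 49 lines.
my $want = 250;
my $slot = int( ($want - 1) / $EVERY );
open $in, '<', $file or die "open: $!";
seek $in, $index[$slot], 0;
readline($in) for 1 .. ($want - 1) % $EVERY;
my $line = <$in>;
close $in;
print $line;                   # line 250
```

The worst case reads at most $EVERY - 1 unwanted lines, regardless of how large the file is, at the cost of one full pass to build (and occasionally refresh) the index.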


    DWIM is Perl's answer to Gödel
Re: Handling large files
by ioannis (Abbot) on Apr 27, 2006 at 06:45 UTC
    I have PerlIO::via::Skip on CPAN, which reads lines through a PerlIO layer. It can read the lines after it finds a pattern, read between patterns, skip comments, or skip a number of lines after it finds a pattern, etc. For the case of reading after line 1000, you will probably need to pattern-match the first line and request your lines with the 'after => 1000' parameter.
Re: Handling large files
by cdarke (Prior) on Apr 27, 2006 at 11:46 UTC
    BTW, you can't read a file backwards, at least not on any architectures I have seen. File::ReadBackwards cheats by seeking to within n bytes of EOF, then caching a chunk of data by reading forwards. It only appears to be reading in reverse.
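That seek-near-EOF-then-read-forward trick is easy to sketch in core Perl. The chunk size and demo file below are invented for illustration; File::ReadBackwards does essentially this, but chunk by chunk and more carefully:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Emulate the "read backwards" approach: seek to within $CHUNK bytes
# of EOF and read forward from there. (Sizes and data are made up.)
my $CHUNK = 4096;

my ($out, $file) = tempfile();
print {$out} "line $_\n" for 1 .. 10_000;
close $out;

open my $in, '<', $file or die "open: $!";
my $size  = -s $in;
my $start = $size > $CHUNK ? $size - $CHUNK : 0;
seek $in, $start, 0;
readline($in) if $start > 0;   # discard a probably-partial first line
my @tail = <$in>;
close $in;

print $tail[-1];               # the file's last line
```

Discarding the first line after the seek matters: an arbitrary byte offset almost never lands on a line boundary, so the first "line" read is usually a fragment.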

      That is correct; by default it grabs 8KB chunks, starting at the end. The implementation is quite efficient, however, and the method used to achieve the appearance of reading backwards is fairly transparent to the user.


      Dave

Re: Handling large files
by unobe (Scribe) on Apr 28, 2006 at 04:34 UTC
    Is the data in the file in some sort of structure by the time you get to the regex you're looking for? If so, maybe use Tie::File to skip however many records come before the regex. You can also count back so many records from the last one in the file, without loading the whole file into memory. HTH
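A minimal sketch of the Tie::File approach (Tie::File ships with modern Perls; the demo data is made up). The tied array is a lazy view of the file, so indexing record 250 or the last record does not slurp everything into memory:

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Present the file as an array of records; elements are fetched on
# demand, and by default the record separator is stripped.
my ($out, $file) = tempfile();
print {$out} "record $_\n" for 1 .. 500;
close $out;

tie my @lines, 'Tie::File', $file or die "tie: $!";
my $r250 = $lines[249];        # records are 0-based
my $last = $lines[-1];         # negative indices count from the end
untie @lines;

print "$r250\n$last\n";        # record 250 / record 500
```

Note that writes to the tied array rewrite the underlying file, so for a read-only scan it is safest to treat @lines as read-only.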

Node Type: perlquestion [id://545950]
Approved by GrandFather