
perl's ability to handle LARGE files

by Anonymous Monk
on Nov 28, 2005 at 15:42 UTC (#512221)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

hey guys,

i need to run a regexp on a large file which will keep getting larger and larger. larger as in a few good GBs.

i need your advice on:
1. should i just drop perl regexes and use grep/sed/awk? i mean, would that be more efficient/faster?
2. what is the best way to regex the file? i mean, open it and read line by line and do the regexp
on it or put it all into an array? (probably not a good idea, right?)

thanks guys

Replies are listed 'Best First'.
Re: perl's ability to handle LARGE files
by dragonchild (Archbishop) on Nov 28, 2005 at 15:46 UTC
    It all depends on how much RAM you have and if you need a regex to cross a newline boundary. And grep/awk/sed won't necessarily be faster, depending on the regex.

    In the general case, you will want to read it line by line, applying the regex to each line as needed. Though, I would consider looking at File::ReadBackwards if you just want to deal with the tail end. Plus, have you considered putting stuff in a database?
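
    A minimal sketch of the line-by-line approach, assuming the regex never needs to span a newline. The helper name count_matches and the sample data are made up for illustration; only one line is held in memory at a time, so the file size should not matter much.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count the lines of a file that match a pattern, reading one line at a
# time so memory use stays flat regardless of file size.
# count_matches() is a hypothetical helper name, not from this thread.
sub count_matches {
    my ($path, $regex) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ $regex;
    }
    close $fh;
    return $count;
}

# Demo on a small temporary file standing in for the real data.
my ($tmp, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp} "ok 1\n", "ERROR disk full\n", "ok 2\n", "ERROR timeout\n";
close $tmp;

my $n = count_matches($tmp_path, qr/^ERROR\b/);
print "matched $n lines\n";
```

    Passing the pattern as a qr// object means it is compiled once, outside the loop.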

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: perl's ability to handle LARGE files
by Fletch (Chancellor) on Nov 28, 2005 at 15:58 UTC

    Putting aside the question of whether Perl can handle large files, it'd make more sense to checkpoint where you last finished checking for whatever and resume from that point on the next run, rather than repeat the same work over and over again. See seek and tell.
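
    A rough sketch of that checkpointing idea with seek and tell. The helper name scan_from is hypothetical, and where the offset gets saved between runs (a small state file, say) is left out; the demo simply keeps it in a variable.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scan only the part of $path that was added since the last run.
# $offset is the byte position saved from the previous call (0 on first run).
# Returns a reference to the matching lines and the new offset to save.
# scan_from() is a hypothetical helper name.
sub scan_from {
    my ($path, $offset, $regex) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    seek($fh, $offset, 0) or die "Cannot seek in $path: $!";   # 0 = SEEK_SET
    my @hits;
    while (my $line = <$fh>) {
        push @hits, $line if $line =~ $regex;
    }
    my $new_offset = tell($fh);
    close $fh;
    return (\@hits, $new_offset);
}

# Demo: scan, let the file grow, then scan only the new part.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "ERROR one\nok\n";
close $tmp;

my ($hits, $pos) = scan_from($path, 0, qr/ERROR/);
print "first run:  ", scalar(@$hits), " hit(s)\n";

open my $app, '>>', $path or die "Cannot append to $path: $!";
print {$app} "ok\nERROR two\n";
close $app;

($hits, $pos) = scan_from($path, $pos, qr/ERROR/);
print "second run: ", scalar(@$hits), " hit(s)\n";
```

    One caveat: if the writer is in the middle of a line when the scan reaches the end, the saved offset will point past a partial line. A more careful version would only advance the offset past lines that end in a newline.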

Re: perl's ability to handle LARGE files
by marto (Bishop) on Nov 28, 2005 at 15:49 UTC

    Perhaps the Tie::File module could help you out when dealing with a huge file, one line at a time.

    Hope this helps.
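
    For illustration, a small sketch of Tie::File used read-only on a throwaway file (the sample contents are made up). Note that asking for the line count or the last line makes the module scan the file to locate line boundaries, so on a multi-GB file this may not be cheap.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;
use File::Temp qw(tempfile);

# Small stand-in for the real file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "first\nsecond\nthird\n";
close $tmp;

# Present the file as an array of lines; records are fetched on demand.
# mode => O_RDONLY guards against modifying the file by accident.
tie my @lines, 'Tie::File', $path, mode => O_RDONLY
    or die "Cannot tie $path: $!";

my $count = scalar @lines;   # forces a scan to find every line boundary
my $last  = $lines[-1];      # records come back without the trailing newline
my @hits  = grep { $lines[$_] =~ /^s/ } 0 .. $#lines;   # indexes of matches

print "line count: $count\n";
print "last line:  $last\n";
print "lines starting with 's': @hits\n";

untie @lines;
```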

Re: perl's ability to handle LARGE files
by sweetblood (Parson) on Nov 28, 2005 at 15:46 UTC
    Perl is quite capable of handling files in the gigabyte range, as long as your operating system can. As far as what's the best way, well, that would really depend on exactly what you need to do.



Re: perl's ability to handle LARGE files
by davido (Archbishop) on Nov 28, 2005 at 16:50 UTC

    You've probably already thought through this and know the answer, but just in case...

    Is there no alternative design to the one that creates a large file which keeps getting larger, passing the several-GB mark and beyond? It might be more efficient, from a searching standpoint, to divide the dataset into records and store them in a relational database for easy searching.

    If that's not a possibility, how about at least maintaining fixed-size records or entries in the data file, so that you can seek to specific records within the file quickly, without re-reading it constantly. You could even maintain a separate index file of where "matches" are known to exist.
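
    To make the fixed-size record idea concrete, here is a small sketch. The 32-byte record length, the space padding, and the helper name read_record are all assumptions chosen for the demo, not anything from the original poster's data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $RECORD_LEN = 32;   # assumed record size in bytes, chosen for the demo

# Fetch record number $n (0-based) without reading anything before it.
# read_record() is a hypothetical helper name.
sub read_record {
    my ($fh, $n) = @_;
    seek($fh, $n * $RECORD_LEN, 0) or die "seek failed: $!";   # 0 = SEEK_SET
    my $got = read($fh, my $buf, $RECORD_LEN);
    die "read failed: $!" unless defined $got;
    return if $got < $RECORD_LEN;      # past the end, or a partial record
    $buf =~ s/ +\z//;                  # strip the space padding
    return $buf;
}

# Write three padded records; pack 'A32' pads with spaces to exactly 32 bytes.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} pack("A$RECORD_LEN", $_) for 'alpha', 'bravo ERROR', 'charlie';
close $out;

open my $in, '<:raw', $path or die "Cannot open $path: $!";
my $rec = read_record($in, 1);
print "record 1: $rec\n";
```

    Because every record starts at a multiple of the record length, a separate index of "known matches" only needs to store record numbers.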

    Of course this is all just speculation, but it seems that if you're re-scanning this file at various intervals, and the file is growing to multi-GB sizes, eventually you'll either need to split it up, or cache the search results to maintain scalability.


Re: perl's ability to handle LARGE files
by ikegami (Pope) on Nov 28, 2005 at 15:47 UTC

    Can Perl even handle files "a few good GBs" in size?

    I hear Perl's regexps are not quite as fast as those tools, in order to accommodate its more powerful features. Have you tried benchmarking? The real difference, however, will be determined by how you write the regexp. There are often efficient and inefficient ways of writing regexps.
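
    One way to try the benchmarking suggestion is the core Benchmark module. The sketch below compares two spellings of the same test on synthetic data; the patterns and the data are invented for the demo, and the timings will vary by machine and Perl version, so measure with your own regexp and a sample of your own file.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input: 10_000 lines, one in ten containing the word ERROR.
my @lines = map { $_ % 10 ? "line $_ all good\n" : "line $_ ERROR here\n" }
            1 .. 10_000;

# Two ways of asking "does this line contain ERROR?".
sub count_plain   { my $n = 0; for (@lines) { $n++ if /ERROR/ }       return $n }
sub count_wrapped { my $n = 0; for (@lines) { $n++ if /^.*ERROR.*$/ } return $n }

# Both spellings must agree before their speed is worth comparing.
die "patterns disagree" unless count_plain() == count_wrapped();

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, { plain => \&count_plain, wrapped => \&count_wrapped });
```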

    Unless you have "a few good GBs" of memory and then some, line by line should be faster.

Re: perl's ability to handle LARGE files
by pboin (Deacon) on Nov 28, 2005 at 16:54 UTC

    Can you do this? Yes.

    Should you? Maybe not. Regular expressions are very powerful. So powerful that they can bite you in nasty ways unless you *really* understand what you're asking for.

    Judging from your question, I'd suggest you shy away from regex for this volume of data, and maybe write non-regex code to get the job done. For the tip of the iceberg, SuperSearch on "regex performance" and do some reading...
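
    As one example of what non-regex code could look like, assuming the search is for a fixed string rather than a real pattern: index() does a plain substring search, so characters such as "." have no special meaning. The helper name contains_literal and the sample lines are made up.

```perl
use strict;
use warnings;

# Return true if $line contains the literal string $needle.
# index() returns -1 when the substring is absent.
# contains_literal() is a hypothetical helper name.
sub contains_literal {
    my ($line, $needle) = @_;
    return index($line, $needle) >= 0;
}

# The "." in the needle is matched literally, unlike in an unescaped regex.
for my $line ("GET /a.b HTTP/1.1\n", "GET /aXb HTTP/1.1\n") {
    printf "%-5s %s", (contains_literal($line, '/a.b') ? 'hit' : 'miss'), $line;
}
```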
