perl's ability to handle LARGE files

by Anonymous Monk
on Nov 28, 2005 at 15:42 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

hey guys,

I need to run a regexp on a large file which will keep getting larger and larger. Larger as in a few good GBs.

I need your advice on:
1. Should I just drop Perl regexes and use grep/sed/awk? I mean, would that be more efficient/faster?
2. What is the best way to regex the file? I mean, open it and read it line by line, running the regexp on each line, or slurp the whole thing into an array? (Probably not a good idea, right?)


thanks guys

Replies are listed 'Best First'.
Re: perl's ability to handle LARGE files
by dragonchild (Archbishop) on Nov 28, 2005 at 15:46 UTC
    It all depends on how much RAM you have and whether you need a regex to cross a newline boundary. And grep/awk/sed won't necessarily be faster; it depends on the regex.

    In the general case, you will want to read the file line by line, applying the regex to each line as needed. Though I would consider looking at File::ReadBackwards if you just want to deal with the tail end. Also, have you considered putting the data in a database?
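
    For instance, a minimal line-by-line sketch (the filename and pattern below are placeholders, not anything from the question):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $file    = 'big.log';            # hypothetical filename
        my $pattern = qr/ERROR: (\d+)/;     # hypothetical pattern

        open my $fh, '<', $file or die "Can't open $file: $!";
        while (my $line = <$fh>) {
            if ($line =~ $pattern) {
                print "match on line $.: $1\n";   # $. is the current input line number
            }
        }
        close $fh;

    Memory use stays flat no matter how large the file grows, since only one line is held at a time.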


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: perl's ability to handle LARGE files
by Fletch (Bishop) on Nov 28, 2005 at 15:58 UTC

    Putting aside the question of whether Perl can handle large files, it would make more sense to checkpoint where you last finished checking and resume from that point on the next run, rather than repeat the same work over and over again. See seek and tell.
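
    A rough sketch of that idea, assuming a plain text file and a small state file holding the saved offset (both filenames are made up):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $data  = 'big.log';       # the growing file (hypothetical)
        my $state = 'big.log.pos';   # where we remember our position

        # Resume from the byte offset saved by the previous run, if any.
        my $offset = 0;
        if (open my $in, '<', $state) {
            chomp($offset = <$in> // 0);
            close $in;
        }

        open my $fh, '<', $data or die "Can't open $data: $!";
        seek $fh, $offset, 0 or die "Can't seek to byte $offset: $!";  # 0 = SEEK_SET

        while (my $line = <$fh>) {
            # ... apply the regexp to $line here ...
        }

        # Remember where we stopped, for the next run.
        open my $out, '>', $state or die "Can't write $state: $!";
        print $out tell($fh), "\n";
        close $out;
        close $fh;

    Each run then only scans the bytes appended since the previous run.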

Re: perl's ability to handle LARGE files
by marto (Cardinal) on Nov 28, 2005 at 15:49 UTC
    Hi,

    Perhaps the Tie::File module could help you out when dealing with a huge file, one line at a time.
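
    A minimal sketch, assuming a plain text file and with 'pattern' standing in for whatever you're matching:

        use strict;
        use warnings;
        use Tie::File;

        # The tied array is backed by the file; lines are read on demand,
        # so the whole file is never loaded into memory at once.
        tie my @lines, 'Tie::File', 'big.log'   # hypothetical filename
            or die "Can't tie big.log: $!";

        for my $line (@lines) {
            print "$line\n" if $line =~ /pattern/;
        }

        untie @lines;

    Note that scanning the whole array still touches every line; Tie::File's advantage is random access and flat memory use, not raw speed.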

    Hope this helps.

    Martin
Re: perl's ability to handle LARGE files
by sweetblood (Prior) on Nov 28, 2005 at 15:46 UTC
    Perl is quite capable of handling files in the gigabyte range, as long as your operating system can. As far as what's the best way, well, that would really depend on exactly what you need to do.

    HTH

    Sweetblood

Re: perl's ability to handle LARGE files
by davido (Cardinal) on Nov 28, 2005 at 16:50 UTC

    You've probably already thought through this and know the answer, but just in case...

    Is there no alternative to a design that creates one large file which keeps getting larger, passing the several-GB mark and beyond? It might be more efficient, from a searching standpoint, to divide the dataset into records and store them in a relational database for easy searching.

    If that's not a possibility, how about at least maintaining fixed-size records or entries in the data file, so that you can seek to specific records within the file quickly, without re-reading it constantly? You could even maintain a separate index file of where "matches" are known to exist.
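
    For example, with fixed-size records, fetching record N is a single seek (the record length and filename here are invented for illustration):

        use strict;
        use warnings;

        my $RECLEN = 128;             # assumed fixed record size, in bytes
        my $file   = 'records.dat';   # hypothetical data file

        open my $fh, '<', $file or die "Can't open $file: $!";
        binmode $fh;                  # raw byte offsets, no text-mode translation

        # Jump straight to record #1000 without reading the preceding ones.
        my $recno = 1000;
        seek $fh, $recno * $RECLEN, 0 or die "Can't seek: $!";

        my $got = read $fh, my $record, $RECLEN;
        die "Short read" unless defined $got && $got == $RECLEN;

        print "record $recno: $record\n";
        close $fh;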

    Of course this is all just speculation, but it seems that if you're re-scanning this file at various intervals, and it's growing to multi-GB sizes, eventually you'll either need to split it up or cache the search results to maintain scalability.


    Dave

Re: perl's ability to handle LARGE files
by ikegami (Patriarch) on Nov 28, 2005 at 15:47 UTC

    Can Perl even handle files "a few good GBs" in size? Yes, provided your operating system and your Perl build support large files.

    I hear Perl's regexps are not quite as fast as those of grep/sed/awk, since Perl's engine accommodates more powerful features. Have you tried benchmarking? The real difference, however, will be determined by how you write the regexp. There are often efficient and inefficient ways of writing regexps.

    Unless you have "a few good GBs" of memory and then some, line by line should be faster.
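
    To make that last point concrete, a small Benchmark sketch (the sample line and both patterns are invented) comparing two ways of writing the same digit-matching regexp:

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my $line = "timestamp=20051128 status=OK\n";   # made-up sample input

        cmpthese(-2, {
            # A character class lets the engine consume digits in one pass.
            char_class  => sub { my ($n) = $line =~ /(\d+)/ },
            # A big alternation makes it try each branch at every character.
            alternation => sub { my ($n) = $line =~ /((?:0|1|2|3|4|5|6|7|8|9)+)/ },
        });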

Re: perl's ability to handle LARGE files
by pboin (Deacon) on Nov 28, 2005 at 16:54 UTC

    Can you do this? Yes.

    Should you? Maybe not. Regular expressions are very powerful. So powerful that they can bite you in nasty ways unless you *really* understand what you're asking for.

    Judging from your question, I'd suggest you shy away from regex for this volume of data, and maybe write non-regex code to get the job done. For the tip of the iceberg, SuperSearch on "regex performance" and do some reading...
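
    One concrete form non-regex code can take: if what you're searching for is really a fixed string, index() avoids the regex engine entirely (the filename and search string below are placeholders):

        use strict;
        use warnings;

        my $needle = 'ERROR';   # hypothetical fixed string to find

        open my $fh, '<', 'big.log' or die "Can't open big.log: $!";
        while (my $line = <$fh>) {
            print $line if index($line, $needle) >= 0;
        }
        close $fh;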
