
processing large files

by Anonymous Monk
on Jul 04, 2001 at 00:48 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that does some file processing: reading a file composed of records and compiling some statistics on the data in the records. It won't work with files over 2 GB. How can I make it work for larger files? I have read that Perl 5.6 works with larger files if you use -Duselargefiles. How do I do this? I tried putting this on the shebang line:
#!/usr/local/bin/perl5.6.0 -Duselargefiles
and got the following output:
Recompile perl with -DDEBUGGING to use -D switch

any help on this would be appreciated.

-E

Replies are listed 'Best First'.
Re (tilly) 1: processing large files
by tilly (Archbishop) on Jul 04, 2001 at 06:16 UTC
    The issue is that Perl's internal representation of a file offset is 32-bit. So when Perl tries to seek to a particular location in a very large file, it simply doesn't understand what position 5_003_904_123 is.

    There are two solutions. One is to compile a new version of Perl that does understand how to handle large files. The other is to keep Perl from trying to seek. The easiest way to do the latter is to have Perl read from a pipe rather than a plain filehandle. That can be done by writing your script so that it reads from STDIN, or by converting your opens so that they look kind of like this:

    open(FILE, "cat $file |") or die "Cannot cat '$file': $!";
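
    A minimal sketch of what the record loop over such a piped handle might look like (assuming newline-terminated records and the same $file variable as above; the statistics code itself is just a placeholder):

    open(FILE, "cat $file |") or die "Cannot cat '$file': $!";
    while (<FILE>) {            # one record at a time, no seek() involved
        chomp;                  # strip the record terminator
        # ... update your statistics for this record ...
    }
    close(FILE);
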
      A lot of languages claim to have "no arbitrary size limits" for strings and files, or that the sizes of their thingies are limited only by available memory or hard disk space.

      What I have learned from Tilly's post is that we Perl advocates cannot make such a claim. Perl programs can break if confronted with a file larger than 2 GB. And since hard disks often come with much more space than that, and since the poster obviously has a need to work with such files, this deficiency is not a trivial one.

      "Nothing is difficult for the man who doesn't have to do it himself." All hail the worthy work of the Perl Porters who got us where we are today (with a little help from ActiveState and thus B. Gates.) I intend no disparagement of their magnificent work.
        I wouldn't take this particular one too badly. When Perl 5 came out it wasn't clear how the industry would handle the 32-bit barrier in file-size, so there was no way to write Perl support for it. You can hardly blame people for not writing support for what didn't yet exist.

        According to Dominic Dunlop, Perl had limited support for 64-bit files in 5.005_03, and it is (as noted above) a compile-time option in 5.6. But that compile-time option will not work on all platforms, and not all people on platforms that do support it have used it. And note that support for 64-bit files needs to be present in the operating system. If you are running Linux, that support first appears in the 2.4 kernel. If you are running FreeBSD, it has been there for a few years now.

        Anyway, all 32-bit computer applications have arbitrary limits imposed on them by the hardware. And the above question is the leading edge of a trainwreck we will see in slow motion over the next few years. The problem is that if your naming scheme is 32 bits, then it only has about 4 billion names. Waste a bit here or there, and you are limited to 1 or 2 billion. Segment your architecture in some way, and you find that real-world limits tend to hit at 1, 2, 3, or 4 GB. Often with a hack (such as large file support or Intel's large RAM support) you can push that off in particular places. But, for instance, Perl on a 32-bit platform will never support manipulating a string of length 3 GB. It isn't going to happen. And Perl is not alone.

        But thanks to Moore's law, it is only a question of time before people want to do exactly that. And so, as users' needs keep crossing the magic threshold, people will at first find their workarounds and then have to switch to 64-bit platforms. Which won't be pretty, but it will happen. And the trillion dollar question is whose 64-bit chip is going to win. Right now people tend to use Alphas. AMD's proposal is (I have heard) technically worse but makes for the easiest upgrade from x86. Intel has a huge amount of marketing muscle. In 5 years the answer will seem obvious in retrospect and everyone else is going to be playing catch-up. And playing catch-up for a very long time - the 128-bit conversion is decades off and there is no guarantee that Moore's law will continue until then.

      thanks, I tried opening the file with the "cat $file |" trick and I still have the problem, and the version of perl is compiled with the uselargefiles option - I checked.

      -E
        Thanks everybody! I did get it to work using the uselargefiles option on version 5.6.0. I just had a little glitch in the code, and when I figured that out the uselargefiles option was working.

        thanks again


        -E
Re: processing large files
by wog (Curate) on Jul 04, 2001 at 00:58 UTC

    The -Duselargefiles option is supposed to be given to the Configure program when compiling perl. (Alternatively, you could answer all the prompts it gives if you don't pass the options that turn those off.) According to the documentation that comes with perl's source kit, "in many common platforms like Linux or Solaris this support is on by default". You can test whether your perl is compiled with large file support with perl -V:uselargefiles (it should say uselargefiles='define'; if large files are enabled).

    If large files are supposedly enabled you probably have a less easily solved problem. If they aren't you will need to (try to) recompile perl with large file support turned on.
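
    As a quick sanity check from inside Perl itself, something like this should work (a minimal sketch using the standard Config module; on 5.6 and later builds the lseeksize entry should be there too):

    use Config;
    # 'define' means this perl was built with large file support
    print "uselargefiles = $Config{uselargefiles}\n";
    # 8 means 64-bit file offsets, 4 means you are stuck at 2 GB
    print "lseeksize     = $Config{lseeksize}\n";
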

    --
    perl -e '$|--;$.=q$.$;while($i!=0){$i=(time-$^T)%60;print"\r".$.x$i}'

      thanks, I checked if it was compiled with large files and it was, but I still have the problem. -E
Re: processing large files
by voyager (Friar) on Jul 04, 2001 at 01:34 UTC
    Your problem appears to be that the data file is too big. That would indicate you are reading the whole file into memory and then processing it. You need to switch to an algorithm that processes each line as it is read (if possible). If this is the case, post the code you have so far.

    I don't know about the -Duselargefiles switch, but it appears to refer to the perl script size. Can anyone clarify?
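
    To make the distinction concrete, here is a rough sketch of the two approaches (assuming newline-terminated records in a file named by $fileName; process() is a hypothetical stand-in for whatever per-record work is being done):

    sub process { }    # hypothetical per-record handler - put the statistics code here

    # Slurping: reads every record into memory before doing any work
    open(IN, $fileName) or die "Can't open '$fileName': $!";
    my @records = <IN>;
    process($_) for @records;

    # Streaming: handles one record at a time, so memory use stays small
    open(IN, $fileName) or die "Can't open '$fileName': $!";
    while (my $record = <IN>) {
        process($record);
    }
    close(IN);
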

      I agree. Since you indicated that it is a file of records (plural), it seems like there would be a way to read each record individually and accomplish your goals.

      In fact, can someone give an example of the need to read a >2 GB file and process it as a chunk -- binary files notwithstanding.
      --
      Filmo the Klown

      I think I am reading the file line by line. Here is some code from a test script I am trying to get to work:

      open(IN_FILE, "cat $fileName |") or die "Can't cat inputfile: $!\n";
      while (<IN_FILE>) {
          chop;
          # ... some processing code ...
      }


      any ideas?

      -E
Re: processing large files
by Tuna (Friar) on Jul 04, 2001 at 05:18 UTC
    As a network analyst for a tier-1 ISP, I can happily report that Perl can, and on a daily basis does, process files in the neighborhood of several hundred gigabytes. Are you sure that your limitation is not hardware-related?
      How could it be hardware-related if these files are on disk at that size and C++ programs process them? How do you process your files with Perl? Is it a specific perl? Or a compile option? Or an algorithm?
      thanks

      -E
        Easily! However, you never said that you had other programs that were successfully processing your data. =) As far as my processing is concerned, I churn through Cisco NetFlow data, along with BGP tables, route summaries, and SNMP data, merge it all together, and spit it back out again to create traffic matrices. The algorithms aren't that complicated at all; it's just a shite-load of data that I process continuously. If I tried to run on anything less than the E450 that it runs on, I would be in trouble.
