
how can i work with huge files?

by morfeas (Novice)
on Feb 08, 2005 at 10:29 UTC ( #428991=perlquestion )
morfeas has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I have a problem which I suppose Perl can solve. I need to process very large files (30 MB at minimum) in order to transform them into something else. I found that XML faces the same problem when an XSL stylesheet is used for transformations. How can we process the files without loading them into memory, reading just the parts we need (say, the places that the XSL indicates)? Thanks.

2005-02-08 Janitored by Arunbear - converted square brackets to entities, to avoid creation of bogus links

Replies are listed 'Best First'.
Re: how can i work with huge files?
by Random_Walk (Prior) on Feb 08, 2005 at 10:51 UTC

    By default Perl reads a file one line at a time. You can process each line as it arrives and write it back out to a new file, so only one line is in memory at any moment.

        open FILE_HANDLE, "<", $File_Name
            or die "probs opening $File_Name: $!\n";
        while (<FILE_HANDLE>) {   # defaults to one line at a time
            # do something to this line that is now
            # contained in the $_ variable
        }

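A minimal sketch of the "read a line, process it, write it out" loop described above. The sub name transform_file and its arguments are made up for illustration, and lexical filehandles are used instead of the bareword handle in the snippet:

```perl
use strict;
use warnings;

# Copy $in_path to $out_path one line at a time, passing each line
# through $transform (a code ref). Only one line is held in memory.
# transform_file is a hypothetical helper, not part of any module.
sub transform_file {
    my ($in_path, $out_path, $transform) = @_;

    open my $in,  '<', $in_path  or die "Cannot open $in_path: $!\n";
    open my $out, '>', $out_path or die "Cannot open $out_path: $!\n";

    while (my $line = <$in>) {
        print {$out} $transform->($line);
    }

    close $in;
    close $out or die "Cannot close $out_path: $!\n";
    return;
}
```

It could be called as transform_file('big.txt', 'big.out', sub { uc $_[0] }), where both file names are placeholders. Memory use stays flat however large the input is, provided the lines themselves are of reasonable length.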
    You can tell perl to use any string of characters as the input record separator (the $/ variable) and then read your file one record at a time. $/ defaults to a newline, which is why perl reads the file a line at a time.
        # imagine each record starts with "New Record"
        local $/ = "New Record";
        open FILE_HANDLE, "<", $File_Name
            or die "probs opening $File_Name: $!\n";
        while (<FILE_HANDLE>) {
            # do something to this block now
            # contained in the $_ variable
        }
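
One detail worth spelling out: $/ marks where a record ends, so each chunk read this way carries the separator at its tail, and the text before the first "New Record" marker comes back as a chunk of its own (often empty). A rough sketch that strips the separator with chomp and skips empty chunks; each_record is a hypothetical name:

```perl
use strict;
use warnings;

# Call $callback on every non-empty record in $path, where records are
# delimited by $separator. Returns the number of records processed.
# each_record is a hypothetical helper written for this illustration.
sub each_record {
    my ($path, $separator, $callback) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    local $/ = $separator;    # restored automatically when the sub returns

    my $count = 0;
    while (my $record = <$fh>) {
        chomp $record;                  # removes a trailing $/ if present
        next unless length $record;     # text before the first marker may be empty
        $callback->($record);
        $count++;
    }
    close $fh;
    return $count;
}
```

Only one record at a time is in memory, so this works for files of any size as long as individual records are small.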

    There are also various modules that tie a file on disk to a data structure without reading it all in. This may also be helpful. Have a look at Tie::File for instance.
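
For instance, a small sketch with Tie::File (it ships with perl). The helper name replace_line is invented here. The array is a view onto the file on disk: fetching an element makes the module scan the file up to that line without holding its contents, and assigning to an element rewrites the file.

```perl
use strict;
use warnings;
use Tie::File;

# Replace line number $lineno (0-based) in $path and return the old text.
# replace_line is a hypothetical helper, not part of Tie::File itself.
sub replace_line {
    my ($path, $lineno, $new_text) = @_;

    tie my @lines, 'Tie::File', $path
        or die "Cannot tie $path: $!\n";

    my $old = $lines[$lineno];      # record comes back without its newline
    $lines[$lineno] = $new_text;    # written through to the file on disk
    untie @lines;
    return $old;
}
```

Random access like this is where Tie::File is convenient. For a single front-to-back pass over a big file, a plain while loop over the filehandle is likely to be faster.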

    If you give some more specific details of what you are trying to do we can give more help.


    Pereant, qui ante nos nostra dixerunt!
Re: how can i work with huge files?
by FitTrend (Pilgrim) on Feb 08, 2005 at 15:04 UTC

    If the content lends itself to being organized into columns and records, would it be more beneficial to load this data into a database and do the required tasks there? That would avoid loading the entire content into memory to process it. You could also perform multiple tasks on the same data via DBI.


Re: how can i work with huge files?
by Anonymous Monk on Feb 08, 2005 at 10:54 UTC
    30 MB a very large file? Where are you from? 1980?

    Anyway, you avoid loading the entire file into memory by reading it a bit at a time. For instance, by reading it line by line (one can set $/ to define one's own line ending), or by reading in a certain number of bytes (set $/ to a reference to an integer, or use read or sysread). You may want to seek (or sysseek) to skip parts of the file, but that only works if you know where to go to.
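
A sketch of the fixed-size variant, assuming you already know the byte offset you want to start from. The sub name count_bytes_from is made up, and it only counts bytes to keep the example short:

```perl
use strict;
use warnings;

# Read $path in blocks of $block_size bytes, starting at byte offset
# $start, and return (number of blocks, number of bytes) seen.
# count_bytes_from is a hypothetical helper for illustration only.
sub count_bytes_from {
    my ($path, $start, $block_size) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    seek $fh, $start, 0 or die "Cannot seek in $path: $!\n";   # 0 = SEEK_SET

    local $/ = \$block_size;   # reference to an integer: read that many bytes
    my ($blocks, $bytes) = (0, 0);
    while (my $chunk = <$fh>) {
        $blocks++;
        $bytes += length $chunk;    # the last chunk may be shorter
    }
    close $fh;
    return ($blocks, $bytes);
}
```

In real use the body of the loop would do the actual processing on $chunk. Note that a fixed-size block can cut a line or record in half, so this fits binary or fixed-width data best.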

      Yes. The smallest is about 30 MB. Now imagine some hundreds or thousands of such files needing to be processed by the script we mentioned. Quite '80s, eh? Furthermore, say how many hours it should take if ONE 30 MB file needs 30 minutes; you might then refer to 1980 hours... I need something more (P)ractical(erl). Anyway, I think Tie::File has some potential...
        Since we don't know what you do to a file, how can we assess the "30 minutes to process 30 MB"? Perhaps you have been very clever to do the task at hand in just 30 minutes. Perhaps you have used an extremely slow algorithm, and it could be done in a few seconds. Who knows? Well, you know, but we don't.

        But I maintain: as long as you think that 30 MB is a "very large file", you're not from this era.

Re: how can i work with huge files?
by Anonymous Monk on Feb 08, 2005 at 15:29 UTC
    We encounter the same problem with XML and XSL at my company. We first tried XML::Twig, but it crashed on big files. We managed to make a version using Twig's API, without parsing etc., but again the time wasn't good. While on the subject, take a look at Microsoft .NET studio. This is the solution my company adopted. In .NET they try to handle XML in various ways in order to achieve some flexibility. This might be a case where you should switch from Perl to .NET studio.
