PerlMonks  

Reducing Memory Usage

by PerlingTheUK (Hermit)
on Jul 16, 2004 at 07:10 UTC ( #374928=perlquestion )
PerlingTheUK has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a program that requires text files to be read into memory. These files started out as "small" 5-megabyte ones but have grown to about 125 megabytes. Loading them into objects of my class structure requires about 1.5 GB of memory.
I cannot read only parts of a file, as the data needs to be sorted in a very complicated way.
I am now trying to reduce the memory required. Unfortunately there seems to be little information about how to do so. I have tried a couple of things so far:
  1. The text file I read has fixed column sizes. Instead of splitting the columns into my object hash's key-value pairs, I stored the whole lines in my objects and used the substr function in all get-methods. I did that because I had read that a scalar requires a minimum amount of memory that would exceed the length of most of my values. The effect was an obviously increased runtime, as substr was now called about 16 million times per object, but no noticeable reduction in memory usage. That made me conclude that the minimum size required for a hash's key-value pair is much closer to zero than the (32?) bytes quoted for a scalar.
  2. I also tried limiting memory usage by removing useless whitespace within values, but there is little of it and the memory reduction was less than 2 percent.
  3. My last and more successful effort was to leave all values that equal a standard value undefined and have the get-method return the standard value instead. This reduced memory by about ten percent, but that is still much less than required.
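
For readers who want to picture attempts 1 and 3 together, here is a minimal sketch. It is not the poster's code: the Record class, the field table and the '0000' default are hypothetical, and a blank field is assumed to mean "use the standard value".

```perl
use strict;
use warnings;

package Record;

# Hypothetical layout: field name => [offset, length, standard value].
my %FIELD = (
    location => [ 2,  9, undef  ],
    time     => [ 11, 4, '0000' ],
);

sub new {
    my ($class, $line) = @_;
    return bless \$line, $class;    # the object is one scalar, not a hash
}

sub get {
    my ($self, $name) = @_;
    my ($offset, $length, $default) = @{ $FIELD{$name} };
    my $value = substr $$self, $offset, $length;
    # A blank field falls back to the standard value, so it costs nothing extra.
    return $value =~ /\S/ ? $value : $default;
}

package main;

my $rec = Record->new('XXLONDON01 1234' . ' ' x 65);
print $rec->get('location'), '|', $rec->get('time'), "\n";
```

The trade-off is the one described above: every get() pays for a substr call, while the per-object overhead drops to a single blessed scalar.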
Are there any resources about reducing memory usage out there? Anything that would help me understand how memory is allocated in scalars/arrays/hashes? Or could I reduce the usage by encoding my values into shorter strings that use a wider range of characters, while still being able to sort those elements without decoding them? Decoding would very likely take too much time again.

Thanks.

Re: Reducing Memory Usage
by knoebi (Friar) on Jul 16, 2004 at 07:26 UTC
    I don't know what exactly you need to do with your file, except sorting. You could give Tie::File a try. With it you can access every single line in the file without loading the whole file into memory.

    ciao knoebi

      I have looked at that option briefly, but as I have already had runtime problems with the substr function, I believe it is not a practical alternative.
      I need to compare the 3rd to 11th characters (location) of each line with every other line; if these characters are the same, I need to compare a time (12th to 15th characters) and sort all of those lines according to that time. I also need to convert the time, which is in a strange format, every time I read it, so this data preparation is quite time consuming, and I do not want to run it every single time I need the value.
      Ciao PerlingTheUK
        The facts I've found so far:
        • every line is 80 characters exactly
        • the lines need to be grouped by the field at offsets 2..10
        • the groups need to be sorted by the field at offsets 11..14
        • this second field is a coded time that needs to be decoded

        A possible strategy would be to first 'index' the file, by reading it line by line;

        (untested code follows)
        my %index;
        my $line = 0;
        while (<FILE>) {
            my ($location, $time) = /^..(.{9})(.{4})/;
            push @{ $index{$location} }, [ $time, $line ];
            $line++;
        }

        This would result in a hash keyed on the 'location', with each value being a reference to an array that contains the info you need to sort the lines. This seems to be the minimum amount of info needed to determine the sort order.

        The next step is to sort the arrays by the time values you've stored, and fetch the lines in order from the file:

        (untested code again)
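        for my $location (keys %index) {
            # numeric compare; decode the time first if it is not a plain number
            my @sorted = sort { $a->[0] <=> $b->[0] } @{ $index{$location} };
            for my $entry (@sorted) {
                # 81 = 80 characters plus the newline that ends each record
                seek FILE, 81 * $entry->[1], 0;
                read FILE, $line, 80;
                print $line, "\n";
            }
        }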

        This method should be very memory efficient I think, and not too slow either; the biggest slowdown is probably the seeking around in the file.

        This method works because we know the lengths of records. If we don't we could use the tell function before we read a line, to also store the exact start position of the line in the index...
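
        A minimal sketch of that tell-based variant, assuming lexical filehandles. The helper names build_index and print_sorted are made up for illustration, and the time field is compared as a raw string, so swap in your decoded value if the coded form does not sort correctly:

```perl
use strict;
use warnings;

# Build a location => [ [time, byte offset], ... ] index in one pass.
sub build_index {
    my ($fh) = @_;
    my %index;
    my $offset = tell $fh;               # start of the line about to be read
    while (my $line = <$fh>) {
        if (my ($location, $time) = $line =~ /^..(.{9})(.{4})/) {
            push @{ $index{$location} }, [ $time, $offset ];
        }
        $offset = tell $fh;
    }
    return \%index;
}

# Print every record, grouped by location and ordered by the raw time field.
sub print_sorted {
    my ($fh, $index, $out) = @_;
    for my $location (sort keys %$index) {
        for my $entry (sort { $a->[0] cmp $b->[0] } @{ $index->{$location} }) {
            seek $fh, $entry->[1], 0;
            my $line = <$fh>;            # readline copes with any record length
            print {$out} $line;
        }
    }
    return;
}
```

        Only the two short key fields and one offset per line stay in memory; the records themselves are read back from disk when they are printed.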

Re: Reducing Memory Usage
by Anonymous Monk on Jul 16, 2004 at 07:26 UTC
    Don't put everything into memory. You need a database.
      Yep, I know that would be the right thing for what I want, but my company does not like that, as it is believed (and true) to take a lot of time for administration.
        If your company is unwilling to use the right tool for the job, then you're out of luck. As for administration: no, it doesn't take much if the database is dedicated to this program of yours and doesn't allow remote access. And even if it does take "a lot of time", you need to weigh that up against the costs of continuing as you are, and against the costs when your dataset grows even further.
        If your company doesn't want to use a database because it's too expensive, then they should have no problem in deciding for the cheaper option of getting the machine 2GB of RAM so that it can do the job required. 2GB of RAM can't cost much, can it? Not when you compare it to the time taken and expense of administering a database.
        How much time does it take to administer a 125 MB text file?
        Sorry if I'm missing something, but it's EASY AND FREE to set up a MySQL database. I did it on my laptop in under half an hour. Then you can create indexes and sort efficiently and yada..yada..yada. Besides, SQL is a helluvalot easier to learn than Perl.
Re: Reducing Memory Usage
by BrowserUk (Pope) on Jul 16, 2004 at 07:42 UTC

    What's the average length of a string, and how many are there in your 125 MB file?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
      All lines are 80 chars, adding up to about 1.6 to 1.8 million lines.

        And how many (and what type) of objects does that translate to?


Re: Reducing Memory Usage
by Jonathan (Curate) on Jul 16, 2004 at 08:43 UTC
    Have you thought of using DBD::SQLite? It comes with a self-contained database and is said to be rather fast. Might be what you need.
Re: Reducing Memory Usage
by mhi (Friar) on Jul 16, 2004 at 09:21 UTC
    Since you say that the file size has increased from 5 to 125MB, I'll just guess it won't stop there... So, yes, a Database would be the way to go.

    If that is not feasible, you might want to create a sort-file from your original data that consists of the sorting criteria in a directly (ascii-)sortable fixed-length format starting at the beginning of the line and the original data afterwards, separated by a delimiter.
    This file can then be sorted by any simple sort program. (if you're on a unix box or have cygwin available, 'sort' should do the job easily and you can tweak the buffer size it uses for optimum performance on your box. After all, sorting files is exactly what it was written for!)
    After sorting, just filter out the sorting info and the delimiter again and you have your sorted data.
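
    A rough sketch of those three steps, under some assumptions: the key is the location (offsets 2..10) followed by the raw time field (offsets 11..14), '|' is the delimiter, a `sort` program is on the PATH, and the helper names are made up. If the coded time is not directly sortable, decorate() is the place to decode it.

```perl
use strict;
use warnings;

# Prefix each record with its sort key (location, then time) and a delimiter.
sub decorate {
    my ($line) = @_;
    my ($location, $time) = $line =~ /^..(.{9})(.{4})/
        or die "unexpected record: $line";
    return "$location$time|$line";
}

# Strip the key again once the file has been sorted.
sub undecorate {
    my ($line) = @_;
    return substr $line, 14;    # 9 + 4 key characters + 1 delimiter
}

sub sort_file {
    my ($in_path, $out_path) = @_;
    my $tmp = "$in_path.keyed";

    open my $in,    '<', $in_path or die "$in_path: $!";
    open my $keyed, '>', $tmp     or die "$tmp: $!";
    print {$keyed} decorate($_) while <$in>;
    close $keyed or die "$tmp: $!";
    close $in;

    local $ENV{LC_ALL} = 'C';   # plain byte-wise ordering
    open my $sorted, '-|', 'sort', $tmp or die "cannot run sort: $!";
    open my $out,    '>', $out_path     or die "$out_path: $!";
    print {$out} undecorate($_) while <$sorted>;
    close $out    or die "$out_path: $!";
    close $sorted or die "sort failed";
    unlink $tmp;
    return;
}
```

    Perl only ever holds one line at a time here; the external sort does the heavy lifting with its own temporary files.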

      That sounds interesting, but before starting on that I believe I will definitely go the database way.
      The size is likely to level off at 150 to 175 MB.
      Thank you all for your answers. I am aware that Perl likes to be slightly thriftless when it comes to memory usage. Nevertheless, I would like to know whether there are any known techniques in Perl to reduce memory usage (apart from those that help to avoid memory leaks). Does anyone know any links, documentation or books about this and closely related problems?
        Your selected algorithm is the best way to control Perl's memory usage.

        First, I might suggest that you decode the "weird" date in your file ONE time, by going through the large file once and rewriting it to a new file with the "proper" date.

        Second, if your Perl program is just a sorting thing, (or that is at least a major function of it), then if it's a big enough problem, purchasing a dedicated specialized sort program for your OS might be a better investment. Syncsort is such a product that may fit your needs. There are versions for Windows and for most important flavors of UNIX.


      I completely agree with you (mhi). Imagine a few months
      later, loading a 200, 300 or 400 MB file into
      memory... It's crazy!
      There are so many free databases, like MySQL. You should
      think carefully about it.

      -DBC
Re: Reducing Memory Usage ( under 10%)
by BrowserUk (Pope) on Jul 16, 2004 at 09:52 UTC

    Okay. This is just a skeleton, but this creates 50,000 Bus objects, and gives each of them 33 x 80-byte timetables. All are individually gettable and settable. All fully OO (externally).

    Total data: 50,000 * 33 * 80 = 125 MB.

    Total process memory consumed: 140 MB.

    Adding methods to manipulate the data is just a case of each method calling the get() routine and then splitting the data into its constituent bits to manipulate. Trading a little time for memory.

    Or, if you need text-key access to your buses, using a hash pushes it to 150 MB.
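
    The skeleton itself is not reproduced in this thread, so the following is only one guess at how such a class could look, not BrowserUk's code: each Bus is a blessed reference to a single string that holds all of its 80-byte timetables back to back, and get()/set() address a slot with substr. No memory figures are claimed for this sketch.

```perl
use strict;
use warnings;

package Bus;

use constant RECLEN => 80;

sub new {
    my ($class) = @_;
    my $data = '';
    return bless \$data, $class;    # one string per bus, no hash or array
}

# Return timetable number $n (0-based) as an 80-byte string.
sub get {
    my ($self, $n) = @_;
    return substr $$self, $n * RECLEN, RECLEN;
}

# Store $record as timetable number $n, padded or truncated to 80 bytes.
sub set {
    my ($self, $n, $record) = @_;
    my $need = ($n + 1) * RECLEN;
    $$self .= ' ' x ($need - length $$self) if length($$self) < $need;
    substr($$self, $n * RECLEN, RECLEN) = pack 'A' . RECLEN, $record;
    return $self;
}

package main;

my $bus = Bus->new;
$bus->set($_, "timetable $_") for 0 .. 32;
print length($$bus), "\n";    # 33 slots of 80 bytes each
```

    Keeping 50,000 such objects in a plain array avoids per-field scalars entirely, which is where most of the original 1.5 GB is likely to have gone.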


Re: Reducing Memory Usage
by Gilimanjaro (Hermit) on Jul 16, 2004 at 14:19 UTC
    Another approach: (untested code follows)
    my @objects;
    my $pos = tell(FILE);    # start offset of the line about to be read
    while (<FILE>) {
        my ($location, $time) = /^..(.{9})(.{4})/;
        push @objects, bless [ $location, $time, $location.$time, $pos ], 'MyObject';
        $pos = tell(FILE);
    }

    package MyObject;
    use overload 'cmp' => sub { $_[0]->[2] cmp $_[1]->[2] };
    sub location { return shift->[0] }
    sub time     { return shift->[1] }
    sub record {
        # FILE was opened in package main, so qualify the handle here
        seek main::FILE, shift->[3], 0;
        my $buf;
        read main::FILE, $buf, 80;
        return $buf;
    }

    The overload would allow plain old sort to work on the array, and should be pretty fast as the keys to sort on are stored already.

    The time conversion could possibly be done by a function which stores previously converted values in a hash, so you can do a cheap hash lookup instead of an expensive conversion for values you've already seen.
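
    A small sketch of such a cache. The real time format is not described in this thread, so decode_time() below is only a placeholder that assumes HHMM and returns minutes since midnight; cached_time() is the part that matters.

```perl
use strict;
use warnings;

my %decoded;    # cache: raw 4-character field => decoded value

# Placeholder conversion (assumption: HHMM => minutes since midnight).
sub decode_time {
    my ($raw) = @_;
    my ($h, $m) = $raw =~ /^(\d\d)(\d\d)$/ or die "bad time: $raw";
    return $h * 60 + $m;
}

# Decode each distinct raw value only once; later calls are a hash lookup.
sub cached_time {
    my ($raw) = @_;
    $decoded{$raw} = decode_time($raw) unless exists $decoded{$raw};
    return $decoded{$raw};
}
```

    A 4-character field can only take a limited number of distinct values, so the cache stays small even with 1.8 million lines.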

    You'll need to make sure the filehandle stays open, possibly in the MyObject package, so the records can be retrieved when they're actually needed.

Re: Reducing Memory Usage
by bunnyman (Hermit) on Jul 16, 2004 at 15:19 UTC
    Anything that makes me understand how memory is allocated in scalars/arrays/hashes?

    Devel::Size
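
    For example, a quick way to compare layouts, assuming Devel::Size is installed from CPAN (it is not a core module). The sample data is made up, and the numbers printed depend on your Perl version and platform:

```perl
use strict;
use warnings;
use Devel::Size qw(size total_size);

my $line   = 'x' x 80;
my %fields = (location => 'LONDON01 ', time => '1234');
my @lines  = ($line) x 1000;

# size() reports the structure itself, total_size() follows references too.
printf "80-byte scalar:       %d bytes\n", size($line);
printf "two-key hash (total): %d bytes\n", total_size(\%fields);
printf "1000 x 80-byte array: %d bytes\n", total_size(\@lines);
```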

Re: Reducing Memory Usage
by periapt (Hermit) on Jul 20, 2004 at 12:22 UTC
    I admit that I don't fully know your situation. However, I would think twice about using a database if you don't otherwise need one. There is a lot of unrelated/unexpected overhead associated with a database. The costs in time, effort and learning curve can be high. That being said, if you need a database generally, and believe this problem is a good one to convince your management to let you install one, then go for it.

    On the other hand, this seems to me to be a simple text manipulation problem. You've had a couple of excellent, low footprint solutions posted already. Take another look at them. I assume that you are reading and processing one file at a time. Basically, you need to
    1. Use unix sort to sort each file (maybe into a temp file) on characters 2..10 (on Windows, use GNU utils sort, they are native windows ports of unix utilities)
    2. Using Perl, read in each group of lines and process accordingly. Since the records are already grouped, you would only need to read in the # of lines in a group + 1 ( 80 * (# of lines + 1)). For better performance, you can read in each file in chunks to meet a specified memory size and process each group in a loop.
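
    Step 2 could look roughly like this. The helper name each_group and the callback interface are made up for illustration, and the input is assumed to be already sorted (or at least grouped) on offsets 2..10:

```perl
use strict;
use warnings;

# Call $callback->($location, \@lines) once per run of lines sharing a location.
sub each_group {
    my ($fh, $callback) = @_;
    my ($current, $group);
    while (my $line = <$fh>) {
        my $location = substr $line, 2, 9;
        if (!defined $current or $location ne $current) {
            # The line just read belongs to the next group: flush the old one.
            $callback->($current, $group) if defined $current;
            ($current, $group) = ($location, []);
        }
        push @$group, $line;
    }
    $callback->($current, $group) if defined $current;
    return;
}
```

    Memory use is bounded by the largest single group plus the one look-ahead line, whatever the size of the file.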

    Another alternative is to
    1. Read the file using Perl and write each line to a temporary file named for its unique id (pos 2..10), maybe decoding pos 11..14 on the way.
    2. sort each file on pos 11..14 and if necessary, cat them together to make a single file again. If you name the temp files properly, you can join the groups in any order you desire or need.

    Of course, none of these options are "sexy" per se but given the file sizes you mentioned, the solutions shouldn't take more than a minute or two to run and they don't take much overhead. Hope this helps

    PJ
    unspoken but ever present -- use strict; use warnings; use diagnostics; (if needed)
