http://www.perlmonks.org?node_id=341556

Murcia has asked for the wisdom of the Perl Monks concerning the following question:

Hi Confreres,

I have a large file with an identifier in each row. Equal ids are unevenly spread through the file. I need the data that follows each id, so I have to collect all the data for the same id, which currently means grepping the whole file once per id. That takes a long time!
My idea is to sort the file by id first, so that I can find the first position of an id in the file and then read lines until the id changes.

eg: 
10 data1
12 data2
13 data4
1  data5
10 data6
12 data7
2  data8
10 data11

sorted: 
1 data5
10 data1 # begin id 10
10 data11
10 data6 #end id 10
12 data2
12 data7
13 data4

Should I load the file into an array, sort it and print it? Or is there a better way (the file is 183M)?

Thanks and GOD saves PERL!

Guido

Replies are listed 'Best First'.
Re: sort large file
by borisz (Canon) on Apr 01, 2004 at 09:16 UTC
    Put the data in a database. DBD::SQLite is easy to use and quick. If you just want to sort the file once, use sort -n file > newfile. Or, quick and dirty: perl -ne 'print if /^10/'.
    Boris
      I will try "sort -n file > out". I need all ids, not only 10! ;-)
        I understood it to mean that you were interested in only one id at a time; 10 was only a simple example.
        Boris
Re: sort large file
by pelagic (Priest) on Apr 01, 2004 at 09:26 UTC
    The best solution depends on a couple of parameters:
    - How often do you run the program (once a day, once a minute?)
    - How static is the data (do you get new entries every time you run your program?)
    - Are you looking for one key at a time (in an online app), or do you scan through all the keys in one run anyway?

    update
    Now that we know you have to read all the keys anyway, just use a hash keyed on your ids and "feed" an array with your data:
    push (@{$myhash{$id}}, "$data");
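    A minimal sketch of that hash-of-arrays idea (a single pass over the file, holding everything in memory; the whitespace-separated id/data layout and the numeric sort on output are assumptions):

      use strict;
      use warnings;

      my %myhash;
      while (<>) {
          chomp;
          # first field is the id, the rest of the line is the data
          my ($id, $data) = /^(\S+)\s+(.*)/ or next;
          push @{ $myhash{$id} }, $data;
      }

      # print the collected data grouped by id (assumes numeric ids)
      for my $id (sort { $a <=> $b } keys %myhash) {
          print "$id $_\n" for @{ $myhash{$id} };
      }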

    pelagic
    -------------------------------------
    I can resist anything but temptation.

      This was my first inclination as well -- note that you don't have to sort the file first if you do it this way. A potential downside is that you wind up with the entire data structure in memory.

      One option to avoid the memory consumption is to use something like DBD::SQLite as mentioned above. Another option (probably slower and bulkier, but not requiring knowledge of DBI and SQL) might be to use Tie::MLDBM to store that hash on disk as a DB_File rather than in memory. That might wind up being larger than the original text file, but at least it'll be easy to access it again in perl.
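      A hedged sketch of that DBD::SQLite route (untested; the database file name, table and column names are only illustrative):

        use strict;
        use warnings;
        use DBI;

        # one-time load into an on-disk SQLite database
        my $dbh = DBI->connect("dbi:SQLite:dbname=ids.db", "", "",
                               { RaiseError => 1, AutoCommit => 1 });
        $dbh->do("CREATE TABLE IF NOT EXISTS rows (id INTEGER, data TEXT)");

        # one transaction around the bulk insert keeps the load reasonably fast
        $dbh->begin_work;
        my $ins = $dbh->prepare("INSERT INTO rows (id, data) VALUES (?, ?)");
        while (<>) {
            chomp;
            my ($id, $data) = /^(\d+)\s+(.*)/ or next;
            $ins->execute($id, $data);
        }
        $dbh->do("CREATE INDEX IF NOT EXISTS rows_id_idx ON rows (id)");
        $dbh->commit;

        # later: fetch all data for one id without rescanning the text file
        my $rows = $dbh->selectall_arrayref(
            "SELECT data FROM rows WHERE id = ?", undef, 10);
        print "10 $_->[0]\n" for @$rows;
        $dbh->disconnect;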

      Note: if you use Tie::MLDBM, you'll need to extract the array from the hash to a temporary array, push your data to the temporary array, and then store the temporary array back in the hash. Read the pod for more details. E.g.:

      # Code not tested - consider it conceptual
      use strict;
      use warnings;
      use Fcntl;        # for the O_CREAT and O_RDWR flags
      use Tie::MLDBM;

      tie my %hash, 'Tie::MLDBM', {
          'Serialise' => 'Storable',
          'Store'     => 'DB_File',
      }, 'mybighash.dbm', O_CREAT|O_RDWR, 0640
          or die $!;

      while (<>) {
          chomp;
          my ($id, $data) = /(\d+)\s+(.*)/;
          my $aref = $hash{$id} || [];
          push @{$aref}, $data;
          $hash{$id} = $aref;
      }

      Good luck!

      -xdg

      Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

      Also, if the ids are both small and numeric...

      use strict;
      use warnings;

      my @myarray;
      while (<>) {
          chomp;
          my ($id, $data) = $_ =~ /(\d+)\s(.*)/;
          push @{ $myarray[$id] }, $data;
      }
      for (my $i = 0; $i < @myarray; $i++) {
          next unless defined $myarray[$i];
          foreach my $data (@{ $myarray[$i] }) {
              print "$i $data\n";
          }
      }

      This does the above, and the data comes out sorted by id. However, if the id numbers are large this will run you out of memory, because perl will allocate the array out to the largest id. Also, any id which is not present in the data leaves an undef in its position in the array; these need to be checked for and skipped, or you will get warnings.

      This method basically reads the entire file into memory in an organized format and then dumps it back out in id order. Note though that if you have large id numbers and they are sparse, i.e. most are unused, then the memory use will be larger than with the hash method. If, however, the id numbers are dense and numeric, this will be pretty memory efficient.

      As also mentioned by Abigail-II, if you just want a sorted output file then the compiled utilities (sort) will probably be an order of magnitude or so faster.

Re: sort large file
by castaway (Parson) on Apr 01, 2004 at 09:18 UTC
    2 ideas:
    1. If you're on unix, use 'sort' to presort the file (a sketch of this follows below).
    2. Don't run through the file once for each id; go through it only once, reading in the data and assigning it to a hash keyed by id as you go.

    C.
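    A sketch of idea 1 combined with the original plan: presort externally (e.g. sort -n file > sorted.txt), then read the sorted file once and hand off each id's block as soon as the id changes. The filename and the process_block() helper are made up for illustration:

      use strict;
      use warnings;

      open my $fh, '<', 'sorted.txt' or die $!;

      my $current = '';
      my @block;
      while (<$fh>) {
          chomp;
          my ($id, $data) = /^(\S+)\s+(.*)/ or next;
          if (@block and $id ne $current) {
              process_block($current, \@block);   # whatever you need per id
              @block = ();
          }
          $current = $id;
          push @block, $data;
      }
      process_block($current, \@block) if @block;

      sub process_block {
          my ($id, $lines) = @_;
          print "$id: ", scalar(@$lines), " data lines\n";
      }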

Re: sort large file
by Abigail-II (Bishop) on Apr 01, 2004 at 13:02 UTC
    Use the sort command; it's optimized to sort large files. It can figure out when to sort in memory and when to use temporary files. It's a standard Unix tool, and it has been ported to other OSes, including Windows.

    Abigail

      ...unless the file is truly huge. Sort has its limitations - for instance, on Solaris it tends to use /var/tmp for its temporary files, and if you're dealing with a few gigs worth of data, you can easily run out of disk space for them (painful, true story). Not that I have a better solution :)



      "I have never written bad code. There are merely unanticipated features."
        Solaris sort offers -T to specify a directory for its temporary files.
        Sure, there will always be limitations. But it's usually cheaper/easier to add a few gigs of disk space than it is to add a few gigs of RAM. (And if you do it in Perl, multiply that by a factor as well.) Typically, 'sort' needs about as much free disk space as the size of the file being sorted.

        Abigail

Re: sort large file
by eric256 (Parson) on Apr 01, 2004 at 16:26 UTC

    You could scan through the file once and make a new file for each id, putting the data for that id in its file. Then finding all data for an id is just a matter of opening the right file and reading it. I would recommend a database first, but sometimes that's just not doable :-)

    The original parse for this method might be slower, but subsequent runs would be quicker. Unlike a hash solution this doesn't put everything into memory, which means it's persistent between different runs of the program.
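    A rough sketch of that approach (untested; the by_id/ directory and one-file-per-id naming are assumptions, and keeping one open handle per id only works while the number of distinct ids stays under the open-file limit):

      use strict;
      use warnings;

      mkdir 'by_id' unless -d 'by_id';

      my %fh;                          # one output handle per id
      while (<>) {
          my ($id) = /^(\S+)/ or next;
          unless ($fh{$id}) {
              open $fh{$id}, '>>', "by_id/$id.txt" or die $!;
          }
          print { $fh{$id} } $_;       # keep the original line, newline included
      }
      close $_ for values %fh;

      # later, all data for id 10 is simply the file by_id/10.txt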


    ___________
    Eric Hodges