in reply to sort large file

The best solution depends on a couple of parameters:
- How often do you run the program (once a day, once a minute)?
- How static is the data (do you get new entries every time you run the program)?
- Are you looking up one key at a time (as in an online app), or do you scan through all the keys in one run anyway?

Now that we know you have to read all the keys anyway, just use a hash keyed on your ids and push each record's data onto an array stored under its id:
push @{ $myhash{$id} }, $data;
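
To put that one-liner in context, here is a minimal sketch of the whole loop. It assumes each line looks like "id data" with a numeric id; the sub name group_by_id and the in-memory sample input are only there for illustration, and a real run would open the large file instead.

```perl
use strict;
use warnings;

# Group "id data" lines into a hash of arrays keyed by id.
# Lines that do not match the assumed format are skipped.
sub group_by_id {
    my ($fh) = @_;
    my %by_id;
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $id, $data ) = $line =~ /^(\d+)\s+(.*)$/ or next;
        push @{ $by_id{$id} }, $data;
    }
    return \%by_id;
}

# Sample input held in memory (an assumption for the demo).
my $input = "42 foo\n7 bar\n42 baz\n";
open my $fh, '<', \$input or die "Cannot open in-memory input: $!";
my $by_id = group_by_id($fh);
close $fh;

# Print the groups in numeric id order; no pre-sorted input is needed.
for my $id ( sort { $a <=> $b } keys %$by_id ) {
    print "$id $_\n" for @{ $by_id->{$id} };
}
```

With the sample input this prints "7 bar", then "42 foo" and "42 baz". Only the keys get sorted, not the records.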

I can resist anything but temptation.

Re: Re: sort large file
by xdg (Monsignor) on Apr 01, 2004 at 16:58 UTC

    This was my first inclination as well -- note that you don't have to sort the file first if you do it this way. A potential downside is that you wind up with the entire data structure in memory.

    One option to avoid the memory consumption is to use something like DBD::SQLite as mentioned above. Another option (probably slower and bulkier, but it doesn't require knowing DBI and SQL) might be to use Tie::MLDBM to store the hash on disk in a DB_File database rather than in memory. The result might wind up larger than the original text file, but at least it will be easy to access again from Perl.

    Note: if you use Tie::MLDBM, you can't push onto the nested array in place, because the stored values are serialized copies. You'll need to extract the array from the hash to a temporary array, push your data to the temporary array, and then store the temporary array back in the hash. See the module's POD for more details. E.g.:

    # Code not tested - consider it conceptual
    use strict;
    use warnings;
    use Fcntl;        # exports O_CREAT and O_RDWR
    use Tie::MLDBM;

    tie my %hash, 'Tie::MLDBM', {
        'Serialise' => 'Storable',
        'Store'     => 'DB_File',
    }, 'mybighash.dbm', O_CREAT|O_RDWR, 0640 or die $!;

    while (<>) {
        chomp;
        my ($id, $data) = /(\d+)\s+(.*)/ or next;  # skip malformed lines
        my $aref = $hash{$id} || [];   # fetch a copy of the stored array
        push @{$aref}, $data;
        $hash{$id} = $aref;            # store it back so the change is saved
    }
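
    For comparison, the DBD::SQLite route might look roughly like the sketch below. The table name records, the index name, and the two sub names are made up for illustration, and the demo uses an in-memory database; pointing dbname at a file would keep the data on disk. It needs DBI and DBD::SQLite installed.

```perl
use strict;
use warnings;
use DBI;

# Load "id data" lines into an SQLite table (hypothetical name: records).
sub load_records {
    my ( $dbh, $fh ) = @_;
    $dbh->do('CREATE TABLE IF NOT EXISTS records (id INTEGER, data TEXT)');
    my $insert = $dbh->prepare('INSERT INTO records (id, data) VALUES (?, ?)');
    $dbh->begin_work;    # a single transaction keeps bulk inserts fast
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $id, $data ) = $line =~ /^(\d+)\s+(.*)$/ or next;
        $insert->execute( $id, $data );
    }
    $dbh->commit;
    # Index the id column so per-id lookups do not scan the whole table.
    $dbh->do('CREATE INDEX IF NOT EXISTS records_id ON records (id)');
    return;
}

# Return all data values stored for one id, in insertion order.
sub data_for_id {
    my ( $dbh, $id ) = @_;
    my $rows = $dbh->selectcol_arrayref(
        'SELECT data FROM records WHERE id = ? ORDER BY rowid', undef, $id );
    return @$rows;
}

# Demo with an in-memory database and sample input (both assumptions).
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );
my $input = "42 foo\n7 bar\n42 baz\n";
open my $fh, '<', \$input or die "Cannot open in-memory input: $!";
load_records( $dbh, $fh );
close $fh;
print join( ',', data_for_id( $dbh, 42 ) ), "\n";
```

    The demo prints "foo,baz". A query with ORDER BY id would give the sorted dump without holding everything in memory.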

    Good luck!


    Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.

Re: Re: sort large file
by dga (Hermit) on Apr 01, 2004 at 17:24 UTC

    Also, if the ids are both small and numeric...

    use strict;
    use warnings;

    my @myarray;
    while (<>) {
        chomp;   # removes only a trailing newline, unlike chop
        my ($id, $data) = /(\d+)\s+(.*)/ or next;
        push @{ $myarray[$id] }, $data;
    }

    for my $i (0 .. $#myarray) {
        next unless defined $myarray[$i];   # skip ids absent from the data
        print "$i $_\n" for @{ $myarray[$i] };
    }

    This does the same as above, and the output also comes out sorted by id. However, if the id numbers are large this will run you out of memory, because Perl extends the array out to the largest id. Also, any id that is not present in the data leaves an undef at its position in the array. These need to be checked for and skipped, or you will get warnings.

    This method basically reads the entire file into memory in an organized format and then dumps it back out in id order. Note, though, that if you have large id numbers and they are sparse, i.e. most are unused, then memory use will be larger than with the hash method. If, however, the id numbers are dense and numeric, this will be pretty memory efficient.
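
    To see the dense versus sparse difference in numbers, here is a small sketch that counts how many slots an array ends up with against how many it actually uses. The sub name slot_counts and the sample ids are made up for the demonstration.

```perl
use strict;
use warnings;

# Return (total slots, slots actually holding data) for an array ref.
sub slot_counts {
    my ($aref) = @_;
    my $used = grep { defined } @$aref;   # grep in scalar context counts
    return ( scalar @$aref, $used );
}

# Dense ids: 1 through 5, one record each.
my @dense;
push @{ $dense[$_] }, "row" for 1 .. 5;

# Sparse ids: a single record with a large id.
my @sparse;
push @{ $sparse[1_000_000] }, "row";

printf "dense:  %d slots, %d used\n", slot_counts( \@dense );
printf "sparse: %d slots, %d used\n", slot_counts( \@sparse );
```

    The dense array reports 6 slots with 5 used (slot 0 is empty), while the sparse one reports 1000001 slots for a single record.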

    As Abigail II also mentioned before, if you just want a sorted output file then the compiled utilities (sort) will probably be an order of magnitude or so faster.
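
    If you want to drive the system sort from Perl, something like this sketch should do, assuming a Unix-like sort on the PATH. The sub name sort_by_id and the temporary files are only for illustration. The -s (stable) flag is supported by GNU and BSD sort but is not in POSIX, so drop it if your sort complains.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sort $in numerically on its first field into $out with the system sort.
sub sort_by_id {
    my ( $in, $out ) = @_;
    # -k1,1n: numeric key on field 1 only
    # -s:     keep the input order of lines that share an id
    system( 'sort', '-s', '-k1,1n', '-o', $out, $in ) == 0
        or die "sort failed with status $?";
    return $out;
}

# Demo on a small temporary file (sample data is an assumption).
my ( $in_fh, $in_name ) = tempfile();
print {$in_fh} "42 foo\n7 bar\n42 baz\n";
close $in_fh or die "Cannot close $in_name: $!";
my ( undef, $out_name ) = tempfile();

sort_by_id( $in_name, $out_name );

open my $sorted, '<', $out_name or die "Cannot read $out_name: $!";
print while <$sorted>;
close $sorted;
unlink $in_name, $out_name;
```

    The list form of system avoids the shell, so odd characters in file names are not a problem.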