http://www.perlmonks.org?node_id=863889

snape has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am facing an interesting problem with multiple files and extracting from those files multiple times. To start with, I have a tab-delimited file with 4 columns. For example:

Col1   Col2  Col3  Col4
File1  abc   1000  1010
File2  xyz   2022  3000
File1  def   3211  2300
File4  ghi   4000  4100
File3  jkl   5002  5100
File4  mno   2001  2500
File5  pqr   100   150
File3  Ade   203   340
File2  Sea   101   201

The first column can take the values File1, File2 .. File40. The second column holds unique names. The third and fourth columns hold numbers.

I have about a million records, and for each one I need to extract the substring that lies between the positions given in column 3 and column 4 (inclusive). The problem is finding a method where I do not open any file more than once, i.e. I open the files 40 times in total (since there are 40 files) and extract the strings. I am thinking of using a hash table, but I am not able to come up with a good logic.
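
To illustrate with a single record: for the first row (File1 abc 1000 1010) what I want is roughly the following, assuming the numbers are 1-based character positions in File1's sequence (that is my own reading of the columns, and the sequence here is only stand-in data):

use strict;
use warnings;

# Hypothetical illustration: $sequence stands for the sequence read from File1.
my $sequence = 'A' x 2000;                                  # stand-in data
my $piece = substr $sequence, 1000 - 1, 1010 - 1000 + 1;    # positions 1000..1010, inclusive
print length($piece), "\n";                                 # 11

And here is my attempt so far: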

my %records;                                   # key: "FileN_name", value: the whole record
while (<$INPUT>) {
    chomp;
    my @arr = split /\t/;
    $records{"$arr[0]_$arr[1]"} = "$arr[0]\t$arr[1]\t$arr[2]\t$arr[3]";   # hash for keeping the records
}
close($INPUT);

for my $i (1 .. 40) {
    open my $IN, '<', "File$i" or die $!;      # FASTA file
    while (<$IN>) {
        ## Reading the files and extracting
        ## but I am not able to use the hash table properly
    }
    close($IN);
}
close($OUTPUT);

Since there is more than one string to retrieve from each file, I am not able to do that. Also, please keep in mind that these files are about 100 MB, so slurping them into memory is not a good technique either. Any hints and help will be appreciated.

Replies are listed 'Best First'.
Re: Multiple Extraction from Multiple Files
by toolic (Bishop) on Oct 07, 2010 at 01:06 UTC
    You might find it easier to work with a data structure like this one:
use warnings;
use strict;
use Data::Dumper;
use List::Util qw(min max);

my %data;
while (<DATA>) {
    my ($file, $name, @nums) = split;
    push @{ $data{$file}{$name} }, min(@nums), max(@nums);
}
print Dumper(\%data);

__DATA__
File1 abc 1000 1010
File2 xyz 2022 3000
File1 def 3211 2300
File4 ghi 4000 4100
File3 jkl 5002 5100
File4 mno 2001 2500
File5 pqr 100 150
File3 Ade 203 340
File2 Sea 101 201

    This prints out a nested hash: the first-level keys are the file names, the second-level keys are the names from column 2, and each value is an array reference holding the lower and upper bounds (the min and max of columns 3 and 4).
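
    From there, a minimal sketch of the extraction pass (my own addition, assuming each FileN is a FASTA file holding a single sequence and that columns 3 and 4 are 1-based character positions in it; neither of which the original post confirms):

for my $i (1 .. 40) {
    my $file = "File$i";
    next unless exists $data{$file};          # nothing requested from this file

    # Read the sequence once, skipping FASTA header lines.
    open my $in, '<', $file or die "$file: $!";
    my $seq = '';
    while (<$in>) {
        chomp;
        next if /^>/;                         # skip FASTA headers
        $seq .= $_;
    }
    close $in;

    for my $name (keys %{ $data{$file} }) {
        my ($lo, $hi) = @{ $data{$file}{$name} };
        my $sub = substr $seq, $lo - 1, $hi - $lo + 1;   # 1-based, inclusive
        print "$file\t$name\t$sub\n";
    }
}

    Each of the 40 files is opened exactly once, though this does hold one file's sequence in memory at a time.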

      Thanks a lot that works !!
Re: Multiple Extraction from Multiple Files
by JavaFan (Canon) on Oct 07, 2010 at 02:27 UTC
    Do you have a million records in the file with the 4 columns, or in the 40 files combined? How are your file1 .. file40 structured? Can the pairs of numbers be anywhere? Are the numbers unique? Do they appear together on a line?

    You don't have to slurp in the first file all at once. You could do 40 passes, first doing all the entries of file1, then of file2, etc. Instead of 40 passes, you could also first sort the file. Or do a single pass, writing out the records to 40 different files, which you then process in order.

    Whether you need to read in the other files then depends on how they are structured.
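
    A sketch of the single-pass split into per-file record files (my own illustration of the last option above; the input and output file names are made up):

use strict;
use warnings;

# One pass over the big index file, writing each record to a
# per-file "work list" (File1.records, File2.records, ...).
my %out;
open my $index, '<', 'index.txt' or die $!;     # placeholder name
while (<$index>) {
    my ($file) = split /\t/;
    unless ($out{$file}) {
        open $out{$file}, '>', "$file.records" or die $!;
    }
    print { $out{$file} } $_;
}
close $_ for values %out;
close $index;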

Re: Multiple Extraction from Multiple Files
by aquarium (Curate) on Oct 07, 2010 at 05:46 UTC
    From my understanding of your problem: you have an index file with millions of entries that dictate which file is to be read from, and the numbers somehow indicate positions within that file (quite possibly character positions or such).
    If that's the case, then I would first sort the index file itself by the file number/name column, and then by the character/line index columns as required. Then, as long as the character/column ranges do not overlap within a single file, you should be able to walk the sorted index file and read the indicated files and ranges sequentially.
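    A sketch of that sort step in Perl (this assumes the index fits in memory for sorting, which a million-line, four-column file usually does; the file names are placeholders):

use strict;
use warnings;

# Sort the index by file name (column 1), then numerically by the
# start position (column 3).
open my $in, '<', 'index.txt' or die $!;        # placeholder name
my @rows = map { chomp; [ split /\t/ ] } <$in>;
close $in;

@rows = sort { $a->[0] cmp $b->[0] || $a->[2] <=> $b->[2] } @rows;

open my $out, '>', 'index.sorted' or die $!;    # placeholder name
print {$out} join("\t", @$_), "\n" for @rows;
close $out;
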
    the hardest line to type correctly is: stty erase ^H
Re: Multiple Extraction from Multiple Files
by sundialsvc4 (Abbot) on Oct 07, 2010 at 14:03 UTC

    I do not fully understand your problem.   However, a general solution for handling very large files, rather than using memory, is to perform an on-disk sort.   Any good sort-utility (or module) can handle a 100MB file quite easily.

    When you sort the file, all of the occurrences of any given key-value will be adjacent, and any gaps between key values are known to be empty.   Furthermore, two identically-sorted files can be matched and merged, without searching.

    You say that you are “searching for more than one string.”   If the number of strings being searched-for is reasonable to put in an in-memory hash, then you can simply read each file sequentially, throw the matching records into another file, then go back and process that output file.
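
    As a sketch of that filtering pass (the key list, file names, and the notion of a "matching record" are placeholders of my own):

use strict;
use warnings;

# Load the (reasonably small) set of search keys into a hash.
my %wanted;
open my $keys, '<', 'keys.txt' or die $!;       # placeholder name
while (<$keys>) {
    chomp;
    $wanted{$_} = 1;
}
close $keys;

# Single sequential pass: keep only records whose key matches.
open my $in,  '<', 'bigfile.txt' or die $!;     # placeholder name
open my $out, '>', 'matches.txt' or die $!;     # placeholder name
while (<$in>) {
    my ($key) = split /\t/;                     # assume the key is the first column
    print {$out} $_ if $wanted{$key};
}
close $in;
close $out;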

    If the number of strings is “much larger,” then you have a classic MERGE situation.   Place the strings into a file and sort it.   Sort each of the 40 files in turn and merge them against that key-file.
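
    A bare-bones sketch of that merge, assuming both files are already sorted ascending on the same key (the first whitespace-separated field; every file name here is invented for illustration):

use strict;
use warnings;

# Merge a sorted key file against a sorted data file: print every
# data record whose first field appears in the key file.
open my $keys, '<', 'keys.sorted' or die $!;    # placeholder names
open my $data, '<', 'data.sorted' or die $!;

my $key = <$keys>;
chomp $key if defined $key;

while (defined $key) {
    my $rec = <$data>;
    last unless defined $rec;
    my ($field) = split ' ', $rec;

    # Advance the key file while its key sorts before the record's key.
    while (defined($key) && $key lt $field) {
        $key = <$keys>;
        chomp $key if defined $key;
    }
    print $rec if defined($key) && $key eq $field;
}
close $keys;
close $data;

    Because both inputs are consumed strictly in order, neither file ever has to be held in memory.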