http://www.perlmonks.org?node_id=1032018

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I am in a need for a faster solution. The problem that I am playing with is related to counting. What I have are two files:
1. Tab separated file containing interval coordinates
2. tab separated file containing intervals and some tags
What I need to do is to count how many tags in file 2 there are per interval specified in the first file.

File 1: ID start stop a 12 15 a 12 17 a 13 14 a 14 19 b 10 15 b 12 15 ...
As you can see all possible overlaps are possible.
file 2: Position Tag #a 1 0 2 1 3 0 4 0 ... 10 0 11 1 12 1 13 0 14 0 ... #b 1 0 2 1 3 0 4 0 ... 10 0 11 0 12 1 13 1 14 1 ...
The procedure that I am using goes like this:
#!/usr/bin/perl use strict; use Getopt::Long; use Data::Dumper; my ($optInput1, $optInput2); GetOptions ('i=s' => \$optInput1, 't=s' => \$optInput2); open (IN, "<", $optInput1)|| die "$!"; open (IN1, "<", $optInput2)|| die "$!"; my %hash_pos; # set my intervals while (<IN1>){ chomp; /^([^\t]*)\t([^\t]*)\t([^\t]*)/; $hash_pos{$1}->{$2}->{$3} =0; } close IN1; my %hash_stop; my $p =0; my $id = ""; #count tags per interval while(my $t = <IN>){ chomp($t); if ($t =~/#(.*)\s/){ $p=0; $id = $1; %hash_stop = (); next; } $p++; # sometimes the first column is set to 0 $t =~/^(\d+)\t(\d+)/; # set start and stop boundaries for each interval # if start position has been reached if(exists $hash_pos{$id}->{$p}){ foreach my $o (keys %{$hash_pos{$id}->{$p}}){ $hash_stop{$p}->{$o}=1; } } # foreach interval count ++ if tag == 1 foreach my $keyu (keys %hash_stop){ foreach my $key (keys %{$hash_stop{$keyu}}){ $hash_pos{$id}->{$keyu}->{$key}++ if $2 eq '1'; # if end reached you are don with this # delete it from the hash_stop delete($hash_stop{$keyu}->{$key}) if $p == $key; } } } close IN; print Dumper(\%hash_pos);
but on large files this takes for ever. What i could do is to split file 2 according to its id (a,b,...) and the parallelize the procedure but that is a lot of work. so my question is does anyone has a better idea on how to improve this?

thank you

baxy

UPDATE:

ok.... hm here is the the complete example: i get the right numbers(more or less) file.1

a 12 15 a 12 17 a 13 14 a 14 19 b 10 15 b 12 15
file.2
#a 1 0 2 1 3 0 4 0 5 0 6 1 7 0 8 1 9 1 10 1 11 0 12 0 13 0 14 1 15 1 16 1 17 1 18 1 19 0 20 0 21 0 #b 1 0 2 1 3 0 4 0 5 0 6 1 7 0 8 1 9 1 10 1 11 0 12 1 13 1 14 1 15 1 16 1 17 1 18 1 19 0 20 0 21 0
./test.pl -i file.2 -t file.1
OUTPUT:
$VAR1 = { 'a ' => { '12' => { '17' => 0 } }, 'a' => { '13' => { '14' => 1 }, '12' => { '15' => 2 }, '14' => { '19' => 5 } }, 'b' => { '10' => { '15' => 5 }, '12' => { '15' => 4 } } };
everything looks ok except the first a (which is a bug in my input)

without the mistake in my input file 1 $VAR1 = { 'a' => { '13' => { '14' => 1 }, '12' => { '17' => 4, '15' => 2 }, '14' => { '19' => 5 } }, 'b' => { '10' => { '15' => 5 }, '12' => { '15' => 4 } } };
which is ok

the same code with more consistent variable names. and no obscure comments.

#!/usr/bin/perl use strict; use Getopt::Long; use Data::Dumper; my ($i1, $i2); GetOptions ('i=s' => \$i1, 't=s' => \$i2); open (IN, "<", $i1)|| die "$!"; open (IN1, "<", $i2)|| die "$!"; my %hash1; # set my intervals while (<IN1>){ chomp; /^([^\t]*)\t([^\t]*)\t([^\t]*)/; $hash1{$1}->{$2}->{$3} =0; } close IN1; my %hash2; my $p =0; my $id = ""; #count tags per interval while(<IN>){ chomp; if (/#(.*)/){ $p=0; $id = $1; %hash2 = (); next; } $p++; # sometimes the first column is set to 0 /^(\d+)\t(\d+)/; if(exists $hash1{$id}->{$p}){ foreach my $o (keys %{$hash1{$id}->{$p}}){ $hash2{$p}->{$o}=1; } } foreach my $key1 (keys %hash2){ foreach my $key2 (keys %{$hash2{$key1}}){ $hash1{$id}->{$key1}->{$key2}++ if $2 eq '1'; delete($hash2{$key1}->{$key2}) if $p == $key2; } } } close IN; print Dumper(\%hash1);
UPDATE2:

sorry for deleting my ps comment , i thought it is completely irrelevant since i found "readmore" tags and since it wasn't related to the initially problem.
ok these are some data from some simulation experiments. there is only one file.1 and only one file.2. sizes: there ate approximately 3 mil intervals specified in file.1 and about 50 distinct id's and each id in file.2 has approximately 5-7 * 10^9 lines.
cheers

ps: also the clustering of intervals cannot be assumed.