Searching and Coutning using 2 files with multiple columns

shart3 has asked for the wisdom of the Perl Monks concerning the following question:

Good Morning Monks,

I have a problem that I could do rather easily in Excel, but my files are too large, and I can't seem to figure out a way for perl to do it, although I know it can.

I have two files Boundary.out and DB.out. Boundary.out look like this:

chr1    3204562    3661779
chr1    4334223    4350673
chr1    4481008    4486694
chr1    4764014    4775968
chr1    4797773    4836816
chr1    4847574    4887987
chr1    4847574    4887987
chr1    4848208    4887987
chr1    4900049    5009660
chr1    5073053    5152630
[download]

and DB.out looks like this:

Xkr4    chr1    3204562    3661779    -    3    457.217
Rp1    chr1    4334223    4350673    -    4    16.45
Sox17    chr1    4481008    4486694    -    5    5.686
Mrpl15    chr1    4764014    4775968    -    5    11.954
Lypla1    chr1    4797773    4836816    +    9    39.043
Tcea1    chr1    4847574    4887987    +    10    40.413
Tcea1    chr1    4848208    4887987    +    10    39.779
Rgs20    chr1    4900049    5009660    -    5    109.611
[download]

for each row in DB.out I want to count the number of times that a row in Boundary.out occurs where column 0 in Boundary.out eq col1 in DB.out AND col1(Boundary) gt col2(DB) AND col2(Boundary)lt col3(DB). That value should be added to col7(DB).

Here is what I have so far:

open(FILE, @ARGV[0]) || 
    die ("could not open file @ARGV[0]\n");

    while (my $line = <FILE>) {
    chomp $line;
    my ($chr, $start, $stop) = split(/\t/, $line);
    }
close FILE;

open(FILE, @ARGV[1])||die ("could not open file @ARGV[1]\n");
    
    while(<FILE>){
        ($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s
+plit;        
        foreach (line in genes.db){   # I don't know what to put here.
            if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE
+nd){
            $Count++;
            print ;($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,
+$SizeKB,$Count)
            }    
    }
[download]

Can anyone help????????????

Comment on Searching and Coutning using 2 files with multiple columns Select or Download Code

Replies are listed 'Best First'.
Re: Searching and Coutning using 2 files with multiple columns by moritz (Cardinal) on Sep 16, 2009 at 14:27 UTC
This looks like a perfect application for an SQL query, using DBI and DBD::CSV to work with SQL on plain text files. Perl 6 - links to (nearly) everything that is Perl 6.	[reply]
Re^2: Searching and Coutning using 2 files with multiple columns by shart3 (Novice) on Feb 19, 2010 at 15:59 UTC
Ok. I changed my code to do a Mysql query. I have 2 files DB and a MySQL Table. The mysql table is constructed with 3 column: chrom, start, end. DB has 7 columns. For each line in DB, I want to count the number of times that col2 matches chrom and start is BETWEEN col3 and col4. Then I want to print that value in DB col8. Here is the code I have: #!/usr/bin/perl # PERL MODULE use DBI; my $DB_NAME = "H3K36me3"; my $DB_USER = "root"; my $DB_PASS = "password"; my $dbh = DBI->connect("DBI:mysql:$DB_NAME","$DB_USER","$DB_PASS"); open (FILE,"@ARGV[0]")\|\|die "usage: perl MySQL.query.pl <DB> "; @array=<FILE>; close(FILE); foreach $line (@array){ my ($Gene,$chr_id,$left,$right,$Strand,$ExonCount,$SizeKB)=split(/ +\t/,$line); $sql = "select count(*) from H3K36me3 where chrom =\"$chr_id\" AND + start between $left AND $right"; $sth= $dbh->prepare($sql); @Count=$sth->execute()\|\|die "problem Here!\n"; @Count=$sth->fetchrow(); while (@Count = $sth->fetchrow()) { print "@Count[0]\n"; } } [download] The problem is that I can print it to the screen and it will slowly work (the MySQL table is 20 million rows), but when I go to print to file with ' perl MySQL.query.pl DB > Out.txt' it stops around ~6000 rows. It actually stops in a wierd way, like if the actual count is 30, it will only print 3. How do I deal with this?	[reply] [d/l]
Re^3: Searching and Coutning using 2 files with multiple columns by moritz (Cardinal) on Feb 19, 2010 at 16:56 UTC
There are lots of ways in which you can improve your code. The first is to tell mysql to build indexes on the `chrom` and the `start` columns. The second is to use prepared statements and `execute()` as show in the DBI documentation and in this tutorial. If you use the `RaiseError` option in `DBI->connect`, you can leave out the `\|\|die` and get much better error messages. It actually stops in a wierd way, like if the actual count is 30, it will only print 3. How do I deal with this? How does it "stop"? Does it hang? or does it terminate? What is the exit code? does it run out of memory or disk space? Perl 6 - links to (nearly) everything that is Perl 6.	[reply] [d/l] [select]
Re^4: Searching and Coutning using 2 files with multiple columns by shart3 (Novice) on Feb 19, 2010 at 17:52 UTC
Re^4: Searching and Coutning using 2 files with multiple columns by shart3 (Novice) on Feb 19, 2010 at 18:08 UTC
Re: Searching and Coutning using 2 files with multiple columns by kennethk (Abbot) on Sep 16, 2009 at 14:58 UTC
Ignoring moritz's very good suggestion, you could perform a naive search by storing the contents of the first file in an array of arrays. This would end up looking something like (untested): my @file1 = (); open(FILE, @ARGV[0]) \|\| die ("could not open file @ARGV[0]\n"); while (my $line = <FILE>) { chomp $line; my ($chr, $start, $stop) = split(/\t/, $line); push @file1, [$chr, $start, $stop]; } close FILE; open(FILE, @ARGV[1])\|\|die ("could not open file @ARGV[1]\n"); while(<FILE>){ ($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s +plit; # foreach (line in genes.db){ # I don't know what to put here +. foreach my $line (@file1){ my ($chr, $start, $stop) = @$line; if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE +nd){ $Count++; print ;($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount, +$SizeKB,$Count) } } [download] Note that your original concept had some scoping issues.	[reply] [d/l]
Re^2: Searching and Coutning using 2 files with multiple columns by shart3 (Novice) on Sep 16, 2009 at 16:38 UTC
Changing the Boundary.out file to: `chr1 3204563 3661775 - chr1 3204563 3600000 - chr1 3204500 3660000 - chr1 3204000 3204001 - chr1 3204563 3760000 -` [download] better illustrates the point. There are 3 instances where I should get a positive result for the first line in DB.out. However, rather than counting the number of times a hit occurs, the line from DB.out is printed 3 times (each with the count value of "1"): `Xkr4 chr1 3204562 3661779 - 3 457.217 1 Xkr4 chr1 3204562 3661779 - 3 457.217 1 Xkr4 chr1 3204562 3661779 - 3 457.217 1` [download] I also added a step to reset the count to zero for each loop leaving: my @file1 = (); open(FILE, @ARGV[0]) \|\| die ("could not open file @ARGV[0]\n"); while (my $line = <FILE>) { chomp $line; my ($chr, $start, $stop) = split(/\t/, $line); push @file1, [$chr, $start, $stop]; } close FILE; open(FILE, @ARGV[1])\|\|die ("could not open file @ARGV[1]\n"); while(<FILE>){ ($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s +plit; foreach my $line (@file1){ $Count = 0; my ($chr, $start, $stop) = @$line; if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE +nd){ $Count++; print ("$Gene\t$Chrom\t$ModStart\t$ModEnd\t$Strand\t$ExonC +ount\t$SizeKB\t$Count\n") } } } [download] How do I get it to change the number to 3, and not print 3 times?	[reply] [d/l] [select]
Re^3: Searching and Coutning using 2 files with multiple columns by kennethk (Abbot) on Sep 16, 2009 at 18:10 UTC
If you want to aggregate results, you need to specify what you are aggregating by. What are you trying to count? Based on what you've written, I will guess you want to know the number of lines of Boundary.out that match each line of DB.out. You can accomplish this by just moving your counter and print statements outside of the foreach loop: `my $Count = 0; foreach my $line (@file1){ my ($chr, $start, $stop) = @$line; if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE +nd){ $Count++; } } print ("$Gene\t$Chrom\t$ModStart\t$ModEnd\t$Strand\t$ExonCount +\t$SizeKB\t$Count\n");` [download]	[reply] [d/l]
Re: Searching and Coutning using 2 files with multiple columns by ccn (Vicar) on Sep 16, 2009 at 14:54 UTC
Something like golf `#!/usr/bin/perl -lan BEGIN { die "Usage: $0 Boundary.out DB.out" unless @ARGV == 2; } if ( $isDB ) { print $_, "\t" . grep { $_->[0] > $F[2] && $_->[1] < $F[3] } @{$chr{$F[1]}} } else { push @{$chr{$F[0]}}, [ @F[1, 2] ]; $isDB = eof; }` [download]	[reply] [d/l]
Re^2: Searching and Coutning using 2 files with multiple columns by shart3 (Novice) on Sep 16, 2009 at 17:54 UTC
ccn, Thanks! This seems to work for the most part (even though I don't understand any of it). However, it is replacing the chr# with the count number. For example if the value is 2, I get `Xkr4 2hr1 3204562 3661779 - 3 457.217` [download] or of the number is larger (say 12): `Xkr4 12r1 3204562 3661779 - 3 457.217` [download] I would like it to be: `Xkr4 chr1 3204562 3661779 - 3 457.217 2 or Xkr4 chr1 3204562 3661779 - 3 457.217 12` [download]	[reply] [d/l] [select]
Re^3: Searching and Coutning using 2 files with multiple columns by ccn (Vicar) on Sep 16, 2009 at 18:51 UTC
Hmm... It works for me as desired on your sample data. What delimiters do you use between columns? What is your OS? Please try this one: `#!/usr/bin/perl -lan BEGIN { die "Usage: $0 Boundary.out DB.out" unless @ARGV == 2; } if ( $isDB ) { print join "\t", @F, ~~grep { $_->[0] > $F[2] && $_->[1] < $F[3] } @{$chr{$F[1]}} } else { push @{$chr{$F[0]}}, [ @F[1, 2] ]; $isDB = eof; }` [download]	[reply] [d/l]
Re^4: Searching and Coutning using 2 files with multiple columns by shart3 (Novice) on Sep 17, 2009 at 15:27 UTC
Re^5: Searching and Coutning using 2 files with multiple columns by ccn (Vicar) on Sep 17, 2009 at 17:48 UTC
Re: Searching and Coutning using 2 files with multiple columns by bichonfrise74 (Vicar) on Sep 17, 2009 at 00:02 UTC
This might help you get started. #!/usr/bin/perl use strict; my $file_1 = <<END; chr1 3204562 3661779 chr1 4334223 4350673 chr1 4481008 4486694 chr1 4764014 4775968 chr1 4797773 4836816 chr1 4847574 4887987 chr1 4847574 4887987 chr1 4848208 4887987 chr1 4900049 5009660 chr1 5073053 5152630 END my @boundary_records; open( my $file, '<', \$file_1 ) or die( "Error could not open $file_1" ); while( <$file> ) { push( @boundary_records, [ split( /\s+/ ) ] ); } close( $file ); my $count = 0; while( <DATA> ) { chomp; my @test_records = split( /\s+/ ); for my $i (@boundary_records) { if ( $test_records[1] eq $i->[0] and $test_records[2] > $i->[1] and $test_records[3] < $i->[2] ) { $count++; } } print "$_ = $count\n"; $count = 0; } __DATA__ Xkr4 chr1 3204562 3661779 - 3 457.217 Rp1 chr1 4334223 4350673 - 4 16.45 Sox17 chr1 4481008 4486694 - 5 5.686 Mrpl15 chr1 4764014 4775968 - 5 11.954 Lypla1 chr1 4797773 4836816 + 9 39.043 Tcea1 chr1 4847574 4887987 + 10 40.413 Tcea1 chr1 4848208 4887987 + 10 39.779 Rgs20 chr1 4900049 5009660 - 5 109.611 [download]	[reply] [d/l]


go ahead... be a heretic
	PerlMonks