Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Searching and Coutning using 2 files with multiple columns

by shart3 (Novice)
on Sep 16, 2009 at 14:17 UTC ( #795622=perlquestion: print w/replies, xml ) Need Help??

shart3 has asked for the wisdom of the Perl Monks concerning the following question:

Good Morning Monks,

I have a problem that I could do rather easily in Excel, but my files are too large, and I can't seem to figure out a way for perl to do it, although I know it can.

I have two files Boundary.out and DB.out. Boundary.out look like this:

chr1 3204562 3661779 chr1 4334223 4350673 chr1 4481008 4486694 chr1 4764014 4775968 chr1 4797773 4836816 chr1 4847574 4887987 chr1 4847574 4887987 chr1 4848208 4887987 chr1 4900049 5009660 chr1 5073053 5152630

and DB.out looks like this:

Xkr4 chr1 3204562 3661779 - 3 457.217 Rp1 chr1 4334223 4350673 - 4 16.45 Sox17 chr1 4481008 4486694 - 5 5.686 Mrpl15 chr1 4764014 4775968 - 5 11.954 Lypla1 chr1 4797773 4836816 + 9 39.043 Tcea1 chr1 4847574 4887987 + 10 40.413 Tcea1 chr1 4848208 4887987 + 10 39.779 Rgs20 chr1 4900049 5009660 - 5 109.611

for each row in DB.out I want to count the number of times that a row in Boundary.out occurs where column 0 in Boundary.out eq col1 in DB.out AND col1(Boundary) gt col2(DB) AND col2(Boundary)lt col3(DB). That value should be added to col7(DB).

Here is what I have so far:

open(FILE, @ARGV[0]) || die ("could not open file @ARGV[0]\n"); while (my $line = <FILE>) { chomp $line; my ($chr, $start, $stop) = split(/\t/, $line); } close FILE; open(FILE, @ARGV[1])||die ("could not open file @ARGV[1]\n"); while(<FILE>){ ($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s +plit; foreach (line in genes.db){ # I don't know what to put here. if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE +nd){ $Count++; print ;($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount, +$SizeKB,$Count) } }

Can anyone help????????????

Replies are listed 'Best First'.
Re: Searching and Coutning using 2 files with multiple columns
by moritz (Cardinal) on Sep 16, 2009 at 14:27 UTC

    This looks like a perfect application for an SQL query, using DBI and DBD::CSV to work with SQL on plain text files.

    Perl 6 - links to (nearly) everything that is Perl 6.

      Ok. I changed my code to do a Mysql query. I have 2 files DB and a MySQL Table. The mysql table is constructed with 3 column: chrom, start, end. DB has 7 columns. For each line in DB, I want to count the number of times that col2 matches chrom and start is BETWEEN col3 and col4. Then I want to print that value in DB col8. Here is the code I have:

      #!/usr/bin/perl # PERL MODULE use DBI; my $DB_NAME = "H3K36me3"; my $DB_USER = "root"; my $DB_PASS = "password"; my $dbh = DBI->connect("DBI:mysql:$DB_NAME","$DB_USER","$DB_PASS"); open (FILE,"@ARGV[0]")||die "usage: perl MySQL.query.pl <DB> "; @array=<FILE>; close(FILE); foreach $line (@array){ my ($Gene,$chr_id,$left,$right,$Strand,$ExonCount,$SizeKB)=split(/ +\t/,$line); $sql = "select count(*) from H3K36me3 where chrom =\"$chr_id\" AND + start between $left AND $right"; $sth= $dbh->prepare($sql); @Count=$sth->execute()||die "problem Here!\n"; @Count=$sth->fetchrow(); while (@Count = $sth->fetchrow()) { print "@Count[0]\n"; } }

      The problem is that I can print it to the screen and it will slowly work (the MySQL table is 20 million rows), but when I go to print to file with ' perl MySQL.query.pl DB > Out.txt' it stops around ~6000 rows. It actually stops in a wierd way, like if the actual count is 30, it will only print 3. How do I deal with this?

        There are lots of ways in which you can improve your code.

        The first is to tell mysql to build indexes on the chrom and the start columns. The second is to use prepared statements and execute() as show in the DBI documentation and in this tutorial.

        If you use the RaiseError option in DBI->connect, you can leave out the ||die and get much better error messages.

        It actually stops in a wierd way, like if the actual count is 30, it will only print 3. How do I deal with this?

        How does it "stop"? Does it hang? or does it terminate? What is the exit code? does it run out of memory or disk space?

        Perl 6 - links to (nearly) everything that is Perl 6.
Re: Searching and Coutning using 2 files with multiple columns
by kennethk (Abbot) on Sep 16, 2009 at 14:58 UTC
    Ignoring moritz's very good suggestion, you could perform a naive search by storing the contents of the first file in an array of arrays. This would end up looking something like (untested):

    my @file1 = (); open(FILE, @ARGV[0]) || die ("could not open file @ARGV[0]\n"); while (my $line = <FILE>) { chomp $line; my ($chr, $start, $stop) = split(/\t/, $line); push @file1, [$chr, $start, $stop]; } close FILE; open(FILE, @ARGV[1])||die ("could not open file @ARGV[1]\n"); while(<FILE>){ ($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s +plit; # foreach (line in genes.db){ # I don't know what to put here +. foreach my $line (@file1){ my ($chr, $start, $stop) = @$line; if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE +nd){ $Count++; print ;($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount, +$SizeKB,$Count) } }

    Note that your original concept had some scoping issues.

      Changing the Boundary.out file to:

      chr1 3204563 3661775 - chr1 3204563 3600000 - chr1 3204500 3660000 - chr1 3204000 3204001 - chr1 3204563 3760000 -

      better illustrates the point. There are 3 instances where I should get a positive result for the first line in DB.out. However, rather than counting the number of times a hit occurs, the line from DB.out is printed 3 times (each with the count value of "1"):

      Xkr4 chr1 3204562 3661779 - 3 457.217 1 Xkr4 chr1 3204562 3661779 - 3 457.217 1 Xkr4 chr1 3204562 3661779 - 3 457.217 1

      I also added a step to reset the count to zero for each loop leaving:

      my @file1 = (); open(FILE, @ARGV[0]) || die ("could not open file @ARGV[0]\n"); while (my $line = <FILE>) { chomp $line; my ($chr, $start, $stop) = split(/\t/, $line); push @file1, [$chr, $start, $stop]; } close FILE; open(FILE, @ARGV[1])||die ("could not open file @ARGV[1]\n"); while(<FILE>){ ($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s +plit; foreach my $line (@file1){ $Count = 0; my ($chr, $start, $stop) = @$line; if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE +nd){ $Count++; print ("$Gene\t$Chrom\t$ModStart\t$ModEnd\t$Strand\t$ExonC +ount\t$SizeKB\t$Count\n") } } }

      How do I get it to change the number to 3, and not print 3 times?

        If you want to aggregate results, you need to specify what you are aggregating by. What are you trying to count? Based on what you've written, I will guess you want to know the number of lines of Boundary.out that match each line of DB.out. You can accomplish this by just moving your counter and print statements outside of the foreach loop:

        my $Count = 0; foreach my $line (@file1){ my ($chr, $start, $stop) = @$line; if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE +nd){ $Count++; } } print ("$Gene\t$Chrom\t$ModStart\t$ModEnd\t$Strand\t$ExonCount +\t$SizeKB\t$Count\n");
Re: Searching and Coutning using 2 files with multiple columns
by ccn (Vicar) on Sep 16, 2009 at 14:54 UTC
    Something like golf
    #!/usr/bin/perl -lan BEGIN { die "Usage: $0 Boundary.out DB.out" unless @ARGV == 2; } if ( $isDB ) { print $_, "\t" . grep { $_->[0] > $F[2] && $_->[1] < $F[3] } @{$chr{$F[1]}} } else { push @{$chr{$F[0]}}, [ @F[1, 2] ]; $isDB = eof; }
      ccn,

      Thanks! This seems to work for the most part (even though I don't understand any of it).

      However, it is replacing the chr# with the count number. For example if the value is 2, I get

      Xkr4 2hr1 3204562 3661779 - 3 457.217

      or of the number is larger (say 12):

      Xkr4 12r1 3204562 3661779 - 3 457.217

      I would like it to be:

      Xkr4 chr1 3204562 3661779 - 3 457.217 2 or Xkr4 chr1 3204562 3661779 - 3 457.217 12

        Hmm... It works for me as desired on your sample data.

        What delimiters do you use between columns? What is your OS?

        Please try this one:

        #!/usr/bin/perl -lan BEGIN { die "Usage: $0 Boundary.out DB.out" unless @ARGV == 2; } if ( $isDB ) { print join "\t", @F, ~~grep { $_->[0] > $F[2] && $_->[1] < $F[3] } @{$chr{$F[1]}} } else { push @{$chr{$F[0]}}, [ @F[1, 2] ]; $isDB = eof; }
Re: Searching and Coutning using 2 files with multiple columns
by bichonfrise74 (Vicar) on Sep 17, 2009 at 00:02 UTC
    This might help you get started.
    #!/usr/bin/perl use strict; my $file_1 = <<END; chr1 3204562 3661779 chr1 4334223 4350673 chr1 4481008 4486694 chr1 4764014 4775968 chr1 4797773 4836816 chr1 4847574 4887987 chr1 4847574 4887987 chr1 4848208 4887987 chr1 4900049 5009660 chr1 5073053 5152630 END my @boundary_records; open( my $file, '<', \$file_1 ) or die( "Error could not open $file_1" ); while( <$file> ) { push( @boundary_records, [ split( /\s+/ ) ] ); } close( $file ); my $count = 0; while( <DATA> ) { chomp; my @test_records = split( /\s+/ ); for my $i (@boundary_records) { if ( $test_records[1] eq $i->[0] and $test_records[2] > $i->[1] and $test_records[3] < $i->[2] ) { $count++; } } print "$_ = $count\n"; $count = 0; } __DATA__ Xkr4 chr1 3204562 3661779 - 3 457.217 Rp1 chr1 4334223 4350673 - 4 16.45 Sox17 chr1 4481008 4486694 - 5 5.686 Mrpl15 chr1 4764014 4775968 - 5 11.954 Lypla1 chr1 4797773 4836816 + 9 39.043 Tcea1 chr1 4847574 4887987 + 10 40.413 Tcea1 chr1 4848208 4887987 + 10 39.779 Rgs20 chr1 4900049 5009660 - 5 109.611

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://795622]
Approved by moritz
Front-paged by tye
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (2)
As of 2021-12-02 19:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    R or B?



    Results (23 votes). Check out past polls.

    Notices?