shart3 has asked for the wisdom of the Perl Monks concerning the following question:
Good Morning Monks,
I have a problem that I could do rather easily in Excel, but my files are too large, and I can't seem to figure out a way for perl to do it, although I know it can.
I have two files Boundary.out and DB.out. Boundary.out look like this:
chr1 3204562 3661779
chr1 4334223 4350673
chr1 4481008 4486694
chr1 4764014 4775968
chr1 4797773 4836816
chr1 4847574 4887987
chr1 4847574 4887987
chr1 4848208 4887987
chr1 4900049 5009660
chr1 5073053 5152630
and DB.out looks like this:
Xkr4 chr1 3204562 3661779 - 3 457.217
Rp1 chr1 4334223 4350673 - 4 16.45
Sox17 chr1 4481008 4486694 - 5 5.686
Mrpl15 chr1 4764014 4775968 - 5 11.954
Lypla1 chr1 4797773 4836816 + 9 39.043
Tcea1 chr1 4847574 4887987 + 10 40.413
Tcea1 chr1 4848208 4887987 + 10 39.779
Rgs20 chr1 4900049 5009660 - 5 109.611
for each row in DB.out I want to count the number of times that a row in Boundary.out occurs where column 0 in Boundary.out eq col1 in DB.out AND col1(Boundary) gt col2(DB) AND col2(Boundary)lt col3(DB). That value should be added to col7(DB).
Here is what I have so far:
open(FILE, @ARGV[0]) ||
die ("could not open file @ARGV[0]\n");
while (my $line = <FILE>) {
chomp $line;
my ($chr, $start, $stop) = split(/\t/, $line);
}
close FILE;
open(FILE, @ARGV[1])||die ("could not open file @ARGV[1]\n");
while(<FILE>){
($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s
+plit;
foreach (line in genes.db){ # I don't know what to put here.
if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE
+nd){
$Count++;
print ;($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,
+$SizeKB,$Count)
}
}
Can anyone help????????????
Re: Searching and Coutning using 2 files with multiple columns
by moritz (Cardinal) on Sep 16, 2009 at 14:27 UTC
|
This looks like a perfect application for an SQL query, using DBI and DBD::CSV to work with SQL on plain text files.
Perl 6 - links to (nearly) everything that is Perl 6.
| [reply] |
|
Ok. I changed my code to do a Mysql query.
I have 2 files DB and a MySQL Table. The mysql table is constructed with 3 column: chrom, start, end.
DB has 7 columns. For each line in DB, I want to count the number of times that col2 matches chrom and start is BETWEEN col3 and col4. Then I want to print that value in DB col8. Here is the code I have:
#!/usr/bin/perl
# PERL MODULE
use DBI;
my $DB_NAME = "H3K36me3";
my $DB_USER = "root";
my $DB_PASS = "password";
my $dbh = DBI->connect("DBI:mysql:$DB_NAME","$DB_USER","$DB_PASS");
open (FILE,"@ARGV[0]")||die "usage: perl MySQL.query.pl <DB> ";
@array=<FILE>;
close(FILE);
foreach $line (@array){
my ($Gene,$chr_id,$left,$right,$Strand,$ExonCount,$SizeKB)=split(/
+\t/,$line);
$sql = "select count(*) from H3K36me3 where chrom =\"$chr_id\" AND
+ start between $left AND $right";
$sth= $dbh->prepare($sql);
@Count=$sth->execute()||die "problem Here!\n";
@Count=$sth->fetchrow();
while (@Count = $sth->fetchrow()) {
print "@Count[0]\n";
}
}
The problem is that I can print it to the screen and it will slowly work (the MySQL table is 20 million rows), but when I go to print to file with ' perl MySQL.query.pl DB > Out.txt' it stops around ~6000 rows. It actually stops in a wierd way, like if the actual count is 30, it will only print 3.
How do I deal with this? | [reply] [d/l] |
|
There are lots of ways in which you can improve your code.
The first is to tell mysql to build indexes on the chrom and the start columns. The second is to use prepared statements and execute() as show in the DBI documentation and in this tutorial.
If you use the RaiseError option in DBI->connect, you can leave out the ||die and get much better error messages.
It actually stops in a wierd way, like if the actual count is 30, it will only print 3. How do I deal with this?
How does it "stop"? Does it hang? or does it terminate? What is the exit code? does it run out of memory or disk space?
Perl 6 - links to (nearly) everything that is Perl 6.
| [reply] [d/l] [select] |
|
|
Re: Searching and Coutning using 2 files with multiple columns
by kennethk (Abbot) on Sep 16, 2009 at 14:58 UTC
|
Ignoring moritz's very good suggestion, you could perform a naive search by storing the contents of the first file in an array of arrays. This would end up looking something like (untested):
my @file1 = ();
open(FILE, @ARGV[0]) ||
die ("could not open file @ARGV[0]\n");
while (my $line = <FILE>) {
chomp $line;
my ($chr, $start, $stop) = split(/\t/, $line);
push @file1, [$chr, $start, $stop];
}
close FILE;
open(FILE, @ARGV[1])||die ("could not open file @ARGV[1]\n");
while(<FILE>){
($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s
+plit;
# foreach (line in genes.db){ # I don't know what to put here
+.
foreach my $line (@file1){
my ($chr, $start, $stop) = @$line;
if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE
+nd){
$Count++;
print ;($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,
+$SizeKB,$Count)
}
}
Note that your original concept had some scoping issues. | [reply] [d/l] |
|
chr1 3204563 3661775 -
chr1 3204563 3600000 -
chr1 3204500 3660000 -
chr1 3204000 3204001 -
chr1 3204563 3760000 -
better illustrates the point. There are 3 instances where I should get a positive result for the first line in DB.out. However, rather than counting the number of times a hit occurs, the line from DB.out is printed 3 times (each with the count value of "1"):
Xkr4 chr1 3204562 3661779 - 3 457.217 1
Xkr4 chr1 3204562 3661779 - 3 457.217 1
Xkr4 chr1 3204562 3661779 - 3 457.217 1
I also added a step to reset the count to zero for each loop leaving:
my @file1 = ();
open(FILE, @ARGV[0]) ||
die ("could not open file @ARGV[0]\n");
while (my $line = <FILE>) {
chomp $line;
my ($chr, $start, $stop) = split(/\t/, $line);
push @file1, [$chr, $start, $stop];
}
close FILE;
open(FILE, @ARGV[1])||die ("could not open file @ARGV[1]\n");
while(<FILE>){
($Gene,$Chrom,$ModStart,$ModEnd,$Strand,$ExonCount,$SizeKB)= s
+plit;
foreach my $line (@file1){
$Count = 0;
my ($chr, $start, $stop) = @$line;
if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE
+nd){
$Count++;
print ("$Gene\t$Chrom\t$ModStart\t$ModEnd\t$Strand\t$ExonC
+ount\t$SizeKB\t$Count\n")
}
}
}
How do I get it to change the number to 3, and not print 3 times? | [reply] [d/l] [select] |
|
If you want to aggregate results, you need to specify what you are aggregating by. What are you trying to count? Based on what you've written, I will guess you want to know the number of lines of Boundary.out that match each line of DB.out. You can accomplish this by just moving your counter and print statements outside of the foreach loop:
my $Count = 0;
foreach my $line (@file1){
my ($chr, $start, $stop) = @$line;
if ($chr eq $Chrom && $start gt $ModStart && $end lt $ModE
+nd){
$Count++;
}
}
print ("$Gene\t$Chrom\t$ModStart\t$ModEnd\t$Strand\t$ExonCount
+\t$SizeKB\t$Count\n");
| [reply] [d/l] |
Re: Searching and Coutning using 2 files with multiple columns
by ccn (Vicar) on Sep 16, 2009 at 14:54 UTC
|
#!/usr/bin/perl -lan
BEGIN {
die "Usage: $0 Boundary.out DB.out"
unless @ARGV == 2;
}
if ( $isDB ) {
print $_, "\t" . grep { $_->[0] > $F[2] && $_->[1] < $F[3] }
@{$chr{$F[1]}}
}
else {
push @{$chr{$F[0]}}, [ @F[1, 2] ];
$isDB = eof;
}
| [reply] [d/l] |
|
ccn,
Thanks! This seems to work for the most part (even though I don't understand any of it).
However, it is replacing the chr# with the count number. For example if the value is 2, I get
Xkr4 2hr1 3204562 3661779 - 3 457.217
or of the number is larger (say 12):
Xkr4 12r1 3204562 3661779 - 3 457.217
I would like it to be: Xkr4 chr1 3204562 3661779 - 3 457.217 2
or
Xkr4 chr1 3204562 3661779 - 3 457.217 12
| [reply] [d/l] [select] |
|
#!/usr/bin/perl -lan
BEGIN {
die "Usage: $0 Boundary.out DB.out"
unless @ARGV == 2;
}
if ( $isDB ) {
print join "\t", @F, ~~grep { $_->[0] > $F[2] && $_->[1] < $F[3] }
@{$chr{$F[1]}}
}
else {
push @{$chr{$F[0]}}, [ @F[1, 2] ];
$isDB = eof;
}
| [reply] [d/l] |
|
|
Re: Searching and Coutning using 2 files with multiple columns
by bichonfrise74 (Vicar) on Sep 17, 2009 at 00:02 UTC
|
This might help you get started.
#!/usr/bin/perl
use strict;
my $file_1 = <<END;
chr1 3204562 3661779
chr1 4334223 4350673
chr1 4481008 4486694
chr1 4764014 4775968
chr1 4797773 4836816
chr1 4847574 4887987
chr1 4847574 4887987
chr1 4848208 4887987
chr1 4900049 5009660
chr1 5073053 5152630
END
my @boundary_records;
open( my $file, '<', \$file_1 )
or die( "Error could not open $file_1" );
while( <$file> ) {
push( @boundary_records, [ split( /\s+/ ) ] );
}
close( $file );
my $count = 0;
while( <DATA> ) {
chomp;
my @test_records = split( /\s+/ );
for my $i (@boundary_records) {
if ( $test_records[1] eq $i->[0]
and $test_records[2] > $i->[1]
and $test_records[3] < $i->[2] ) {
$count++;
}
}
print "$_ = $count\n";
$count = 0;
}
__DATA__
Xkr4 chr1 3204562 3661779 - 3 457.217
Rp1 chr1 4334223 4350673 - 4 16.45
Sox17 chr1 4481008 4486694 - 5 5.686
Mrpl15 chr1 4764014 4775968 - 5 11.954
Lypla1 chr1 4797773 4836816 + 9 39.043
Tcea1 chr1 4847574 4887987 + 10 40.413
Tcea1 chr1 4848208 4887987 + 10 39.779
Rgs20 chr1 4900049 5009660 - 5 109.611
| [reply] [d/l] |
|
|