Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

create a binary matrix

by Anonymous Monk
on Oct 01, 2016 at 06:00 UTC ( [id://1173049]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have some files in a directory with .fq extension such as NC_002755.fq NC_009525.fq NC_009565.fq NC_012943.fq NC_016768.fq NC_016934.fq NC_017026.fq NC_017522.fq NC_017523.fq NC_017524.fq NC_017528.fq NC_018078.fq NC_018143.fq NC_020089.fq NC_020559.fq NC_021054.fq NC_021192.fq NC_021193.fq NC_021194.fq NC_021251.fq NC_021740.fq NC_022350.fq. I have created a file p.txt in the following format:
(3R)-hydroxyacyl-ACP dehydratase subunit HadA NC_021192.1 (3R)-hydroxyacyl-ACP dehydratase subunit HadB NC_017026.1 (dimethylallyl)adenosine tRNA methylthiotransferase NC_009565.1 +NC_002755.2 NC_009525.1 NC_016934.1 1,4-alpha-glucan branching protein NC_021192.1 NC_017522.1 NC +_016934.1 NC_018078.1 NC_020089.1 NC_002755.2 NC_017524.1 + NC_009565.1 NC_012943.1 NC_017523.1 NC_021740.1 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_016934.1 NC_ +017026.1 NC_009525.1 1,4-dihydroxy-2-naphthoyl-CoA synthase NC_018143.2 NC_009565.1 + NC_002755.2 NC_012943.1 NC_017523.1
where NC_.... is the name of files in the directory with .fq extension. I want to create a binary matrix which has all the filenames in horizontal and genes in vertical and print 1 if filename matches else print 0. My script is :
open (FH, "p.txt"); while ($seq=<FH>) { @seq = split /\t/, $seq; print @seq[0]."\t"; opendir(DIR, "."); @files = grep(/\.fq$/,readdir(DIR)); closedir(DIR); foreach $file (@files) { @file1 = split /\./, $file; $file2 = $file1[0]; $size = @files; $file3 = $file2."\t".$file3;} for ($i = 1; $i <= $size; $i++) { @seq1 = split /\./, @seq[$i]; chomp @seq1[0]; if (@seq1[0] eq $file2) {print "1"."\t";} else {print "0"."\t";} } print "\n"; }
but this gives me incorrect output of 0's and 1's.

Replies are listed 'Best First'.
Re: create a binary matrix
by Laurent_R (Canon) on Oct 01, 2016 at 10:14 UTC
    Please indent your code correctly, that would help you see some of your errors or suboptimal loops. Also use the strict and warnings pragmas, this would also help you.

    How about this:

    use strict; use warnings; my @files = qw / NC_002755.fq NC_009525.fq NC_009565.fq NC_012943.fq N +C_016768.fq NC_016934.fq NC_017026.fq NC_017522.fq NC_017523.fq NC_01 +7524.fq NC_017528.fq NC_018078.fq NC_018143.fq NC_020089.fq /; @files = map {s/\.fq$//; $_} @files; while (my $seq = <DATA>) { my @seq = split /\s+/, $seq; printf "%-50s", "@seq[0..1] "; my %found_files = map { s/\.\w+$// ; $_ => 1} grep {/NC_/} @seq[2. +.$#seq]; for my $file (@files) { print exists $found_files{$file} ? "1\t" : "0\t" } print "\n"; } __DATA__ (3R)-hydroxyacyl-ACP dehydratase subunit HadA NC_021192.1 (3R)-hydroxyacyl-ACP dehydratase subunit HadB NC_017026.1 (dimethylallyl)adenosine tRNA methylthiotransferase NC_009565.1 +NC_002755.2 NC_009525.1 NC_016934.1 1,4-alpha-glucan branching protein NC_021192.1 NC_017522.1 NC +_016934.1 NC_018078.1 NC_020089.1 NC_002755.2 NC_017524.1 + NC_009565.1 NC_012943.1 NC_017523.1 NC_021740.1 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_016934.1 NC_ +017026.1 NC_009525.1 1,4-dihydroxy-2-naphthoyl-CoA synthase NC_018143.2 NC_009565.1 + NC_002755.2 NC_012943.1 NC_017523.1
    Which prints this:
    $ perl fq_files.pl (3R)-hydroxyacyl-ACP dehydratase 0 0 0 + 0 0 0 0 0 0 0 0 0 + 0 0 (3R)-hydroxyacyl-ACP dehydratase 0 0 0 + 0 0 0 1 0 0 0 0 0 + 0 0 (dimethylallyl)adenosine tRNA 1 1 1 + 0 0 1 0 0 0 0 0 0 + 0 0 1,4-alpha-glucan branching 1 0 1 + 1 0 1 0 1 1 1 0 1 + 0 1 1,4-dihydroxy-2-naphthoate octaprenyltransferase 0 1 0 + 0 0 1 1 0 0 0 0 0 + 0 0 1,4-dihydroxy-2-naphthoyl-CoA synthase 1 0 1 + 1 0 0 0 0 1 0 0 0 + 1 0
Re: create a binary matrix
by Random_Walk (Prior) on Oct 01, 2016 at 06:56 UTC

    Can you please give us an idea of what data is in the files, and an example of what you would like your output to look like?

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
      there are set of reads in the files but that has nothing to do with the output because with those I have created p.txt. The output I want is like:
      gene NC_002755.fq NC_009525.fq NC_009565.fq NC_012943.fq N +C_016768.fq NC_016934.fq NC_017026.fq NC_017522.fq NC_017 +523.fq NC_017524.fq NC_017528.fq NC_018078.fq NC_018143.f +q NC_020089.fq NC_020559.fq NC_021054.fq NC_021192.fq +NC_021193.fq NC_021194.fq NC_021251.fq NC_021740.fq NC_02 +2350.fq (3R)-hydroxyacyl-ACP dehydratase subunit HadA 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 1 0 +0 0 0 0 (3R)-hydroxyacyl-ACP dehydratase subunit HadB 0 0 0 0 0 + 0 1 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 (dimethylallyl)adenosine tRNA methylthiotransferase 1 1 1 +0 0 1 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0
      the result should print 1 if the filename matches else print 0

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1173049]
Approved by beech
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (3)
As of 2024-04-24 04:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found