Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re: Need Help with Maybe a Regex Issue

by graff (Chancellor)
on Feb 27, 2014 at 04:20 UTC ( [id://1076344]=note: print w/replies, xml ) Need Help??


in reply to Need Help with Maybe a Regex Issue

Based on what you seem to be trying to do in your third while loop (shifting values off of @wkArray), I think you actually want to use hashes instead of arrays.

As it is, your code assumes that the two input files have matching elements in matching order. Do you know that the two files are already sorted in the same way, and actually have matching contents?

It looks like you tried to include sample data in your post. Please edit that to put "code" tags around the data. In fact, you could include an adequate sample of data as part of your posted code, like this:

(start with a <code> tag, of course, or just <c>) # your sample code # goes here… (but please just include what you're actually using) # when it's time to read the sample data in your code, do this: while (<DATA>) { # do stuff with the sample data } # more code… enough to exit cleanly with coherent output… __DATA__ your sample data goes here… (don't forget the closing "code" tag…)
Of course, if there are supposed to be two different input files, I think only one of them can be "faked" with the __DATA__ trick, and you'll need to post a separate "code" block for the other data file.

Replies are listed 'Best First'.
Re^2: Need Help with Maybe a Regex Issue
by sharkbait (Initiate) on Feb 27, 2014 at 15:11 UTC
    Thanks! Sorry the data files weren't formatted better, I'll fix that now. The first file is essentially a list of strings, and the second is a list of line-by-line strings.
Re^2: Need Help with Maybe a Regex Issue
by sharkbait (Initiate) on Feb 27, 2014 at 18:10 UTC

    Ok, I got it to work (many thank again to hdb, Kenosis, and graff). I found a format from the NCBI called a GFF file worked much better for parsing. However, I couldn't get the suggested hash function to work, so my updated question is how would I get this to work using a hash instead and more resemble a Perl script than a cobbled Perl/C/something construct? As it stands, it functions, but it's inefficient and not elegant. (Truncated) Sample files also attached. EDIT: I made one more modification, I finally got the hang of using a hash.

    ##gff-version 3 #!gff-spec-version 1.20 #!processor NCBI annotwriter ##sequence-region NC_001903.1 1 26498 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=2 +24326 NC_001903.1 RefSeq region 1 26498 . + . ID=id0 +;Name=cp26;Dbxref=taxon:224326;Is_circular=true;gbkey=Src;genome=plas +mid;mol_type=genomic DNA;plasmid-name=cp26;strain=B31 NC_001903.1 RefSeq gene 46 324 . + . ID=gene0; +Name=BB_B01;Dbxref=GeneID:1194411;gbkey=Gene;locus_tag=BB_B01 NC_001903.1 RefSeq CDS 46 324 . + 0 ID=cds0;Na +me=NP_046987.2;Parent=gene0;Note=catalyzes the hydrolysis of acylphos +phate;Dbxref=Genbank:NP_046987.2,GeneID:1194411;gbkey=CDS;product=acy +lphosphatase;protein_id=NP_046987.2;transl_table=11 NC_001903.1 RefSeq gene 308 751 . - . ID=gene1 +;Name=BB_B02;Dbxref=GeneID:1194420;gbkey=Gene;locus_tag=BB_B02 NC_001903.1 RefSeq CDS 308 751 . - 0 ID=cds1;N +ame=NP_046988.1;Parent=gene1;Note=hypothetical protein%3B identified +by Glimmer%3B putative;Dbxref=Genbank:NP_046988.1,GeneID:1194420;gbke +y=CDS;product=hypothetical protein;protein_id=NP_046988.1;transl_tabl +e=11 NC_001903.1 RefSeq gene 837 2186 . - . ID=gene +2;Name=BB_B03;Dbxref=GeneID:1194419;gbkey=Gene;locus_tag=BB_B03 NC_001903.1 RefSeq CDS 837 2186 . - 0 ID=cds2; +Name=NP_046989.1;Parent=gene2;Note=hypothetical protein%3B identified + by Glimmer%3B putative;Dbxref=Genbank:NP_046989.1,GeneID:1194419;gbk +ey=CDS;product=hypothetical protein;protein_id=NP_046989.1;transl_tab +le=11 NC_001903.1 RefSeq gene 2476 3798 . - . ID=gen +e3;Name=BB_B04;Dbxref=GeneID:1194410;gbkey=Gene;locus_tag=BB_B04 NC_001903.1 RefSeq CDS 2476 3798 . - 0 ID=cds3 +;Name=NP_046990.2;Parent=gene3;Note=similar to GB:U07818 PID:466474 S +P:Q45400 percent identity: 36.93%3B identified by sequence similarity +%3B putative;Dbxref=Genbank:NP_046990.2,GeneID:1194410;gbkey=CDS;prod +uct=chitibiose transporter protein ChbC;protein_id=NP_046990.2;transl +_table=11 NC_001903.1 RefSeq gene 4084 4431 . + . ID=gen +e4;Name=BB_B05;Dbxref=GeneID:1194409;gbkey=Gene;locus_tag=BB_B05 NC_001903.1 RefSeq CDS 4084 4431 . + 0 ID=cds4 +;Name=NP_046991.1;Parent=gene4;Note=similar to SP:P46319 PID:895750 P +ID:1783268 GB:AL009126 percent identity: 27.62%3B identified by seque +nce similarity%3B putative;Dbxref=Genbank:NP_046991.1,GeneID:1194409; +gbkey=CDS;product=chitibiose transporter protein ChbA;protein_id=NP_0 +46991.1;transl_table=11 NC_001903.1 RefSeq gene 4482 4757 . + . ID=gen +e5;Name=BB_B06;Dbxref=GeneID:1194408;gbkey=Gene;locus_tag=BB_B06 NC_001903.1 RefSeq CDS 4482 4757 . + 0 ID=cds5 +;Name=NP_046992.2;Parent=gene5;Note=similar to SP:P46318 PID:895748 P +ID:1783266 GB:AL009126 percent identity: 46.32%3B identified by seque +nce similarity%3B putative;Dbxref=Genbank:NP_046992.2,GeneID:1194408; +gbkey=CDS;product=chitibiose transporter protein ChbB;protein_id=NP_0 +46992.2;transl_table=11 NC_001903.1 RefSeq gene 4769 5866 . + . ID=gen +e6;Name=BB_B07;Dbxref=GeneID:1194407;gbkey=Gene;locus_tag=BB_B07 NC_001903.1 RefSeq CDS 4769 5866 . + 0 ID=cds6 +;Name=NP_046993.1;Parent=gene6;Note=similar to GB:M88764 SP:Q09090 PI +D:469166 PID:1199570 percent identity: 21.29%3B identified by sequenc +e similarity%3B putative;Dbxref=Genbank:NP_046993.1,GeneID:1194407;gb +key=CDS;product=alpha3-beta1 integrin-binding protein;protein_id=NP_0 +46993.1;transl_table=11 NC_001903.1 RefSeq gene 5888 6517 . - . ID=gen +e7;Name=BB_B08;Dbxref=GeneID:1194418;gbkey=Gene;locus_tag=BB_B08;pseu +do=true NC_001903.1 RefSeq gene 6677 7714 . + . ID=gen +e8;Name=BB_B09;Dbxref=GeneID:1194417;gbkey=Gene;locus_tag=BB_B09 NC_001903.1 RefSeq CDS 6677 7714 . + 0 ID=cds7 +;Name=NP_046995.1;Parent=gene8;Note=hypothetical protein%3B identifie +d by Glimmer%3B putative;Dbxref=Genbank:NP_046995.1,GeneID:1194417;gb +key=CDS;product=hypothetical protein;protein_id=NP_046995.1;transl_ta +ble=11 NC_001903.1 RefSeq gene 7836 8765 . + . ID=gen +e9;Name=BB_B10;Dbxref=GeneID:1194406;gbkey=Gene;locus_tag=BB_B10 NC_001903.1 RefSeq CDS 7836 8765 . + 0 ID=cds8 +;Name=NP_046996.1;Parent=gene9;Note=similar to GP:1655797 percent ide +ntity: 33.68%3B identified by sequence similarity%3B putative;Dbxref= +Genbank:NP_046996.1,GeneID:1194406;gbkey=CDS;product=hypothetical pro +tein;protein_id=NP_046996.1;transl_table=11 NC_001903.1 RefSeq gene 8781 9299 . + . ID=gen +e10;Name=BB_B11;Dbxref=GeneID:1194405;gbkey=Gene;locus_tag=BB_B11;pse +udo=true NC_001903.1 RefSeq gene 9275 10036 . + . ID=ge +ne11;Name=BB_B12;Dbxref=GeneID:1194404;gbkey=Gene;locus_tag=BB_B12 NC_001903.1 RefSeq CDS 9275 10036 . + 0 ID=cds +9;Name=NP_046998.1;Parent=gene11;Note=similar to GP:2182756 percent i +dentity: 46.00%3B identified by sequence similarity%3B putative;Dbxre +f=Genbank:NP_046998.1,GeneID:1194404;gbkey=CDS;product=hypothetical p +rotein;protein_id=NP_046998.1;transl_table=11 NC_001903.1 RefSeq gene 10104 10652 . + . ID=g +ene12;Name=BB_B13;Dbxref=GeneID:1194403;gbkey=Gene;locus_tag=BB_B13 NC_001903.1 RefSeq CDS 10104 10652 . + 0 ID=cd +s10;Name=NP_046999.1;Parent=gene12;Note=similar to GB:U03641 PID:4582 +18 percent identity: 42.59%3B identified by sequence similarity%3B pu +tative;Dbxref=Genbank:NP_046999.1,GeneID:1194403;gbkey=CDS;product=hy +pothetical protein;protein_id=NP_046999.1;transl_table=11 NC_001903.1 RefSeq gene 10920 11417 . - . ID=g +ene13;Name=BB_B14;Dbxref=GeneID:1194402;gbkey=Gene;locus_tag=BB_B14
    BB_B10 BB_B29 BB_B18 BB_B13 BB_B14 BB_B12 BB_B04 BB_B16 BB_B22 BB_B17 BB_B27 BB_B19 BB_B07 BB_B23 BB_B09 BB_B02 BB_B28 BB_B24 BB_B03 BB_B05 BB_B06
    #!/usr/bin/perl -w use Data::Dumper; my @arrayOfVals; open(my $tmp, "<", "/Users/bioinformatics/Desktop/NC_001903.gff.txt") +|| die "Could not open $!"; LINE: while (<$tmp>) { chomp; next LINE if /^#/; # discard header, unneeded and will interfere push(@arrayOfVals, $_); } close($tmp); # input each line as an array my @tmpArray; open (my $arrVal, "<", "/Users/bioinformatics/Desktop/cp26_dff.txt") | +| die "Could not open $!"; while (<$arrVal>) { chomp; push(@tmpArray, $_); } close($arrVal); # same as above, each line of file is match string in array my @tmpDictKeyArray; my @tmpDictValueArray; for (my $k=0; $k<$#arrayOfVals; $k++) { $gffCompare = $arrayOfVals[$k]; $gffDescription = $arrayOfVals[$k+1]; # rationale: $gffCompare is the compare String, # and the information I need will always follow one entry after in + the array # if match is TRUE, then entry+1 will display the information for (my $j=0; $j<$#tmpArray; $j++) { $tmpArrayVal = $tmpArray[$j]; if ($gffCompare =~ /$tmpArrayVal/) { if ($gffCompare =~ /.*;locus_tag=(.*)/) { push(@tmpDictKeyArray, $1); # Assign to array for hash } if ($gffDescription =~ /.*;Name=(.*);protein_/) { push(@tmpDictValueArray, $1); # see above } } } } my %hashArray; @hashArray{@tmpDictKeyArray} = @tmpDictValueArray; print Dumper(\%hashArray);
      Here is a version which may help. You do not need the next LINE bit, and the C-style loop hides a bug - you never test for the last item in @tmpArray (BB_B06).
      my @arrayOfVals; open(my $tmp, "<", "/Users/bioinformatics/Desktop/NC_001903.gff.txt") +or die "Could not open $!"; while (<$tmp>) { chomp; next if /^#/; # discard header, unneeded and will interfere push(@arrayOfVals, $_); } close($tmp); my %hashOfKeys; open (my $arrVal, "<", "/Users/bioinformatics/Desktop/cp26_dff.txt") o +r die "Could not open $!"; while (<$arrVal>) { chomp; $hashOfKeys{$_} = 1; } close($arrVal); my %hash; for my $k (0 .. $#arrayOfVals -1) { if ($arrayOfVals[$k] =~ /.*;locus_tag=(.*)/) { next unless $hashOfKeys{$1}; my $key = $1; if ($arrayOfVals[$k+1] =~ /.*;Name=(.*);protein_/) { $hash{$key} = $1; } } } print Dumper(\%hash);

        I didn't see that, many thanks!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1076344]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (3)
As of 2024-04-24 04:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found