Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

gene location

by nica (Novice)
on Jun 12, 2019 at 13:38 UTC ( #11101284=perlquestion: print w/replies, xml ) Need Help??

nica has asked for the wisdom of the Perl Monks concerning the following question:

I need to extract from a gbff:
gene complement(<5504..>6553) /locus_tag="OT_ostta02g00030" /old_locus_tag="Ot02g00030" /db_xref="GeneID:9832574"
the GeneID related to its coordinates $start = 5504 $stop = 6553 There are a lot of genes so I want to make a loop I use Perl for my code, you can see below a part of it (extracting the numbers) but it doesn't work.
$infile = "GCF.gbff"; $outfile = "a1.txt"; open(IN, $infile) or die "Failed to open $infile\n"; open (OUT, ">$outfile") or die "Failed to open $outfile\n"; $n=0; while ($line = <IN>) { chomp ($line); $line =~ s/\r$//; if ( $line =~ /<\d+..>\d+/ ) { ++$n; my ($start, $stop) = ($1, $2); print OUT "$start, $stop, $header"; } } print STDERR"$n"; sub header { @geneID = split /\s+/, $start, $stop; $header = "$fastaprompt$geneID[1]"; } close (IN); close (OUT);

Replies are listed 'Best First'.
Re: gene location
by Fletch (Chancellor) on Jun 12, 2019 at 13:44 UTC

    Given the apparent problem domain you might look at BioPerl rather than reinventing wheels.

    That being said it jumps out that your regex /\d+..\d+/ won't match the text "<5504..>6553" because you're not allowing for the less/greater before the numbers.

    Also: You might want to clean up your formatting because it's entirely unclear from the indentation and nesting where your loops and if statements are beginning and ending (c.f. Perl::Tidy)

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Actually since the dots are not escaped in the regex, they can match any character, including numbers. So /\d+..\d+/ will match 5504, 02g0, and 9832. Each line will match.

      yes, you are right, I forgot to write them here, in my program they are, but anyway all that I can find in my output file is a lot of commas

        Ah, derp. I blame the bad formatting for missing that. Yeah, that's it.

        At any rate that's because your regex isn't capturing anything so $1 and $2 don't have anything in them. Again, known format don't reinvent wheels you don't have to; but for something quick and dirty . . .

        if( my( $start, $stop ) = $line =~ m{ < (\d+) \.\. > (\d+) }x ) { $n++; print OUT qq{$start, $stop, $header\n}; }

        The cake is a lie.
        The cake is a lie.
        The cake is a lie.

Re: gene location
by betmatt (Acolyte) on Jun 12, 2019 at 20:33 UTC
    You might find that EMBOSS is a better option than BioPerl for this sort of thing. It is a sequence analysis toolkit. ( ) You can do system calls on the command line to activate EMBOSS programs. All that can be done from within a Perl program, with variables inserted into the system call. You are less likely to come across errors or bugs. Sometimes you will be able to work instead with simple flat files. There are many different programs that can be called.

    Update _____________

    The data that you are using looks like it could be generated by hand which means errors or inconsistencies. This makes working with basic flat files all the more important as you can check the results as you go along.
Re: gene location
by karlgoethebier (Monsignor) on Jun 12, 2019 at 18:15 UTC

    Isn't Bio::DB::GenBank made for stuff like this? Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11101284]
Front-paged by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (6)
As of 2019-06-25 20:32 GMT
Find Nodes?
    Voting Booth?
    Is there a future for codeless software?

    Results (107 votes). Check out past polls.