perl script for extracting gene information from gff file

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Assalam o alaikum everyone ,, I want to extract genes information from annotation files (gff). I have gff files of different genomes downloaded from NCBI and two txt files 1 of them contains full path of gff files and 2nd contains list of genes which information have to extract from annotation files. $ cat annotation_files_path ~/home/usr/NCBI_Mammals_genomes/Nine_banded_armadillo/ ~/home/usr/NCBI_Mammals_genomes/Noth_American_deer_mouse/

#!usr/bin/perl
#
# read genes into array
#
my @ARGV = [];
my $genesFn = "Genes_list.txt";
open my $genesFh, "<", $genesFn or die "could not open genes file hand
+le!\n";
while (<$genesFh>) {
    chomp;
    push @ARGV, $_;
}


#
# write any matches between annotation line and pattern to results fil
+e
#
my $annotationsFn = "GCF_000298275.1_OryAfe1.0_genomic.gff";
my $resultsFn = "answer.gff";
open my $annotationsFh, "<", $annotationsFn or die "could not open ann
+otations file handle!\n";
open my $resultsFh, ">", $resultsFn or die "could not open handle to r
+esults file!\n";
while (<$annotationsFh>) {
    chomp;
    if ($_ =~ /ACMSD/ || $_ =~ /CRYM/ || $_ =~ /ARID1B/  {
        print $resultsFh " $_\n";
    }

close $resultsFh;
close $annotationsFh;
close $genesFh;
[download]

I have two problems now first one is that all genes information come in a single file but i want different output file for different genes and if gene not present it should be mentioned,, and second and major problem is that i have to work on about 100 annotation files and in annotation_file_path.txt i gave complete path of all annotation files but failed to execute it because script only work when a single annotation file use as shown above otherwise it print whole annotation _file_path.txt ,,, is there any way to open each path of annotation_file.txt one by one ??? kindly guide me ??? i'm a beginner so happy to give me more info..

Comment on perl script for extracting gene information from gff file Download Code

Replies are listed 'Best First'.
Re: perl script for extracting gene information from gff file by kcott (Archbishop) on Jun 07, 2017 at 07:50 UTC
There are a number of issues here: "i'm a beginner so happy to give me more info." — OK, that's fair enough, there are missing pieces of information that would help us to help you. These include things like: example input, actual output, expected output and diagnostic messages. See "How do I post a question effectively?" and "SSCCE" for details. "$ cat annotation_files_path ...": To be honest, I'm questioning whether "`cat`" was intended: two of the three arguments shown are directories. If "`cat`" was intended, where's the output? Please wrap commands and output in `<code>...</code>` tags: it's easier to read; output format is not lost; and, you don't need to do anything about special characters (e.g. there's no need to change '`<`' to '`<`'). The two pathnames starting with "`~/home/usr/`" look wrong: "`/home/usr/`", "`~usr/`", or just "`~/`" would be more usual. If it's not wrong (e.g. perhaps it expands to "`/home/your_uname/home/usr/`") you should mention this because, as it stands, it's confusing. There are a number of issues with your code: "i'm a beginner" — take a look at "perlintro - Perl introduction for beginners". The shebang line (`#!usr/bin/perl`) looks wrong: there should probably be a slash before `usr` (i.e. `#!/usr/bin/perl`). Always use the strict and warnings pragmata. If you followed the perlintro link (above), you'll know why. `@ARGV`: This is a special variable: see "perlvar - Perl predefined variables". You've declared a lexical variable (`my @ARGV`) with the same name: this is rarely, if ever, a good idea. It's an array variable but you've assigned a scalar value to it (`@ARGV = []`). While that works syntactically, it's unlikely to work as functionally intended. You possibly meant '`()`' instead of '`[]`'; however, that's redundant anyway. Having populated that array, you don't subsequently use it! My code below has examples of `@ARGV` usage and array declaration. The following code (`pm_1192233_multi_file_output.pl`) may do roughly what you want. If not, there should be sufficient examples of techniques to get you on the right track. #!/usr/bin/env perl use strict; use warnings; use autodie; use constant { GENE_FILE => 'pm_1192233_genes.txt', MISSING_FILE => 'pm_1192233_missing.txt', OUT_PREFIX => 'pm_1192233_out_', }; my @ano_files = @ARGV; my ($gene_list, $gene_re) = get_gene_data(GENE_FILE); for my $ano_file (@ano_files) { open my $fh, '<', $ano_file; while (<$fh>) { print { get_out_fh($1) } $_ if /$gene_re/; } } gen_missing_report($gene_list); { my %fh_for; sub get_out_fh { unless (exists $fh_for{$_[0]}) { open $fh_for{$_[0]}, '>', OUT_PREFIX . $_[0]; } return $fh_for{$_[0]}; } sub gen_missing_report { my ($genes) = @_; open my $fh, '>', MISSING_FILE; print $fh "--- START MISSING LIST ---\n"; print $fh "$_\n" for grep { ! exists $fh_for{$_} } @$genes; print $fh "--- END MISSING LIST ---\n"; } } sub get_gene_data { my ($file) = @_; my @list; open my $fh, '<', $file; push @list, $_ while <$fh>; chomp @list; @list = sort { length $b <=> length $a } @list; my $alt = join '\|', @list; return (\@list, qr{($alt)}); } [download] I created some dummy input: `$ cat pm_1192233_genes.txt ACMSD CRYM ARIB1B GENE_NOT_ANOTATED ALSO_ABSENT $ cat pm_1192233_anot_1.txt ACMSD 1 XXXX 1 CRYM 1 $ cat pm_1192233_anot_2.txt ARIB1B 2 YYYY 2 ACMSD 2 $ cat pm_1192233_anot_3.txt WWWW 3 CRYM 3 ARIB1B 3 ZZZZ 3` [download] I ran the script like this: $ pm_1192233_multi_file_output.pl pm_1192233_anot_* $ [download] These files were produced: `$ cat pm_1192233_out_ACMSD ACMSD 1 ACMSD 2 $ cat pm_1192233_out_ARIB1B ARIB1B 2 ARIB1B 3 $ cat pm_1192233_out_CRYM CRYM 1 CRYM 3 $ cat pm_1192233_missing.txt --- START MISSING LIST --- GENE_NOT_ANOTATED ALSO_ABSENT --- END MISSING LIST ---` [download] Feel free to post follow-up questions, but please adhere to the guidelines I linked to at the start. You might also consider registering so that we can tell you apart from others posting anonymously. — Ken	[reply] [d/l] [select]
Re^2: perl script for extracting gene information from gff file by Anonymous Monk on Jun 09, 2017 at 00:45 UTC
Thank you sooo much Ken for your reply ,, it's so helpful. :) Adeena	[reply]
Re: perl script for extracting gene information from gff file by huck (Prior) on Jun 07, 2017 at 03:00 UTC
You dont explain very well what you want. What do you mean by gene for instance I think this does most of what you are after. but it is a memory hog. use strict; use warnings; my @genes; my %genes_found; my $genesFn = "Genes_list.txt"; open my $genesFh, "<", $genesFn or die "could not open genes file hand +le $!\n"; while (my $gene=<$genesFh>) { chomp $gene; push @genes, $gene; } my $an_fn='annotation_files_path.txt'; open my $an,'<',$an_fn or die "cant open $an_fn $!"; while (my $annotationsFn =<$an>){ chomp $annotationsFn ; open my $gff,'<',$annotationsFn or die "cant open $annotationsFn $! +"; while (my $gff_line =<$gff>){ chomp $gff_line; for my $gene (@genes) { if (index($gff_line,$gene) != -1) { $genes_found{$gene}{$annotationsFn}=[] unless (defined ($genes +_found{$gene}{$annotationsFn})); push @{$genes_found{$gene}{$annotationsFn}},$gff_line; } # index } # gene } # gff_line close $gff; } # my $annotationsFn close $an; for my $gene (@genes) { unless ($genes_found{$gene}) { print "gene $gene not found \n";} else { my $resultsFn='result-gene-'.$gene.'.txt'; open my $resultsFh, ">", $resultsFn or die "could not open handle +to results file $!\n"; for my $annotationsFn (sort keys (%{$genes_found{$gene}})) { print $resultsFh 'from:'.$annotationsFn."\n"; for my $found (@{$genes_found{$gene}{$annotationsFn}}){ print $resultsFh 'line:'.$found."\n"; } # found } #$annotationsFn close $resultsFh; } # else } # gene [download]	[reply] [d/l]
Re^2: perl script for extracting gene information from gff file by Anonymous Monk on Jun 07, 2017 at 07:39 UTC
Thanks #Huck for your answer. :) Sorry for unclear description. I aim to extract different genes information from annotation file ( in gff format) according to Gene_LIST.txt file. for instance genes in my file are,, $ cat Gene_LIST.txt AMSD ARID1B CRYM	[reply]
Re: perl script for extracting gene information from gff file by hippo (Bishop) on Jun 07, 2017 at 09:14 UTC
In addition to the excellent post (++) by brother kcott I should point out just in case you were unaware that your code as posted doesn't even compile. `$ perl -cw 1192233.pl syntax error at 1192233.pl line 23, near "/ARID1B/ {" Missing right curly or square bracket at 1192233.pl line 29, at end of + line 1192233.pl had compilation errors.` [download] If you are seeking help with a particular piece of code it is always worth making sure that it compiles (unless the compilation itself is the problem) and that it is actually the code which you are running. As a final hint to help you prepare an SSCCE, consider that to make the script self-contained and to ensure that you are not simply reading the files incorrectly, why not replace any reading-from-files with hard-coded arrays containing no more than the first 3 or 4 lines from each file? That makes it much easier for others to run, avoids path, permissions, encoding and EOL issues and aids analysis because the data is right there in the code. Once you have that working you can replace each static set with data read from a file. Do this one at a time in case any such replacement brings in unexpected bugs.	[reply] [d/l]
Re^2: perl script for extracting gene information from gff file by Anonymous Monk on Jun 09, 2017 at 00:51 UTC
Thank you hippo ,, i will consider it and improve my mistakes actually this was my ever first post .. :)	[reply]


Your skill will accomplish what the force of many cannot
	PerlMonks