Perl-Sensitive Sunglasses PerlMonks

### intron length

by MBobur (Initiate)
 on Feb 27, 2013 at 06:43 UTC Need Help??
MBobur has asked for the wisdom of the Perl Monks concerning the following question:

I've got another data from where hopefully will be easier you guys to calculate intron length. It's "gtf" extension. Shortly it look like this.

```I    Cufflinks    transcript    218549    219145    1000    +    .
+gene_id "CUFF.54"; transcript_id "YAR062W"; FPKM "18.7631305676"; fra
+c "0.697479"; conf_lo "12.494853"; conf_hi "25.031408"; cov "15.86482

I    Cufflinks    exon    218549    219145    1000    +    .    gene_i
+d "CUFF.54"; transcript_id "YAR062W"; exon_number "1"; FPKM "18.76313
+05676"; frac "0.697479"; conf_lo "12.494853"; conf_hi "25.031408"; co
+v "15.864829";

I    Cufflinks    transcript    217157    217492    1000    -    .
+gene_id "CUFF.55"; transcript_id "YAR060C"; FPKM "23.1145085280"; fra
+c "0.302521"; conf_lo "11.245570"; conf_hi "34.983447"; cov "19.54405

I    Cufflinks    exon    217157    217492    1000    -    .    gene_i
+d "CUFF.55"; transcript_id "YAR060C"; exon_number "1"; FPKM "23.11450
+85280"; frac "0.302521"; conf_lo "11.245570"; conf_hi "34.983447"; co
+v "19.544059";

II    Cufflinks    transcript    7605    7733    1000    -    .    gen
+e_id "YBL108C-A"; transcript_id "YBL108C-A"; FPKM "18.0747134240"; fr
+ac "1.000000"; conf_lo "0.000000"; conf_hi "90.373567"; cov "8.950119

II    Cufflinks    exon    7605    7733    1000    -    .    gene_id "
+YBL108C-A"; transcript_id "YBL108C-A"; exon_number "1"; FPKM "18.0747
+134240"; frac "1.000000"; conf_lo "0.000000"; conf_hi "90.373567"; co
+v "8.950119";

II    Cufflinks    transcript    8177    8482    1    +    .    gene_i
+d "YBL108W"; transcript_id "YBL108W"; FPKM "0.0000000000"; frac "0.00
+0000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; full_r

II    Cufflinks    exon    8177    8482    1    +    .    gene_id "YBL
+108W"; transcript_id "YBL108W"; exon_number "1"; FPKM "0.0000000000";
+ frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.0000
+00";

II    Cufflinks    transcript    9268    9372    1    +    .    gene_i
+d "YBL107W-A"; transcript_id "YBL107W-A"; FPKM "0.0000000000"; frac "
+0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; fu

II    Cufflinks    exon    9268    9372    1    +    .    gene_id "YBL
+107W-A"; transcript_id "YBL107W-A"; exon_number "1"; FPKM "0.00000000
+00"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.
+000000";

II    Cufflinks    transcript    9583    9666    1    +    .    gene_i
+d "tL(UAA)B1"; transcript_id "tL(UAA)B1"; FPKM "0.0000000000"; frac "
+0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; fu

II    Cufflinks    exon    9583    9666    1    +    .    gene_id "tL(
+UAA)B1"; transcript_id "tL(UAA)B1"; exon_number "1"; FPKM "0.00000000
+00"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.
+000000";

I    Cufflinks    transcript    227742    228953    1000    +    .
+gene_id "CUFF.57"; transcript_id "YAR073W"; FPKM "5.8852029015"; frac
+ "0.892308"; conf_lo "3.699493"; conf_hi "8.070912"; cov "3.517134";

I    Cufflinks    exon    227742    228953    1000    +    .    gene_i
+d "CUFF.57"; transcript_id "YAR073W"; exon_number "1"; FPKM "5.885202
+9015"; frac "0.892308"; conf_lo "3.699493"; conf_hi "8.070912"; cov "
+3.517134";

II    Cufflinks    transcript    142112    142868    1000    -    .
+ gene_id "YBL040C"; transcript_id "YBL040C"; FPKM "246.2287249186"; f
+rac "0.288110"; conf_lo "225.158166"; conf_hi "267.299284"; cov "208.

II    Cufflinks    exon    142112    142749    1000    -    .    gene_
+id "YBL040C"; transcript_id "YBL040C"; exon_number "1"; FPKM "246.228
+7249186"; frac "0.288110"; conf_lo "225.158166"; conf_hi "267.299284"
+; cov "208.319539";

II    Cufflinks    exon    142847    142868    1000    -    .    gene_
+id "YBL040C"; transcript_id "YBL040C"; exon_number "2"; FPKM "246.228
+7249186"; frac "0.288110"; conf_lo "225.158166"; conf_hi "267.299284"
+; cov "208.319539";
Each lines starts with "I", "II" and etc. Here they may seen written in several lines. After each transcript there comes exon. I need to minus total exon length from transcription length to get intron length in most cases it's equal to zero but some how a few transcripts include couple of exons between which can be found gaps, which will be intron length.

Replies are listed 'Best First'.
Re: intron length
by marto (Bishop) on Feb 27, 2013 at 09:09 UTC

Thank you for your response. I got script which calculates full sequence length in fasta file format. But I want add script which will calculate intron length, which is "intron length = full length - exon length". Exon length = Transcription length.

```
#!/usr/bin/perl

use strict;
use warnings;

my \$fastaName = "example2.fasta";

open FASTA, \$fastaName
or die "cannot open \$fastaName\n";

my \$sequence = "";
my \$sequencelength;

while ( my \$line = <FASTA> ) {
chomp(\$line);
if ( substr( \$line, 0, 1 ) eq ">" ) {
if ( length(\$sequence) > 1 ) {

#Calculate sequence lenght
\$sequencelength = length(\$sequence);

print "Sequence length: \$sequencelength";
print "\n";

\$sequence = "";
}
print "\$line\n";
} else {
\$sequence .= \$line;
}
}

#Calculate sequence lenght for last sequence
\$sequencelength = length(\$sequence);

#Print sequence lenght
print "Sequence length: \$sequencelength";
print "\n";

"But I want add script which will calculate intron length, which is "intron length = full length - exon length". Exon length = Transcription length."

Your code makes no mention of intron, exon or transcription. FASTA_format makes no mention of these eiter. I, like most people am not a bioinformatician. You're either going to have to describe your problem better (see links I previously gave which describe this in detail, direct link) or wait for someone who is familiar with terms you're using who is willing to help.

It isn't clear what you are using here to determine the intron lengths. What was initially "posted" was output from cufflinks, which has your gene/transcript information and FPKM scores for each. Your newly posted script reads in a fasta file and determines the sequence length for each entry in the fasta file. Easy enough. The introns aren't marked in a fasta file, so I'm guessing that you'll use transcript information from the refFlat file from the UCSC genome browser or ensembl, etc. If you wanted to know the length of all exons combined for a given transcript (and ignoring any splicing variants, etc.) then you'll want to use the refFlat.txt file that can be downloaded from UCSC. It's easy to parse, and you can use the table browser to help figure out what values are in what columns (it notes where each exon begins and ends, for instance). You can import the data into a hash with the gene symbol or the accession number as a key and then calculate the exon/intron lengths for only the transcripts that you are interested in.

In the future, I'd try to post a bit more information and be careful to format it better on the site. People here are willing to help, but are less likely to do so if it annoys them. I can look at what you posted and see exactly what you are doing because I work with this type of data all day long; others may not but still have invaluable input in writing your scripts properly, so try to help them get on board. Good luck!

Bioinformatics
Re: intron length
by sundialsvc4 (Abbot) on Feb 27, 2013 at 14:10 UTC

A .GTF file?   Well, a tiny bit of Googling sent me to this page at this very interesting-looking site:   http://www.bioperl.org/wiki/GTF.

In fact, the first page of my Google search pointed to an abundance of pages, and Perl packages including Bio::FeatureIO::GTF which ... although I am not a biologist ... do strongly suggest to me that the problem you are trying to solve is one that has been thoroughly solved before.   That is usually the case with Perl.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://1020810]
Approved by Corion
help
Chatterbox?
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (6)
As of 2017-08-18 10:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?
Who is your favorite scientist and why?

Results (299 votes). Check out past polls.

Notices?