Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re^7: Comparing adjacent lines

by daccame (Initiate)
on Jul 13, 2012 at 16:02 UTC ( [id://981654]=note: print w/replies, xml ) Need Help??


in reply to Re^6: Comparing adjacent lines
in thread Comparing adjacent lines

Remember I am new at Perl, so my approach is probably pretty brutish.

Input:

==> igenome.genes.onlyexons.uniq.gtf <== chr1 unknown exon1 815 4692 . - . gene_id "LOC +653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id " +TSS10265"; chr1 unknown exon2 4833 4901 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon3 5659 5810 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon4 6470 6628 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon5 6717 6918 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon6 7096 7231 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon7 7469 7605 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon8 7778 7924 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon9 8131 8229 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon10 8776 8938 . - . gene_id "L +OC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id + "TSS10265"; chr1 unknown exon11 14601 14754 . - . gene_id +"LOC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_ +id "TSS10265"; chr1 unknown exon12 19184 19919 . - . gene_id +"LOC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_ +id "TSS10265"; chr1 unknown exon13 58954 59871 . + . gene_id +"OR4F5"; gene_name "OR4F5"; p_id "P1363"; transcript_id "NM_001005484 +.1"; tss_id "TSS13645"; chr1 unknown exon14 77385 80096 . + . gene_id +"LOC100132632"; gene_name "LOC100132632"; p_id "P27379"; transcript_i +d "XM_001724183.1"; tss_id "TSS10177"; chr1 unknown exon15 110381 110795 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon16 114643 114701 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon17 118918 119086 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon18 123237 130714 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon19 123324 130714 . - . gene_i +d "LOC653340"; gene_name "LOC653340"; transcript_id "XR_039243.1"; ts +s_id "TSS5420"; chr1 unknown exon20 129326 129559 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon21 129653 129710 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon22 132810 132874 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon23 133719 134341 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon24 217633 218641 . - . gene_i +d "LOC728481"; gene_name "LOC728481"; transcript_id "XR_015292.2"; ts +s_id "TSS13278"; chr1 unknown exon25 313133 319752 . + . gene_i +d "LOC100132287"; gene_name "LOC100132287"; transcript_id "XR_039254. +1"; tss_id "TSS18904";
#!/usr/bin/perl -w use strict; my $previous = ''; my $exon = 1; while (<>) { chomp; my(@gtf) = split; chop($gtf[8] = join ":", @gtf[8,9]); chop($gtf[10] = join ":", @gtf[10,11]); chop($gtf[12] = join ":", @gtf[12,13]); # Restart exon# for different gene names if ( $gtf[9] ne $previous ) { $gtf[2] = 'exon1'; } elsif ( $gtf[9] eq $previous ) { $exon =~ s/(\d+)/$1+1/e; } $previous = $gtf[9]; $exon = $gtf[2]; $gtf[8] = join "_", @gtf[8,10,12,2]; my $bed = join "\t", @gtf[0,3,4,8]; print "$bed\n"; }

Here's 25 lines of the output (I'll bold the lines where the problem appears and simplify the data for formatting by removing array items 10 and 12):

chr1 815 4692 gene_id:"LOC653635"_exon1

chr1 4833 4901 gene_id:"LOC653635"_exon2

chr1 5659 5810 gene_id:"LOC653635"_exon3

chr1 6470 6628 gene_id:"LOC653635"_exon4

chr1 6717 6918 gene_id:"LOC653635"_exon5

chr1 7096 7231 gene_id:"LOC653635"_exon6

chr1 7469 7605 gene_id:"LOC653635"_exon7

chr1 7778 7924 gene_id:"LOC653635"_exon8

chr1 8131 8229 gene_id:"LOC653635"_exon9

chr1 8776 8938 gene_id:"LOC653635"_exon10

chr1 14601 14754 gene_id:"LOC653635"_exon11

chr1 19184 19919 gene_id:"LOC653635"_exon12

chr1 58954 59871 gene_id:"OR4F5"_exon1

chr1 77385 80096 gene_id:"LOC100132632"_exon1

chr1 110381 110795 gene_id:"LOC643670"_exon1

chr1 114643 114701 gene_id:"LOC729737"_exon1

chr1 118918 119086 gene_id:"LOC643670"_exon1

chr1 123237 130714 gene_id:"LOC643670"_exon18

chr1 123324 130714 gene_id:"LOC653340"_exon1

chr1 129326 129559 gene_id:"LOC729737"_exon1

chr1 129653 129710 gene_id:"LOC729737"_exon21

chr1 132810 132874 gene_id:"LOC729737"_exon22

chr1 133719 134341 gene_id:"LOC729737"_exon23

chr1 217633 218641 gene_id:"LOC728481"_exon1

chr1 313133 319752 gene_id:"LOC100132287"_exon1

The bolded lines should count up from 1, for example, "exon18" should read "exon2" and exon21, exon 22, exon23 should read exon2, exon3, exon4.

Replies are listed 'Best First'.
Re^8: Comparing adjacent lines
by aaron_baugher (Curate) on Jul 16, 2012 at 19:52 UTC

    The problem here is that you're getting your $exon variable confused with $gtf[2]. You're using $exon to hold the counter which you increment on matching lines, but then you clobber it by assigning $gtf[2] to it after that, but then you don't use it after that. Also, if you're going to use a counter, you don't need to use a regex to increment the digit part of the field. Also, there's no need for an elsif when there are only two possibilities (eq and ne).

    if ( $gtf[9] ne $previous ){ $exon = 1; # reset counter } else { $exon++; # increment counter } $gtf[2] = "exon$exon"; # put counter back in field

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      Perfect!! Thank you.

      Most of those things you pointed out were an accumulation of different things I tried...and then forgot to clean up afterwards. You have been incredibly helpful, thanks again.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://981654]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (3)
As of 2024-04-25 09:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found