http://www.perlmonks.org?node_id=981452


in reply to Re^4: Comparing adjacent lines
in thread Comparing adjacent lines

Right you are, my apologies. The ex values that are not "ex1" (ex1 is produced at a mismatch in the comparison) revert to the default value of ex# (for example, ex18 or ex30) instead of counting up from 1. Basically, what I think is happening is ex1 is not passed on to the else part, since ex values get clobbered after one iteration of the loop. I've been trying variations of an ex variable external to the loop, but can't seem to figure it out.

Replies are listed 'Best First'.
Re^6: Comparing adjacent lines
by aaron_baugher (Curate) on Jul 13, 2012 at 02:02 UTC

    I'm still having trouble understanding what you mean. Please post your code, along with about a dozen lines of your input data that demonstrate the problem, as well as what you would like the output from those lines to be.

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      Remember I am new at Perl, so my approach is probably pretty brutish.

      Input:

      ==> igenome.genes.onlyexons.uniq.gtf <== chr1 unknown exon1 815 4692 . - . gene_id "LOC +653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id " +TSS10265"; chr1 unknown exon2 4833 4901 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon3 5659 5810 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon4 6470 6628 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon5 6717 6918 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon6 7096 7231 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon7 7469 7605 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon8 7778 7924 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon9 8131 8229 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon10 8776 8938 . - . gene_id "L +OC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id + "TSS10265"; chr1 unknown exon11 14601 14754 . - . gene_id +"LOC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_ +id "TSS10265"; chr1 unknown exon12 19184 19919 . - . gene_id +"LOC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_ +id "TSS10265"; chr1 unknown exon13 58954 59871 . + . gene_id +"OR4F5"; gene_name "OR4F5"; p_id "P1363"; transcript_id "NM_001005484 +.1"; tss_id "TSS13645"; chr1 unknown exon14 77385 80096 . + . gene_id +"LOC100132632"; gene_name "LOC100132632"; p_id "P27379"; transcript_i +d "XM_001724183.1"; tss_id "TSS10177"; chr1 unknown exon15 110381 110795 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon16 114643 114701 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon17 118918 119086 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon18 123237 130714 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon19 123324 130714 . - . gene_i +d "LOC653340"; gene_name "LOC653340"; transcript_id "XR_039243.1"; ts +s_id "TSS5420"; chr1 unknown exon20 129326 129559 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon21 129653 129710 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon22 132810 132874 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon23 133719 134341 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon24 217633 218641 . - . gene_i +d "LOC728481"; gene_name "LOC728481"; transcript_id "XR_015292.2"; ts +s_id "TSS13278"; chr1 unknown exon25 313133 319752 . + . gene_i +d "LOC100132287"; gene_name "LOC100132287"; transcript_id "XR_039254. +1"; tss_id "TSS18904";
      #!/usr/bin/perl -w use strict; my $previous = ''; my $exon = 1; while (<>) { chomp; my(@gtf) = split; chop($gtf[8] = join ":", @gtf[8,9]); chop($gtf[10] = join ":", @gtf[10,11]); chop($gtf[12] = join ":", @gtf[12,13]); # Restart exon# for different gene names if ( $gtf[9] ne $previous ) { $gtf[2] = 'exon1'; } elsif ( $gtf[9] eq $previous ) { $exon =~ s/(\d+)/$1+1/e; } $previous = $gtf[9]; $exon = $gtf[2]; $gtf[8] = join "_", @gtf[8,10,12,2]; my $bed = join "\t", @gtf[0,3,4,8]; print "$bed\n"; }

      Here's 25 lines of the output (I'll bold the lines where the problem appears and simplify the data for formatting by removing array items 10 and 12):

      chr1 815 4692 gene_id:"LOC653635"_exon1

      chr1 4833 4901 gene_id:"LOC653635"_exon2

      chr1 5659 5810 gene_id:"LOC653635"_exon3

      chr1 6470 6628 gene_id:"LOC653635"_exon4

      chr1 6717 6918 gene_id:"LOC653635"_exon5

      chr1 7096 7231 gene_id:"LOC653635"_exon6

      chr1 7469 7605 gene_id:"LOC653635"_exon7

      chr1 7778 7924 gene_id:"LOC653635"_exon8

      chr1 8131 8229 gene_id:"LOC653635"_exon9

      chr1 8776 8938 gene_id:"LOC653635"_exon10

      chr1 14601 14754 gene_id:"LOC653635"_exon11

      chr1 19184 19919 gene_id:"LOC653635"_exon12

      chr1 58954 59871 gene_id:"OR4F5"_exon1

      chr1 77385 80096 gene_id:"LOC100132632"_exon1

      chr1 110381 110795 gene_id:"LOC643670"_exon1

      chr1 114643 114701 gene_id:"LOC729737"_exon1

      chr1 118918 119086 gene_id:"LOC643670"_exon1

      chr1 123237 130714 gene_id:"LOC643670"_exon18

      chr1 123324 130714 gene_id:"LOC653340"_exon1

      chr1 129326 129559 gene_id:"LOC729737"_exon1

      chr1 129653 129710 gene_id:"LOC729737"_exon21

      chr1 132810 132874 gene_id:"LOC729737"_exon22

      chr1 133719 134341 gene_id:"LOC729737"_exon23

      chr1 217633 218641 gene_id:"LOC728481"_exon1

      chr1 313133 319752 gene_id:"LOC100132287"_exon1

      The bolded lines should count up from 1, for example, "exon18" should read "exon2" and exon21, exon 22, exon23 should read exon2, exon3, exon4.

        The problem here is that you're getting your $exon variable confused with $gtf[2]. You're using $exon to hold the counter which you increment on matching lines, but then you clobber it by assigning $gtf[2] to it after that, but then you don't use it after that. Also, if you're going to use a counter, you don't need to use a regex to increment the digit part of the field. Also, there's no need for an elsif when there are only two possibilities (eq and ne).

        if ( $gtf[9] ne $previous ){ $exon = 1; # reset counter } else { $exon++; # increment counter } $gtf[2] = "exon$exon"; # put counter back in field

        Aaron B.
        Available for small or large Perl jobs; see my home node.