Re^5: Comparing adjacent lines

in reply to Re^4: Comparing adjacent lines
in thread Comparing adjacent lines

Right you are, my apologies. The ex values that are not "ex1" (ex1 is produced at a mismatch in the comparison) revert to the default value of ex# (for example, ex18 or ex30) instead of counting up from 1. Basically, what I think is happening is ex1 is not passed on to the else part, since ex values get clobbered after one iteration of the loop. I've been trying variations of an ex variable external to the loop, but can't seem to figure it out.

Comment on Re^5: Comparing adjacent lines Download Code

Replies are listed 'Best First'.
Re^6: Comparing adjacent lines by aaron_baugher (Curate) on Jul 13, 2012 at 02:02 UTC
I'm still having trouble understanding what you mean. Please post your code, along with about a dozen lines of your input data that demonstrate the problem, as well as what you would like the output from those lines to be. Aaron B. Available for small or large Perl jobs; see my home node.	[reply]
Re^7: Comparing adjacent lines by daccame (Initiate) on Jul 13, 2012 at 16:02 UTC
Remember I am new at Perl, so my approach is probably pretty brutish. Input: ==> igenome.genes.onlyexons.uniq.gtf <== chr1 unknown exon1 815 4692 . - . gene_id "LOC +653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id " +TSS10265"; chr1 unknown exon2 4833 4901 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon3 5659 5810 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon4 6470 6628 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon5 6717 6918 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon6 7096 7231 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon7 7469 7605 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon8 7778 7924 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon9 8131 8229 . - . gene_id "LO +C653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id +"TSS10265"; chr1 unknown exon10 8776 8938 . - . gene_id "L +OC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_id + "TSS10265"; chr1 unknown exon11 14601 14754 . - . gene_id +"LOC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_ +id "TSS10265"; chr1 unknown exon12 19184 19919 . - . gene_id +"LOC653635"; gene_name "LOC653635"; transcript_id "XR_017611.2"; tss_ +id "TSS10265"; chr1 unknown exon13 58954 59871 . + . gene_id +"OR4F5"; gene_name "OR4F5"; p_id "P1363"; transcript_id "NM_001005484 +.1"; tss_id "TSS13645"; chr1 unknown exon14 77385 80096 . + . gene_id +"LOC100132632"; gene_name "LOC100132632"; p_id "P27379"; transcript_i +d "XM_001724183.1"; tss_id "TSS10177"; chr1 unknown exon15 110381 110795 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon16 114643 114701 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon17 118918 119086 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon18 123237 130714 . - . gene_i +d "LOC643670"; gene_name "LOC643670"; transcript_id "XR_039242.1"; ts +s_id "TSS5420"; chr1 unknown exon19 123324 130714 . - . gene_i +d "LOC653340"; gene_name "LOC653340"; transcript_id "XR_039243.1"; ts +s_id "TSS5420"; chr1 unknown exon20 129326 129559 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon21 129653 129710 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon22 132810 132874 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon23 133719 134341 . - . gene_i +d "LOC729737"; gene_name "LOC729737"; p_id "P4208"; transcript_id "XM +_001133863.1"; tss_id "TSS9238"; chr1 unknown exon24 217633 218641 . - . gene_i +d "LOC728481"; gene_name "LOC728481"; transcript_id "XR_015292.2"; ts +s_id "TSS13278"; chr1 unknown exon25 313133 319752 . + . gene_i +d "LOC100132287"; gene_name "LOC100132287"; transcript_id "XR_039254. +1"; tss_id "TSS18904"; [download] `#!/usr/bin/perl -w use strict; my $previous = ''; my $exon = 1; while (<>) { chomp; my(@gtf) = split; chop($gtf[8] = join ":", @gtf[8,9]); chop($gtf[10] = join ":", @gtf[10,11]); chop($gtf[12] = join ":", @gtf[12,13]); # Restart exon# for different gene names if ( $gtf[9] ne $previous ) { $gtf[2] = 'exon1'; } elsif ( $gtf[9] eq $previous ) { $exon =~ s/(\d+)/$1+1/e; } $previous = $gtf[9]; $exon = $gtf[2]; $gtf[8] = join "_", @gtf[8,10,12,2]; my $bed = join "\t", @gtf[0,3,4,8]; print "$bed\n"; }` [download] Here's 25 lines of the output (I'll bold the lines where the problem appears and simplify the data for formatting by removing array items 10 and 12): chr1 815 4692 gene_id:"LOC653635"_exon1 chr1 4833 4901 gene_id:"LOC653635"_exon2 chr1 5659 5810 gene_id:"LOC653635"_exon3 chr1 6470 6628 gene_id:"LOC653635"_exon4 chr1 6717 6918 gene_id:"LOC653635"_exon5 chr1 7096 7231 gene_id:"LOC653635"_exon6 chr1 7469 7605 gene_id:"LOC653635"_exon7 chr1 7778 7924 gene_id:"LOC653635"_exon8 chr1 8131 8229 gene_id:"LOC653635"_exon9 chr1 8776 8938 gene_id:"LOC653635"_exon10 chr1 14601 14754 gene_id:"LOC653635"_exon11 chr1 19184 19919 gene_id:"LOC653635"_exon12 chr1 58954 59871 gene_id:"OR4F5"_exon1 chr1 77385 80096 gene_id:"LOC100132632"_exon1 chr1 110381 110795 gene_id:"LOC643670"_exon1 chr1 114643 114701 gene_id:"LOC729737"_exon1 chr1 118918 119086 gene_id:"LOC643670"_exon1 chr1 123237 130714 gene_id:"LOC643670"_exon18 chr1 123324 130714 gene_id:"LOC653340"_exon1 chr1 129326 129559 gene_id:"LOC729737"_exon1 chr1 129653 129710 gene_id:"LOC729737"_exon21 chr1 132810 132874 gene_id:"LOC729737"_exon22 chr1 133719 134341 gene_id:"LOC729737"_exon23 chr1 217633 218641 gene_id:"LOC728481"_exon1 chr1 313133 319752 gene_id:"LOC100132287"_exon1 The bolded lines should count up from 1, for example, "exon18" should read "exon2" and exon21, exon 22, exon23 should read exon2, exon3, exon4.	[reply] [d/l] [select]
Re^8: Comparing adjacent lines by aaron_baugher (Curate) on Jul 16, 2012 at 19:52 UTC
The problem here is that you're getting your `$exon` variable confused with `$gtf[2]`. You're using `$exon` to hold the counter which you increment on matching lines, but then you clobber it by assigning `$gtf[2]` to it after that, but then you don't use it after that. Also, if you're going to use a counter, you don't need to use a regex to increment the digit part of the field. Also, there's no need for an elsif when there are only two possibilities (eq and ne). `if ( $gtf[9] ne $previous ){ $exon = 1; # reset counter } else { $exon++; # increment counter } $gtf[2] = "exon$exon"; # put counter back in field` [download] Aaron B. Available for small or large Perl jobs; see my home node.	[reply] [d/l] [select]
Re^9: Comparing adjacent lines by daccame (Initiate) on Jul 16, 2012 at 22:06 UTC

In Section Seekers of Perl Wisdom