Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re: Complicated pattern match

by scain (Curate)
on Jan 21, 2003 at 04:03 UTC ( #228578=note: print w/ replies, xml ) Need Help??


in reply to Complicated pattern match

OK, so there have been several responses so far, and it appears that at least some of them work :-)

In the spirit of TMTOWTDI, I thought I would take a different approach. The problems dr_jgbn presents could be used to do several things, though the most obvious is to look for exons that are included in one transcript and not another. Also, the concern that Abagail-II raises is valid, therefore, I think it is reasonable and desireable to use industry standard analysis tools to arrive at a solution. What follows is a fairly long description of how I solved that problem.

First, one starts with two sequences that are related to one another. Typically, one would determine that with BLASTN. For this example, I used two sequences that are clearly related:
LOCUS       AF418600                3389 bp    mRNA    linear   PRI 30-DEC-2002
DEFINITION  Homo sapiens ATP-binding cassette protein C13 (ABCC13) mRNA,
            complete cds.
ACCESSION   AF418600
VERSION     AF418600.1  GI:19851883
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3389)
  AUTHORS   Yabuuchi,H., Takayanagi,S.-I., Yoshinaga,K., Taniguchi,N.,
            Aburatani,H. and Ishikawa,T.
  TITLE     ABCC13, an unusual truncated ABC transporter, is highly expressed
            in fetal human liver
  JOURNAL   Biochem. Biophys. Res. Commun. 299 (3), 410-417 (2002)
  MEDLINE   22334951
   PUBMED   12445816
REFERENCE   2  (bases 1 to 3389)
  AUTHORS   Yabuuchi,H., Takayanagi,S.-I. and Ishikawa,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (07-SEP-2001) Biomolecular Engineering, Tokyo Institute
            of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, Kanagawa
            226-8501, Japan
FEATURES             Location/Qualifiers
     source          1..3389
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="21"
                     /map="21q11.2"
                     /tissue_type="placenta"
     gene            1..3389
                     /gene="ABCC13"
     CDS             294..1271
                     /gene="ABCC13"
                     /note="transporter"
                     /codon_start=1
                     /product="ATP-binding cassette protein C13"
                     /protein_id="AAL99903.1"
                     /db_xref="GI:19851884"
                     /translation="MLSSTQNAGGSYQRVRGALDTQKCSPEKSASFFSKVTYSWFSRV
                     ITLGYKRPLEREDLFELKESDSFCTACPIFEKQWRKEVLRNQERQKVKVSCYKEAHIK
                     KPSLLYALWNTFKSILIQVALFKVFADILSFTSPLIMKQIIIFCEHSSDFGWNGYGYA
                     VALLVVVFLQTLILQQYQRFNMLTSAKVKTAVNGLIYKKVSLATLCVYFLLDEGNILT
                     ATKVFTSMSLFNILRIPLFELPTVISAVVQTKISLGRLEDFLNTEELLPQSIETNYTG
                     DHAIGFTDASFSWDKTGMPVLKEALWLMFLSRPGFRIAFCKKTFSLAPS"
BASE COUNT     1087 a    639 c    698 g    965 t
ORIGIN      
        1 ttcccgtctc cctccacacc ttgcctcctc caggcccaac cccattccac cagcaccccg
       61 ccccaggcgc actggctagg actggaaggc aaggactaag gtgaaagctg actgcccgca
      121 ccgccccagg cgcactgggt aggactgaaa ggcagggact aaggtgacag ctgactctgc
      181 cggggcggag tgcgtgtcct cccaccgtgc tggcgctgaa ctgactgtcc gctgccaagg
      241 gaagtgacag ccgcagccgg gctctcagcc agcggccggg cgcccagcgg accatgctct
      301 ccagtacgca gaacgcgggc ggctcctatc agcgggtccg cggggcgctt gatacacaga
      361 aatgcagtcc agagaaaagt gcctcatttt tcagtaaagt gacatattcc tggtttagca
      421 gagtaattac tttaggctat aagagacctt tggaaagaga ggatcttttt gaactaaagg
      481 aaagtgattc cttctgcact gcgtgtccca tctttgaaaa acaatggaga aaggaagttt
      541 taaggaatca agagaggcaa aaagtaaagg tatcttgtta taaagaggca catatcaaga
      601 aaccatctct actctatgca ttgtggaaca cctttaaatc catcctgatt caagttgcct
      661 tattcaaagt gtttgctgat attttgtcct tcactagccc actcataatg aagcaaatta
      721 tcattttctg tgaacacagc tcagattttg gctggaatgg ctatggctat gcagtggcac
      781 ttcttgttgt agtctttttg caaactctga ttcttcagca atatcaacgt tttaacatgc
      841 tcacctcagc aaaagttaag acagctgtaa atggactgat ctacaaaaag gtgtccctgg
      901 caaccttgtg tgtctatttc ttactggatg aaggaaacat tttaacagcc actaaagtgt
      961 tcacatcgat gtctttgttt aatattttaa ggattcctct gtttgagtta ccaaccgtga
     1021 tctcagctgt ggtccagaca aagatatccc tgggccgttt ggaagacttt ctcaacactg
     1081 aggagcttct tcctcaaagt attgaaacga actatacagg agatcatgct attgggttta
     1141 cagatgcttc tttctcctgg gataaaacag gaatgccagt tctaaaagag gctctgtggc
     1201 ttatgtttct cagcaggcct ggattcagaa ttgcattttg caagaaaaca ttctctttgg
     1261 ctccatcatg aagaaagagt tttatgagca agtattggaa gcctgtgctc tccttccaga
     1321 tttggagcag ttgccaaagg gagatcaaac tgagattgga gaaagagtaa ggaaacggct
     1381 gtgaatataa gtgggggcca gcagcatcga gtaagcctgg ccagagctgt ctacagtggg
     1441 gccgacgtct acctcctgga tgatcccttg tctgctattg atgttcatgt tggaaagcag
     1501 ctttttgaaa aagtgatagg gtccttgggc cttttgaaaa accgggtagc catagtttca
     1561 tttatgcttt cttgtggggc aggggatata tctagcactg tgcctaaaat tccatttatt
     1621 ttgtttattt gttgaagcaa agtagtcaca tttcacatga aaaaaatggg atgctcagat
     1681 tgcaatttca caggcaaatg tcaatgggta atttttgaag ttttacttac aagtttactc
     1741 tggtgacaat ttatgtgatt gttatctctt catctaatac catatcctaa ggggatgtat
     1801 gtgccataat gaattgttgt tgactcaagt gatttttaca atggtatttt ggtgcacacc
     1861 aaaatattaa taaaatttta tattattaac atagcaattt gacattatta tcacagcaat
     1921 catagtgatt cagctgtttt gttaaagggc caagaagatg aaaattagtg tttaccagta
     1981 catgccatat ttcacagggt aaattagttc aatccctatt tcacacataa aggagctcag
     2041 ggctcacaat gacttgatat aggttgctca acttaagaac aatagagatt tttaaaagta
     2101 tgtctgactc caaagccaca ttcttcataa tatgtgttta tggaattatt aatatagaga
     2161 actttgaaaa tatatgtata aaattataaa atatatgtct cagtaaaaat aaacaaacat
     2221 ctttatatat tgacataaac ccaagattaa aaatgtcaca cttaattgta tgtatatgta
     2281 taaaataata tgaatgattt aaaaataagg actatgaatt atgacttgga ttaaacactt
     2341 tttttgtaat ggtactgtat atttaaaaat aagcaagaat gacatatatg aaattgtttt
     2401 aaatttgcaa ttaaattgat tcatttctaa tgaacgaaaa aatacccgaa ctaaagtgtt
     2461 aaaaatcaca tttagaaact attttctcat ctagcagcat gtcattattc cttgtaaaat
     2521 acatagatgc tgccttaaac tggcacctac caaaaatcat actctgaaat agtaaaggaa
     2581 aaaagtcatt gtagcttttt ttaaaaagcc atttttcgta atctctgaga taaaacactt
     2641 tgtttttcag actcatattt tagtgacaca caatctcaca cttctgccac aaatgaatct
     2701 tattgttgta atgaaaagtg gcagaatagc tcaaatggga atataccagg agctgctgtg
     2761 taaaaccaaa aatctcacta atttgcacca agtcatcagt gaacaagaaa aagctcatgc
     2821 tctaaagcaa gtcagtgcaa tcaactccag aactagacca aaagacaaaa tcctggagca
     2881 aaaacatagt gggatatgag tccccttctg agagtcatgc aaggatttct aggcactgtt
     2941 aagaacccac gtagaaagaa gatgttaagt gacggtagaa ccagttttca gggttgcctg
     3001 attttttgaa gcagcacttc acagaagtcg aggtgaccag gattgacatg caagttctaa
     3061 gagacaggcc ctcattggat caaggaaaac agctctcaat gaaaaaagaa aagatccctg
     3121 ttggtgggtt gaagttctcc atcattctgc agtacctcca agcctttggc tggctctggg
     3181 tgtggctgac tgtggtcact tacttagggc agaatttggt gagcattgga cagaatctat
     3241 ggcttagtgc atgggcaaaa gaagcaaaaa acatgaatga gttcacagaa tggaagcaaa
     3301 taagaagcaa taaacttaat atctatggat tattgggatt aataaaaggt aaaagaaaat
     3361 gtaaaaaaaa aaaaaaaaaa aaaaaaaaa
//
and
LOCUS       ABCC13                  1073 bp    mRNA    linear   PRI 23-DEC-2002
DEFINITION  Homo sapiens ATP-binding cassette, sub-family C (CFTR/MRP), member
            13 (ABCC13), transcript variant 3, mRNA.
ACCESSION   NM_172026
VERSION     NM_172026.1  GI:25952076
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1073)
  AUTHORS   Allikmets,R., Gerrard,B., Hutchinson,A. and Dean,M.
  TITLE     Characterization of the human ABC superfamily: isolation and
            mapping of 21 new genes using the expressed sequence tags database
  JOURNAL   Hum. Mol. Genet. 5 (10), 1649-1655 (1996)
  MEDLINE   97049974
   PUBMED   8894702
REFERENCE   2  (bases 1 to 1073)
  AUTHORS   Yabuuchi,H., Takayanagi,S., Yoshinaga,K., Taniguchi,N.,
            Aburatani,H. and Ishikawa,T.
  TITLE     ABCC13, an unusual truncated ABC transporter, is highly expressed
            in fetal human liver
  JOURNAL   Biochem. Biophys. Res. Commun. 299 (3), 410-417 (2002)
  MEDLINE   22334951
   PUBMED   12445816
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AF518320.1 and AF418600.1.
            Summary: The protein encoded by this gene is a member of the
            superfamily of ATP-binding cassette (ABC) transporters. ABC
            proteins transport various molecules across extra- and
            intra-cellular membranes. ABC genes are divided into seven distinct
            subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, and White). This
            ABC transporter is a member of the MRP subfamily which is involved
            in multi-drug resistance. Alternative splicing of this gene results
            in multiple transcript variants; however, not all variants have
            been fully described.
            Transcript Variant: This variant (3) lacks a segment in the coding
            region, compared to transcript variant 1. This results in a longer
            protein (isoform c) compared to isoform a. This variant (3) has
            been alternatively referred to as transcript variant C.
            COMPLETENESS: complete on the 3' end.
FEATURES             Location/Qualifiers
     source          1..1073
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="21"
                     /map="21q11.1"
     gene            1..1073
                     /gene="ABCC13"
                     /note="synonyms: PRED6, C21orf73"
                     /db_xref="LocusID:150000"
     CDS             294..1031
                     /gene="ABCC13"
                     /note="ATP-binding cassette transporter C13; multidrug
                     resistance associated protein-like"
                     /codon_start=1
                     /product="ATP-binding cassette protein C13 isoform c"
                     /protein_id="NP_742023.1"
                     /db_xref="GI:25952077"
                     /db_xref="LocusID:150000"
                     /translation="MLSSTQNAGRLIHRVITLGYKRPLEREDLFELKESDSFCTACPI
                     FEKQWRKEVLRNQERQKVKVSCYKEAHIKKPSLLYALWNTFKSILIQVALFKVFADIL
                     SFTSPLIMKQIIIFCEHSSDFGWNGYGYAVALLVVVFLQTLILQQYQRFNMLTSAKVK
                     TAVNGLIYKKALLLSNVSRQKFSTGEIINLMSATHGLDSKPQSPLVCPFSNPNGRISP
                     LARAGSSSVSRGGSPCVCYTNKCFSCN"
     misc_feature    426..869
                     /gene="ABCC13"
                     /note="SunT; Region: ABC-type bacteriocin/lantibiotic
                     exporters, contain an N-terminal double-glycine peptidase
                     domain Defense mechanisms"
                     /db_xref="CDD:COG2274"
     misc_feature    516..869
                     /gene="ABCC13"
                     /note="MdlB; Region: ABC-type multidrug transport system,
                     ATPase and permease components Defense mechanisms"
                     /db_xref="CDD:COG1132"
     misc_feature    558..869
                     /gene="ABCC13"
                     /note="ABC_membrane; Region: ABC transporter transmembrane
                     region. This family represents a unit of six transmembrane
                     helices. Many members of the ABC transporter family
                     (pfam00005) have two such regions"
                     /db_xref="CDD:pfam00664"
     variation       889
                     /gene="ABCC13"
                     /allele="G"
                     /allele="A"
                     /db_xref="dbSNP:2822558"
     polyA_signal    1032..1037
                     /gene="ABCC13"
     polyA_site      1052
                     /gene="ABCC13"
BASE COUNT      306 a    261 c    238 g    268 t
ORIGIN      
        1 ttcccgtctc cctccacacc ttgcctcctc caggcccaac cccattccac cagcaccccg
       61 ccccaggcgc actggctagg actggaaggc aaggactaag gtgaaagctg actgcccgca
      121 ccgccccagg cgcactgggt aggactgaaa ggcagggact aaggtgacag ctgactctgc
      181 cggggcggag tgcgtgtcct cccaccgtgc tggcgctgaa ctgactgtcc gctgccaagg
      241 gaagtgacag ccgcagccgg gctctcagcc agcggccagg cgccccgcgg accatgctct
      301 ccagtacgca gaacgcgggg cgcttgatac acagagtaat tactttaggc tataagagac
      361 ctttggaaag agaggatctt tttgaactaa aggaaagtga ttccttctgc actgcgtgtc
      421 ccatctttga aaaacaatgg agaaaggaag ttttaaggaa tcaagagagg caaaaagtaa
      481 aggtatcttg ttataaagag gcacatatca agaaaccatc tctactctat gcattgtgga
      541 acacctttaa atccatcctg attcaagttg ccttattcaa agtgtttgct gatattttgt
      601 ccttcactag cccactcata atgaagcaaa ttatcatttt ctgtgaacac agctcagatt
      661 ttggctggaa tggctatggc tatgcagtgg cacttcttgt tgtagtcttt ttgcaaactc
      721 tgattcttca gcaatatcaa cgttttaaca tgctcacctc agcaaaagtt aagacagctg
      781 taaatggact gatctacaaa aaggccttac ttttatcaaa tgtttctcga caaaagtttt
      841 ccactgggga aattattaac ttgatgtcag caactcatgg acttgacagc aaacctcaat
      901 ctcctctggt ctgccccttt tcaaatccta atggccgtat atctcctttg gcaagagctg
      961 ggtccagcag tgttagcagg ggtggcagtc cttgtgtttg ttataccaat aaatgcttta
     1021 gctgcaacta aaataaaaaa gttaaaggta agaaaaaaaa aaaaaaaaaa aaa
//
Using these as starting sequences, I wrote a script that does a bl2seq (blast two sequences against each other), then look at the results, determine what is missing with respect to each sequence. The bl2seq output is shown here:
Query= a.seq
         (3389 letters)

>b.seq
          Length = 1073

 Score =  938 bits (473), Expect = 0.0
 Identities = 473/473 (100%)
 Strand = Plus / Plus


Query: 419 cagagtaattactttaggctataagagacctttggaaagagaggatctttttgaactaaa 478
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 332 cagagtaattactttaggctataagagacctttggaaagagaggatctttttgaactaaa 391


Query: 479 ggaaagtgattccttctgcactgcgtgtcccatctttgaaaaacaatggagaaaggaagt 538
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 392 ggaaagtgattccttctgcactgcgtgtcccatctttgaaaaacaatggagaaaggaagt 451


Query: 539 tttaaggaatcaagagaggcaaaaagtaaaggtatcttgttataaagaggcacatatcaa 598
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 452 tttaaggaatcaagagaggcaaaaagtaaaggtatcttgttataaagaggcacatatcaa 511


Query: 599 gaaaccatctctactctatgcattgtggaacacctttaaatccatcctgattcaagttgc 658
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 512 gaaaccatctctactctatgcattgtggaacacctttaaatccatcctgattcaagttgc 571


Query: 659 cttattcaaagtgtttgctgatattttgtccttcactagcccactcataatgaagcaaat 718
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 572 cttattcaaagtgtttgctgatattttgtccttcactagcccactcataatgaagcaaat 631


Query: 719 tatcattttctgtgaacacagctcagattttggctggaatggctatggctatgcagtggc 778
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 632 tatcattttctgtgaacacagctcagattttggctggaatggctatggctatgcagtggc 691


Query: 779 acttcttgttgtagtctttttgcaaactctgattcttcagcaatatcaacgttttaacat 838
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 692 acttcttgttgtagtctttttgcaaactctgattcttcagcaatatcaacgttttaacat 751


Query: 839 gctcacctcagcaaaagttaagacagctgtaaatggactgatctacaaaaagg 891
           |||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 752 gctcacctcagcaaaagttaagacagctgtaaatggactgatctacaaaaagg 804



 Score =  617 bits (311), Expect = e-180
 Identities = 317/319 (99%)
 Strand = Plus / Plus


Query: 1   ttcccgtctccctccacaccttgcctcctccaggcccaaccccattccaccagcaccccg 60
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1   ttcccgtctccctccacaccttgcctcctccaggcccaaccccattccaccagcaccccg 60


Query: 61  ccccaggcgcactggctaggactggaaggcaaggactaaggtgaaagctgactgcccgca 120
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 61  ccccaggcgcactggctaggactggaaggcaaggactaaggtgaaagctgactgcccgca 120


Query: 121 ccgccccaggcgcactgggtaggactgaaaggcagggactaaggtgacagctgactctgc 180
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 121 ccgccccaggcgcactgggtaggactgaaaggcagggactaaggtgacagctgactctgc 180


Query: 181 cggggcggagtgcgtgtcctcccaccgtgctggcgctgaactgactgtccgctgccaagg 240
           ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 181 cggggcggagtgcgtgtcctcccaccgtgctggcgctgaactgactgtccgctgccaagg 240


Query: 241 gaagtgacagccgcagccgggctctcagccagcggccgggcgcccagcggaccatgctct 300
           ||||||||||||||||||||||||||||||||||||| ||||||| ||||||||||||||
Sbjct: 241 gaagtgacagccgcagccgggctctcagccagcggccaggcgccccgcggaccatgctct 300


Query: 301 ccagtacgcagaacgcggg 319
           |||||||||||||||||||
Sbjct: 301 ccagtacgcagaacgcggg 319



 Score = 44.1 bits (22), Expect = 2e-07
 Identities = 22/22 (100%)
 Strand = Plus / Plus


Query: 339 cgcggggcgcttgatacacaga 360
           ||||||||||||||||||||||
Sbjct: 314 cgcggggcgcttgatacacaga 335


Lambda     K      H
    1.37    0.711     1.31

Gapped
Lambda     K      H
    1.37    0.711     1.31


Matrix: blastn matrix:1 -3
Gap Penalties: Existence: 5, Extension: 2
Number of Hits to DB: 6
Number of Sequences: 0
Number of extensions: 6
Number of successful extensions: 6
Number of sequences better than 1.0e-04: 1
Number of HSP's better than  0.0 without gapping: 1
Number of HSP's successfully gapped in prelim test: 0
Number of HSP's that attempted gapping in prelim test: 0
Number of HSP's gapped (non-prelim): 4
length of query: 3389
length of database: 1073
effective HSP length: 10
effective length of query: 3379
effective length of database: 1063
effective search space:  3591877
effective search space used:  3591877
T: 0
A: 0
X1: 6 (11.9 bits)
X2: 15 (29.7 bits)
S1: 12 (24.3 bits)
S2: 18 (36.2 bits)
Finally, the code is below. It uses Bio::Tools::BPbl2seq, which is part of BioPerl. Note that the code could do several more things, like using BioPerl tools to do the blast on a remote server and turning the sequences into Bio::Seq objects, so that object methods could be used. For this example, that didn't seem necessary. There are also analysis options that one might want to consider; I'll address those after the output.
#!/usr/bin/perl -w use strict; use Bio::Tools::BPbl2seq; my $AFILE = 'a.seq'; my $BFILE = 'b.seq'; my $OUTFILE = 'bl2seq.out'; system('blast/bl2seq', '-i', $AFILE, '-j', $BFILE, '-o', $OUTFILE, '-p', 'blastn', '-e', '1e-4'); my $report = Bio::Tools::BPbl2seq->new(-file => $OUTFILE, -queryname => 'a.seq'); my (@Aranges,@Branges); while (my $hsp = $report->next_feature) { push @Aranges, [$hsp->query->start, $hsp->query->end]; push @Branges, [$hsp->sbjct->start, $hsp->sbjct->end]; } my $Arange_ref = \@Aranges; my $Brange_ref = \@Branges; my $sortedA = sort_array($Arange_ref); my $sortedB = sort_array($Brange_ref); my $Aseq = get_seq($AFILE); my $Bseq = get_seq($BFILE); my $Agaps = find_gaps(length $Aseq, $sortedA); my $Bgaps = find_gaps(length $Bseq, $sortedB); print_gaps('A',$Aseq,$Agaps); print_gaps('B',$Bseq,$Bgaps); #------------- sub sort_array { my $array_ref = shift; my @array = @{$array_ref}; for (my $i = 0; $i<((scalar @array)-1); $i++) { if ($array[$i][0] > $array[$i+1][0]) { ($array[$i], $array[$i+1]) = ($array[$i+1], $array[$i]); } } my $return_ref = \@array; return $return_ref; } #------------- sub find_gaps { my $len = shift; my $array_ref = shift; my @array = @{$array_ref}; my @gaps; if ($array[0][0] > 1) { push @gaps, (1, ($array[0]->[0]-1)); } for (my $i = 1; $i<(scalar @array); $i++) { if ($array[$i-1][1] < $array[$i][0]) { push @gaps, [ ($array[$i-1][1]+1), ($array[$i][0]-1) ]; } } if ($array[-1][1] < $len) { push @gaps, [ ($array[-1][1]+1), $len]; } my $return_ref = \@gaps; return $return_ref; } #------------ sub get_seq { my $filename = shift; my $seq = ''; open FILE, $filename or die "couldn't open $filename: $!\n"; while (<FILE>) { chomp; if (/^>/) { next; } $seq .= $_; } close FILE; $seq; } #------------ sub print_gaps { my $seqname = shift; my $seqin = shift; my $array_ref = shift; my @array = @{$array_ref}; if (scalar @array > 0) { print "$seqname gaps:\n"; foreach my $gap (@array) { my $start = ${$gap}[0]; my $end = ${$gap}[1]; my $seq = substr ($seqin, $start, ($end-$start) ); print "$start - $end:$seq\n"; } } } <code> Which results in this output: <code> A gaps: 320 - 338:ggctcctatcagcgggtc 361 - 418:atgcagtccagagaaaagtgcctcatttttcagtaaagtgacatattcctggtttag 892 - 3389:gtccctggcaaccttgtgtgtctatttcttactggatgaaggaaacattttaacagcca +ctaaagtgtt cacatcgatgtctttgtttaatattttaaggattcctctgtttgagttaccaaccgtgatctcagctgtg +gtccagacaa agatatccctgggccgtttggaagactttctcaacactgaggagcttcttcctcaaagtattgaaacgaa +ctatacagga gatcatgctattgggtttacagatgcttctttctcctgggataaaacaggaatgccagttctaaaagagg +ctctgtggct tatgtttctcagcaggcctggattcagaattgcattttgcaagaaaacattctctttggctccatcatga +agaaagagtt ttatgagcaagtattggaagcctgtgctctccttccagatttggagcagttgccaaagggagatcaaact +gagattggag aaagagtaaggaaacggctgtgaatataagtgggggccagcagcatcgagtaagcctggccagagctgtc +tacagtgggg ccgacgtctacctcctggatgatcccttgtctgctattgatgttcatgttggaaagcagctttttgaaaa +agtgataggg tccttgggccttttgaaaaaccgggtagccatagtttcatttatgctttcttgtggggcaggggatatat +ctagcactgt gcctaaaattccatttattttgtttatttgttgaagcaaagtagtcacatttcacatgaaaaaaatggga +tgctcagatt gcaatttcacaggcaaatgtcaatgggtaatttttgaagttttacttacaagtttactctggtgacaatt +tatgtgattg ttatctcttcatctaataccatatcctaaggggatgtatgtgccataatgaattgttgttgactcaagtg +atttttacaa tggtattttggtgcacaccaaaatattaataaaattttatattattaacatagcaatttgacattattat +cacagcaatc atagtgattcagctgttttgttaaagggccaagaagatgaaaattagtgtttaccagtacatgccatatt +tcacagggta aattagttcaatccctatttcacacataaaggagctcagggctcacaatgacttgatataggttgctcaa +cttaagaaca atagagatttttaaaagtatgtctgactccaaagccacattcttcataatatgtgtttatggaattatta +atatagagaa ctttgaaaatatatgtataaaattataaaatatatgtctcagtaaaaataaacaaacatctttatatatt +gacataaacc caagattaaaaatgtcacacttaattgtatgtatatgtataaaataatatgaatgatttaaaaataagga +ctatgaatta tgacttggattaaacactttttttgtaatggtactgtatatttaaaaataagcaagaatgacatatatga +aattgtttta aatttgcaattaaattgattcatttctaatgaacgaaaaaatacccgaactaaagtgttaaaaatcacat +ttagaaacta ttttctcatctagcagcatgtcattattccttgtaaaatacatagatgctgccttaaactggcacctacc +aaaaatcata ctctgaaatagtaaaggaaaaaagtcattgtagctttttttaaaaagccatttttcgtaatctctgagat +aaaacacttt gtttttcagactcatattttagtgacacacaatctcacacttctgccacaaatgaatcttattgttgtaa +tgaaaagtgg cagaatagctcaaatgggaatataccaggagctgctgtgtaaaaccaaaaatctcactaatttgcaccaa +gtcatcagtg aacaagaaaaagctcatgctctaaagcaagtcagtgcaatcaactccagaactagaccaaaagacaaaat +cctggagcaa aaacatagtgggatatgagtccccttctgagagtcatgcaaggatttctaggcactgttaagaacccacg +tagaaagaag atgttaagtgacggtagaaccagttttcagggttgcctgattttttgaagcagcacttcacagaagtcga +ggtgaccagg attgacatgcaagttctaagagacaggccctcattggatcaaggaaaacagctctcaatgaaaaaagaaa +agatccctgt tggtgggttgaagttctccatcattctgcagtacctccaagcctttggctggctctgggtgtggctgact +gtggtcactt acttagggcagaatttggtgagcattggacagaatctatggcttagtgcatgggcaaaagaagcaaaaaa +catgaatgag ttcacagaatggaagcaaataagaagcaataaacttaatatctatggattattgggattaataaaaggta +aaagaaaatg taaaaaaaaaaaaaaaaaaaaaaaaaaa B gaps: 805 - 1073:cttacttttatcaaatgtttctcgacaaaagttttccactggggaaattattaacttga +tgtcagcaac tcatggacttgacagcaaacctcaatctcctctggtctgccccttttcaaatcctaatggccgtatatct +cctttggcaa gagctgggtccagcagtgttagcaggggtggcagtccttgtgtttgttataccaataaatgctttagctg +caactaaaat aaaaaagttaaaggtaagaaaaaaaaaaaaaaaaaaaaa
So, ignoring the two end pieces that are different, there appear to be two short pieces of sequence A that contains sequence that is not contained in sequence B. There are several caveats that go along with this data. First, because BLAST filters out simple repeats, it is possible that some edges are missing. Also, there are frequently overlaps around the edges of BLAST hits, since the junctions around exons frequently look similar. So a more detailed analysis of the found exons is usually required. At a minimum, that analysis should including looking for likely splice sites. If the genome is available, the found exons could be blasted against the genome and the up and down stream regions could be cropped out and run through a splice finder. Along with the splice sites, it is frequently useful to look at the resulting translation into protein to see how the predicted splicing would affect the protein (ie, does it result in early termination, does it include or exclude a potentially important motif, etc).

Scott
Project coordinator of the Generic Model Organism Database Project


Comment on Re: Complicated pattern match
Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://228578]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2014-08-01 04:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (256 votes), past polls