Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Identifying Overlapping Area in a Set of Strings

by monkfan (Curate)
on Jul 29, 2005 at 08:48 UTC ( #479296=perlquestion: print w/ replies, xml ) Need Help??
monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
Given a full string ($fseq) and its set of substrings (@$nsub). And @$nsub always comes (extracted) from $fseq.Although they may or may not be the 'complete' substrings.
I wish to identify the overlapping region of the given set of substrings and replace them with '-' (as shown with $result variable - the intended result). To be precise, I want to remove the overlapping region of two subsequent strings in the set (@$nsub).

I really have no idea how can I approach this problem.
I have aligned manually the substring set (@$nsub) to show how they are overlapping, as well as the alignment of the desired result after removing the overlapping area. I hope the examples are clear.
#!/usr/bin/perl -w use strict; use Data::Dumper; my $fseq = 'CCCCGCGC'; my $nsub = ['CCCCG', 'CCCGC', 'CGCGC', ]; my $result =['C----', # this is not a parameter but '----C', # intended result obtained with Dumper '---GC' ]; my @result = remove_overlap($fseq,$nsub); print Dumper \@result; # I'm not so sure how to approach # this problem with the subroutine below sub remove_overlap { my ($fseq,$nsub) = @_; my @result; my $slen = length $nsub->[0]; foreach (@{$nsub}) { # ???? my $pos = index($fseq,$_); print "$pos\n"; } return @result; } #---------- The other instances of the problem #---------- with the intended $result my $fseq2 = 'CGTGGCGC'; my $nsub2 = [ 'GTGGC', 'TGGCG']; my $result2 =[ 'G----', # intended result '----G']; my $fseq3 = 'CCGCGCTC'; my $nsub3 = [ 'CCGCG', 'CGCGC', 'GCGCT', 'CGCTC']; my $result3 = ['C----', # intended result '----C', '----T', '----C']; # No overlapping case my $fseq4 = 'AAGCTATA'; my $nsub4 = [ 'AAGC', 'TATA']; my $result4 = ['AAGC','TATA']; # intended result
Thanks so much beforehand.
Regards,
Edward

Comment on Identifying Overlapping Area in a Set of Strings
Download Code
Re: Identifying Overlapping Area in a Set of Strings
by blazar (Canon) on Jul 29, 2005 at 08:54 UTC
    Unfortunately I can't help you. Especially since I'm not really sure to have understood what your problem really is. Only, at first sight, it seems to me you're using quite a lot of arrayrefs where -unless I'm missing something obvious- a real array could be used instead. Any particular reason for this? Just curious...

    More on topic, to identify overlapping regions of strings you may consider using the bitwise operators e.g. $str1^$str2 will give you a sequence of null chars where $str1 and $str2 overlap.

    PS: one thing I don't understand is that you refer to "its set of substrings" but in the explicit examples you give it doesn't seem that @$nsub is the set of all substrings of the given string. Also you emphasize that you want to identify overlapping regions of "subsequent" strings, which suggests that order does matter, so that my guess is that you really have "a list of substrings of a given string". Is this more close to the point?

      Hi,
      it doesn't seem that @$nsub is the set of all substrings of the given string.
      Yes you are right. They may or may not be the complete set of substrings.
      which suggests that order does matter, so that my guess is that you really have "a list of substrings of a given string".
      Again you are right!. They are already "ordered", that's why the set is an array.
      Regards,
      Edward
        Incidentally this is why the computer-theoretic concept of array is not the most appropriate correspondent of the mathematical concept of set.
Re: Identifying Overlapping Area in a Set of Strings
by rnahi (Curate) on Jul 29, 2005 at 09:27 UTC

    Perhaps not the best efficient way, but it does what you ask:

    use warnings; use Data::Dumper; my $fseq = 'CCCCGCGC'; my @nsub = ('CCCCG', 'CCCGC', 'CGCGC'); my @results; for (1 .. $#nsub) { my $current = $nsub[$_]; my $previous = $nsub[$_ -1]; # the "#" is used as a separator. # put a different character (or sequence of characters) # if you believe that it could also be in your strings if ( "$previous#$current" =~ /(\w+)#\1/ ) { my $found = $1; printf "%d -> %s (%s) %d -> %s \n", $_ -1, $previous, $found, $_, $current; $current =~ s/^$found/"-" x length($found)/e; $previous =~ s/$found$/"-" x length($found)/e; push @results, [ $_ -1, $previous]; push @results, [ $_, $current]; } else { printf "%d -> no overlap\n", $_ } } print Data::Dumper->Dump([ \@results], ['result']);

    Result: (adjusted)

    0 -> CCCCG (CCCG) 1 -> CCCGC 1 -> CCCGC (CGC) 2 -> CGCGC $result = [ [ 0, 'C----' ], [ 1, '----C' ], [ 1, 'CC---' ], [ 2, '---GC' ] ];
      Hi rnahi,
      Thanks so much for your answers. Your solutions provide the correct result but also more.
      my $fseq1= 'CCCCGCGC'; my @nsub1= ('CCCCG', 'CCCGC', 'CGCGC'); #produces $result1 = [ [ 0, 'C----' ], [ 1, '----C' ], [ 1, 'CC---' ], # but this is extra [ 2, '---GC' ] ];
      And this

      How can I modify your code such that it simply gives:
      Regards,
      Edward
        How can I modify your code such that it simply gives: ...

        Quite simple:

        my @results; my %seen; for (1 .. $#nsub) { my $current = $nsub[$_]; my $previous = $nsub[$_ -1]; if ( "$previous#$current" =~ /(\w+)#\1/ ) { my $found = $1; printf "%d -> %s (%s) %d -> %s \n", $_ -1, $previous, $found, $_, $current; $current =~ s/^$found/"-" x length($found)/e; $previous =~ s/$found$/"-" x length($found)/e; push @results, [ $_ -1, $previous] unless $seen{$_ -1}++; push @results, [ $_, $current] unless $seen{$_}++; } else { printf "%d -> no overlap\n", $_ } } print Data::Dumper->Dump([ \@results], ['result']);
Re: Identifying Overlapping Area in a Set of Strings
by reasonablekeith (Deacon) on Jul 29, 2005 at 09:31 UTC
    Well the following replaces the front of the strings as per your spec, I couldn't be arsed to do the end of the strings, as you need to know in advance where the next string starts (and if its before the endpoint of your current string). If might be best to search first and save the start/end positions of the substrings first and then go through replacing values.

    This should be a good start though.

    #!/perl -w use strict; my $string = "abcdefghijklmnopqrstuvwxyz"; my @sub_strings = ('cdef', 'efghij', 'klmno', 'mnopqrst'); my $string_index = 0; my $last_end_point = 0; foreach my $sub_string (@sub_strings) { my $string_index = index($string, $sub_string, $string_index); die ("Couldn't find string") if $string_index == -1; my $overlap = $last_end_point - $string_index; if ($overlap > 0) { substr($sub_string, 0, $overlap) = '-'x$overlap; } $last_end_point = $string_index + (length($sub_string)); } print join(',', @sub_strings);
    update: removed stupid variable declaration
    ---
    my name's not Keith, and I'm not reasonable.
      with inspiration from rnahi i've updated my script to do all that you want. It ought to be a bit quicker that rnahi's just because of the use if index over regular expressions.

      Anyway a BIG caveat to all this is that these result are ambiguous if there could be multiple matches for a given substring. For example searching for "GG" twice in "GGGG" could give you.

       ORG:  "GGGG";
       SUB   "GG"
               "GG"; # no overlapping
       or
             "--"
             "--";  # full overlapping
      
      so depending on whether you start searching for the second GG _after_ the end of the first GG, (ie from the third character onwards) or from the _beginning_ of first match (first character) you get entirely different results, both of which are correct.

      It's hard to say which is best, you need to look more closely at what you're trying to acheve.

      regardless my updated code is as follows...

      #!/perl -w use strict; my $string = "abcdefghijklmnopqrstuvwxyz"; my @sub_strings = ('cdef', 'efghij', 'klmno', 'mnopqrst'); my $string_index = 0; my $last_end_point = index($string, $sub_strings[0]) + length($sub_str +ings[0]); foreach my $sub_array_index (1..$#sub_strings) { my $string_index = index($string, $sub_strings[$sub_array_index], +$string_index); my $overlap = $last_end_point - $string_index; # if there's an overlap replace from the front of current string, +and the # end of the previous string if ($overlap > 0) { substr($sub_strings[$sub_array_index], 0, $overlap) = '-'x$ove +rlap; substr($sub_strings[$sub_array_index-1], -$overlap, $overlap) += '-'x$overlap; } $last_end_point = $string_index + length($sub_strings[$sub_array_i +ndex]); }
      updated mahi->rnahi after tlm's comments
      ---
      my name's not Keith, and I'm not reasonable.
Re: Identifying Overlapping Area in a Set of Strings (Off topic)
by PreferredUserName (Pilgrim) on Jul 29, 2005 at 13:53 UTC
    Hey ewijaya,

    Is there a job market for programmers who are interested in this kind of thing, but are not biologists/chemists by training/education?

    I'm curious.

    Thanks!

    Considered by jdporter: Delete. Personal message.
    Unconsidered by Arunbear: Question may be of interest to other monks; Keep/Edit/Delete: 5/1/7

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://479296]
Approved by Tanalis
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (16)
As of 2014-07-28 15:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (201 votes), past polls