monkfan has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
Given a full string ($fseq) and its set of substrings (@$nsub).
And @$nsub always comes (extracted) from $fseq.Although they may or may not be the 'complete' substrings.
I wish to identify the overlapping region of the given set
of substrings and replace them with '-' (as shown with $result variable - the intended result). To be precise, I want to remove the overlapping region
of two subsequent strings in the set (@$nsub).
I really have no idea how can I approach this problem.
I have aligned manually the substring set (@$nsub) to show
how they are overlapping, as well as the alignment of
the desired result after removing the overlapping area.
I hope the examples are clear.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $fseq = 'CCCCGCGC';
my $nsub = ['CCCCG',
'CCCGC',
'CGCGC', ];
my $result =['C----', # this is not a parameter but
'----C', # intended result obtained with Dumper
'---GC' ];
my @result = remove_overlap($fseq,$nsub);
print Dumper \@result;
# I'm not so sure how to approach
# this problem with the subroutine below
sub remove_overlap
{
my ($fseq,$nsub) = @_;
my @result;
my $slen = length $nsub->[0];
foreach (@{$nsub})
{
# ????
my $pos = index($fseq,$_);
print "$pos\n";
}
return @result;
}
#---------- The other instances of the problem
#---------- with the intended $result
my $fseq2 = 'CGTGGCGC';
my $nsub2 = [ 'GTGGC',
'TGGCG'];
my $result2 =[ 'G----', # intended result
'----G'];
my $fseq3 = 'CCGCGCTC';
my $nsub3 = [ 'CCGCG',
'CGCGC',
'GCGCT',
'CGCTC'];
my $result3 = ['C----', # intended result
'----C',
'----T',
'----C'];
# No overlapping case
my $fseq4 = 'AAGCTATA';
my $nsub4 = [ 'AAGC',
'TATA'];
my $result4 = ['AAGC','TATA']; # intended result
Thanks so much beforehand.
Re: Identifying Overlapping Area in a Set of Strings
by rnahi (Curate) on Jul 29, 2005 at 09:27 UTC
|
Perhaps not the best efficient way, but it does what you ask:
use warnings;
use Data::Dumper;
my $fseq = 'CCCCGCGC';
my @nsub = ('CCCCG', 'CCCGC', 'CGCGC');
my @results;
for (1 .. $#nsub)
{
my $current = $nsub[$_];
my $previous = $nsub[$_ -1];
# the "#" is used as a separator.
# put a different character (or sequence of characters)
# if you believe that it could also be in your strings
if ( "$previous#$current" =~ /(\w+)#\1/ )
{
my $found = $1;
printf "%d -> %s (%s) %d -> %s \n",
$_ -1,
$previous,
$found,
$_,
$current;
$current =~ s/^$found/"-" x length($found)/e;
$previous =~ s/$found$/"-" x length($found)/e;
push @results, [ $_ -1, $previous];
push @results, [ $_, $current];
}
else
{
printf "%d -> no overlap\n", $_
}
}
print Data::Dumper->Dump([ \@results], ['result']);
Result: (adjusted)
0 -> CCCCG (CCCG) 1 -> CCCGC
1 -> CCCGC (CGC) 2 -> CGCGC
$result = [
[ 0, 'C----' ],
[ 1, '----C' ],
[ 1, 'CC---' ],
[ 2, '---GC' ]
];
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Hi rnahi,
Thanks so much for your answers. Your solutions
provide the correct result but also more.
my $fseq1= 'CCCCGCGC';
my @nsub1= ('CCCCG',
'CCCGC',
'CGCGC');
#produces
$result1 = [
[ 0, 'C----' ],
[ 1, '----C' ],
[ 1, 'CC---' ], # but this is extra
[ 2, '---GC' ]
];
And this
How can I modify your code such that it simply
gives:
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
How can I modify your code such that it simply gives: ...
Quite simple:
my @results;
my %seen;
for (1 .. $#nsub)
{
my $current = $nsub[$_];
my $previous = $nsub[$_ -1];
if ( "$previous#$current" =~ /(\w+)#\1/ )
{
my $found = $1;
printf "%d -> %s (%s) %d -> %s \n",
$_ -1,
$previous,
$found,
$_,
$current;
$current =~ s/^$found/"-" x length($found)/e;
$previous =~ s/$found$/"-" x length($found)/e;
push @results, [ $_ -1, $previous] unless $seen{$_ -1}++;
push @results, [ $_, $current] unless $seen{$_}++;
}
else
{
printf "%d -> no overlap\n", $_
}
}
print Data::Dumper->Dump([ \@results], ['result']);
| [reply] [Watch: Dir/Any] [d/l] |
|
Re: Identifying Overlapping Area in a Set of Strings
by reasonablekeith (Deacon) on Jul 29, 2005 at 09:31 UTC
|
#!/perl -w
use strict;
my $string = "abcdefghijklmnopqrstuvwxyz";
my @sub_strings = ('cdef', 'efghij', 'klmno', 'mnopqrst');
my $string_index = 0;
my $last_end_point = 0;
foreach my $sub_string (@sub_strings) {
my $string_index = index($string, $sub_string, $string_index);
die ("Couldn't find string") if $string_index == -1;
my $overlap = $last_end_point - $string_index;
if ($overlap > 0) {
substr($sub_string, 0, $overlap) = '-'x$overlap;
}
$last_end_point = $string_index + (length($sub_string));
}
print join(',', @sub_strings);
update: removed stupid variable declaration
---
my name's not Keith, and I'm not reasonable.
| [reply] [Watch: Dir/Any] [d/l] |
|
ORG: "GGGG";
SUB "GG"
"GG"; # no overlapping
or
"--"
"--"; # full overlapping
so depending on whether you start searching for the second GG _after_ the end of the first GG,
(ie from the third character onwards) or from the _beginning_ of first match (first character)
you get entirely different results, both of which are correct.
It's hard to say which is best, you need to look more closely at what you're trying to acheve.
regardless my updated code is as follows...
#!/perl -w
use strict;
my $string = "abcdefghijklmnopqrstuvwxyz";
my @sub_strings = ('cdef', 'efghij', 'klmno', 'mnopqrst');
my $string_index = 0;
my $last_end_point = index($string, $sub_strings[0]) + length($sub_str
+ings[0]);
foreach my $sub_array_index (1..$#sub_strings) {
my $string_index = index($string, $sub_strings[$sub_array_index],
+$string_index);
my $overlap = $last_end_point - $string_index;
# if there's an overlap replace from the front of current string,
+and the
# end of the previous string
if ($overlap > 0) {
substr($sub_strings[$sub_array_index], 0, $overlap) = '-'x$ove
+rlap;
substr($sub_strings[$sub_array_index-1], -$overlap, $overlap)
+= '-'x$overlap;
}
$last_end_point = $string_index + length($sub_strings[$sub_array_i
+ndex]);
}
updated mahi->rnahi after tlm's comments
---
my name's not Keith, and I'm not reasonable.
| [reply] [Watch: Dir/Any] [d/l] |
Re: Identifying Overlapping Area in a Set of Strings
by blazar (Canon) on Jul 29, 2005 at 08:54 UTC
|
Unfortunately I can't help you. Especially since I'm not really sure to have understood what your problem really is. Only, at first sight, it seems to me you're using quite a lot of arrayrefs where -unless I'm missing something obvious- a real array could be used instead. Any particular reason for this? Just curious...
More on topic, to identify overlapping regions of strings you may consider using the bitwise operators e.g. $str1^$str2 will give you a sequence of null chars where $str1 and $str2 overlap.
PS: one thing I don't understand is that you refer to "its set of substrings" but in the explicit examples you give it doesn't seem that @$nsub is the set of all substrings of the given string. Also you emphasize that you want to identify overlapping regions of "subsequent" strings, which suggests that order does matter, so that my guess is that you really have "a list of substrings of a given string". Is this more close to the point? | [reply] [Watch: Dir/Any] [d/l] [select] |
|
Hi,
it doesn't seem that @$nsub is the set of all substrings of the given string.
Yes you are right. They may or may not be the complete
set of substrings.
which suggests that order does matter, so that my guess is that you really have "a list of substrings of a given string".
Again you are right!. They are already "ordered",
that's why the set is an array.
| [reply] [Watch: Dir/Any] |
|
Incidentally this is why the computer-theoretic concept of array is not the most appropriate correspondent of the mathematical concept of set.
| [reply] [Watch: Dir/Any] |
Re: Identifying Overlapping Area in a Set of Strings (Off topic)
by PreferredUserName (Pilgrim) on Jul 29, 2005 at 13:53 UTC
|
Hey ewijaya,
Is there a job market for programmers who
are interested in this kind of thing, but
are not biologists/chemists by training/education?
I'm curious.
Thanks!
Considered by jdporter: Delete. Personal message.
Unconsidered by Arunbear: Question may be of interest to other monks; Keep/Edit/Delete: 5/1/7
| [reply] [Watch: Dir/Any] |
|
|