Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Bracketing Substring(s) in the String

by monkfan (Curate)
on Aug 25, 2005 at 15:18 UTC ( #486585=perlquestion: print w/ replies, xml ) Need Help??
monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I am trying to put the bracket in a string given the set of its substrings. Like this:
String Substrings Desired Result 1.CCCATCTGTCCTTATTTGCTG ATCTG ATTTG CCC[ATCTG]TCCTT[ATTTG]CTG 2.ACCCATCTGTCCTTGGCCAT CCATC AC[CCATC]TGTCCTTGGCCAT 3.CCACCAGCACCTGTC CCACC CCAGC GCACC [CCACCAGCACC]TGTC * 4.CCCAACACCTGCTGCCT CCAAC ACACC C[CCAACACC]TGCTGCCT *
My code here does the job, but only for the first two cases (where the substrings don't overlap in the string). It fails to deal with case 3 and 4 (marked with asterisks *).
#!/usr/bin/perl -w use strict; my $s1 ='CCCATCTGTCCTTATTTGCTG'; my @a1 = qw(ATCTG ATTTG); my $s2 ='ACCCATCTGTCCTTGGCCAT'; my @a2 = qw(CCATC); my $s3 ='CCACCAGCACCTGTC'; my @a3 = qw(CCACC CCAGC GCACC); my $s4 ='CCCAACACCTGCTGCCT'; my @a4 = qw(CCAAC ACACC); #These two work fine. put_bracket($s1,\@a1); put_bracket($s2,\@a2); #But these two don't work put_bracket($s3,\@a3); put_bracket($s4,\@a4); sub put_bracket { my ($str,$ar) = @_; my $bstr; my $slen = length $ar->[0]; foreach my $subs ( @$ar ) { my $idx = index($str,$subs); my $bgn = $idx; my $end = $idx+$slen+1; substr($str,$bgn,0,"["); substr($str,$end,0,"]"); } print "$str\n"; return ; }

How can I resolve this problem? Am I approaching the problem correctly in the first place? I humbly seek your advice.

Regards,
Edward

Comment on Bracketing Substring(s) in the String
Select or Download Code
Re: Bracketing Substring(s) in the String
by SimonClinch (Chaplain) on Aug 25, 2005 at 15:27 UTC
    The code itself is okay, except that that arrays 3 and 4 do not contain a non-overlapping set of substrings of the string they are operating on and will therefore not meet the substitution criteria later in the program. To get the desired result:
    my @a3 = qw( CCACCAGCACC ); my @a4 = qw( CCAACACC );

    One world, one people

      SimonClinch, take a closer look at the actual original strings. The second two strings match multiple substrings, so the desired output actually shows this, but in a sort-of-confusing way. Take this example:
      STRING ------- GCGCTCGACGC SUBSTRINGS ---------- GCGC ACG == [GCGC]TC[ACG]C
      But, when we have OVERLAPPING sequences the output should 'mash-up' a bit:
      STRING ------- GCGCTCGACGC SUBSTRINGS ---------- GCGC GCTC == [GCGCTC]GACGC
      Do you see how GCGC AND GCTC MERGE into one single substring for the desired output?

      So I think the algorithm should look like this:

    • Make as many straight matches as you can
    • If your match is within a string that has already been matched, modify that match to include the new match

      Can you imagine how messy this would look if you had 100 substrings and a main string running 10,000 letters long (which I assume is possible because this stuff looks like gene sequence data)?

      Celebrate Intellectual Diversity

Re: Bracketing Substring(s) in the String
by thundergnat (Deacon) on Aug 25, 2005 at 16:32 UTC

    Rather than inserting the brackets as the substrings are found, you could keep track of the indicies of the substrings, filter out interior indicies for overlapping substrings, then insert the brackets at the remaining indicies.

    Lessee... somthing like...

    #!/usr/bin/perl -w use strict; my $s1 ='CCCATCTGTCCTTATTTGCTG'; my @a1 = qw(ATCTG ATTTG); my $s2 ='ACCCATCTGTCCTTGGCCAT'; my @a2 = qw(CCATC); my $s3 ='CCACCAGCACCTGTC'; my @a3 = qw(CCACC CCAGC GCACC); my $s4 ='CCCAACACCTGCTGCCT'; my @a4 = qw(CCAAC ACACC); put_bracket($s1,\@a1); put_bracket($s2,\@a2); put_bracket($s3,\@a3); put_bracket($s4,\@a4); sub put_bracket { my ($str,$ar) = @_; my $slen = length $ar->[0]; my @brackets; foreach my $subs ( @$ar ) { my $idx = index($str,$subs); my $bgn = $idx; my $end = $idx + $slen + 1; push @brackets, ($bgn, $end); } my @filtered = $brackets[0]; for (1..$#brackets-1) { push @filtered, $brackets[$_] if ($brackets[$_] < $brackets[$_+1] and $brackets[$_-1] < $brackets[$_]); } push @filtered, $brackets[-1]; while (@filtered) { substr($str, pop @filtered, 0, ']'); substr($str, pop @filtered, 0, '['); } print $str,"\n"; return; }

    Edit: Note that the above could return some incorrect brackets if all of the substrings are not the same length. It is probably moot though, as I see you assume that all of the substrings are the same length in your original example

Re: Bracketing Substring(s) in the String
by sgifford (Prior) on Aug 25, 2005 at 16:48 UTC
    The problem is that once you've added the brackets, index doesn't match anymore. Regular expressions can deal with this more easily. This code allows optional brackets between letters to find the substrings, then it "de-nests" the brackets to get the results you give above.
    sub put_bracket { my ($str,$ar) = @_; foreach my $subs ( @$ar ) { # Construct a regexp with [\[\]] between all the letters my $newsub = join('[\[\]]?',split(//,$subs)); $str =~ s/($newsub)/[$1]/g; } # Now de-nest the brackets in the string my $depth = 0; my $newstr = ''; foreach my $c (split(//,$str)) { if ($c eq "\[") { $newstr .= $c if ($depth++ == 0); } elsif ($c eq "\]") { $newstr .= $c if (--$depth == 0); } else { $newstr .= $c; } } print "$newstr\n"; return ; }
      Dear sgifford,
      First of all I want to apologize for having have to come back to you to ask this question after some time.

      Since your solution above is so important to me, I need to turn to you for this.
      I truly don't know how to go about it. I hope you won't mind.

      Your code above provide 99% correct solutions, except the following case.
      Example 1:
      #Given: my $s5 ='CTGGGTATGGGT'; my @a5 = qw(GTATG TGGGT);
      Your code above returns
      C[TGGGTATGGGT]
      Instead of this the correct one:
      CTGG[GTATGGGT]
      The explanation is as follows TGGGT occur twice in $s5.
      $s5 = "CTGGGTATGGGT"; TGGGT GTATG -- |--- Only this two satisfy. TGGGT -- Since it follows order and delim of the given array.
      Now why the latter is the correct answer. It is because in the array @a5 = qw(GTATG TGGGT), the string "TGGGT" comes after "GTATG", thus the bracketed region should also follow the order of the given array and the span delimited also by the array. By that I mean, the bracketed regions -- be it disjointed or overlapped -- should always start with first element of the array and end with the last element of the array.

      Let me give another examples, hope it clarifies.
      I would also need to state that the size length of the string in the array is always fixed. In our examples they are always of length 5.

      Is there a way I can modify your code above so that it can handle such case? Hope to hear from you again. I'll try not to bother you again after this.

      Update 2 : I think I've got the solution. Thanks so much sgifford, sorry for the trouble.

      Regards,
      Edward
Re: Bracketing Substring(s) in the String
by sapnac (Beadle) on Aug 25, 2005 at 17:36 UTC
    map{$s1 =~ s/$_/[$_]/g} @a1; map{$s2 =~ s/$_/[$_]/g} @a2; map{$s3 =~ s/$_/[$_]/g} @a3; map{$s4 =~ s/$_/[$_]/g} @a4; print join("\n",$s1,$s2,$s3,$s4)."\n"; gives; CCC[ATCTG]TCCTT[ATTTG]CTG AC[CCATC]TGTCCTTGGCCAT [CCACC]A[GCACC]TGTC C[CCAAC]ACCTGCTGCCT Is this what you want? If not, Please give what is expected to happen +for the last 2 strings. Hope this helps!
Re: Bracketing Substring(s) in the String
by robertlandrum (Initiate) on Aug 25, 2005 at 22:11 UTC
    Not exactly superman code, but it does what you want.
    sub put_bracket { my $s = shift; my @m = @_; my @indexes = (); for my $match (@m) { my $lm = 0; while(1) { my $i = index($s,$match,$lm); last if($i eq "-1"); push(@indexes,$i..($i+length($match)-1)); $lm = $i+1; } } my %tmp = map { $_ => 1 } @indexes; my @s = sort {$a <=> $b} keys %tmp; my $n = ''; my @chars = split(//,$s); my $marked = 0; for my $i (0..$#chars) { if($marked && $i == $s[0]) { shift @s; } elsif($i == $s[0]) { $n .= "["; shift @s; $marked = 1; } elsif($marked) { $n .= "]"; $marked = 0; } $n .= $chars[$i]; } return $n; }
Re: Bracketing Substring(s) in the String
by Anonymous Monk on Aug 25, 2005 at 22:12 UTC
    $ perl -le' my @x = ( [ q/CCCATCTGTCCTTATTTGCTG/, [ qw(ATCTG ATTTG) ] ], [ q/ACCCATCTGTCCTTGGCCAT/, [ qw(CCATC) ] ], [ q/CCACCAGCACCTGTC/, [ qw(CCACC CCAGC GCACC) ] ], [ q/CCCAACACCTGCTGCCT/, [ qw(CCAAC ACACC) ] ], ); for ( @x ) { my $str = $_->[0]; for my $subs ( @{ $_->[1] } ) { ( $regex = $subs ) =~ s/(.)(?=.)/$1\[][]?/g; $str =~ s{($regex)}{ ( $1 =~ /]/ ? "" : "[" ) . "$subs]" }eg; } print $str; } ' CCC[ATCTG]TCCTT[ATTTG]CTG AC[CCATC]TGTCCTTGGCCAT [CCACCAGCACC]TGTC C[CCAACACC]TGCTGCCT
Re: Bracketing Substring(s) in the String
by fizbin (Chaplain) on Aug 26, 2005 at 16:07 UTC
    The code posted so far has relied on doing one substitution at a time, modifying substitutions so that they can handle internal brackets.

    I'd like to propose a different approach, but it relies on one feature of the problem that you haven't made explicit: that all the strings you're dealing with, and all the substrings, are uppercase letters only. If so, then what we can do is make each match case-insensitive and lowercase any region that's found to match. Then, simply place brackets at uppercase/lowercase boundaries, and we're set:

    sub put_bracket { my ($str,$ar) = @_; foreach my $subs ( @$ar ) { my $lsub = lc($subs); s/$lsub/$lsub/ig; } # add brackets $str =~ s/(?:(?<=[A-Z])|^)(?=[a-z])/[/g; $str =~ s/(?<=[a-z])(?:(?=[A-Z])|$)/]/g; # re-uppercase $str = uc($str); print "$str\n"; return ; }
    And this passes all four of your test cases.

    But wait! Here's a test case that this code - and all the other code posted to this thread that I've tried - fails on:

    my $s5 ='CCACCACCACCTGTC'; my @a5 = qw(CCACC); put_bracket($s5,\@a5); # should be [CCACCACCACC]TGTC
    Oh dear, we didn't account for the case when the same substring overlaps itself. How are we going to handle this? Fortunately, it's not impossible. What we need to have happen is basically "repeat the substitution, but don't change anything already all in lowercase, until no changes are made". Fortunately, perl's got a syntax for that:
    sub put_bracket { my ($str,$ar) = @_; foreach my $subs ( @$ar ) { my $lsub = lc($subs); do {} while ($str =~ s/(?!$lsub)(?i:$lsub)/$lsub/g); } # add brackets $str =~ s/(?:(?<=[A-Z])|^)(?=[a-z])/[/g; $str =~ s/(?<=[a-z])(?:(?=[A-Z])|$)/]/g; # re-uppercase $str = uc($str); print "$str\n"; return ; }
    That new line says:

    • Repeat as many times as successful:
      • Look for a spot not followed exactly by $lsub, followed by $lsub in some case. (in other words, it finds $lsub not all in lowercase)
      • Replace this by (all lowercase) $lsub.
    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://486585]
Approved by davido
Front-paged by injunjoel
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (5)
As of 2014-10-21 00:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (93 votes), past polls