http://www.perlmonks.org?node_id=973632

ssc11008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I want to have a regular expression that can seek all possible version of my string. (including overlapping part of my sequences) My string is dynamic, user can define it. I have a list of strings to try my input string. for example: user defined strings could be AAA AAC ACA CAA etc my strings that I want to search these are AAAAA AAACACA CAACAAA return should be AAA count: 5 AAC count: 2 ACA count: 3 CAA count: 2 I only need the regular expression, If you help me, I'll realy appreciate Thanks

Replies are listed 'Best First'.
Re: Regex: finding all possible substrings
by davido (Cardinal) on May 31, 2012 at 23:32 UTC

    If I understand correctly, you want to look at a substring of three characters as position '0' in your haystack, and see if it matches some permutation of the characters in (A,A,C). If so, do something. Then advance your position to 1, and check again. This is possible with a regex, but I think that it reads clearer (and is probably no less efficient) if you use substr to check each position within your haystack against a hash holding all permutations of your "needle" characters:

    use Algorithm::Permute; my $needle_chars = 'AAC'; my $haystack = 'AAACACAA'; my $p = Algorithm::Permute->new( [ split //, $needle_chars ], 3 ); my %perms; while( my @perm = $p->next ) { $perms{ join '', @perm }++; } my $pos = 0; while( $pos + 3 < length $haystack ) { my $pos_chars = substr $haystack, $pos, 3; print $pos_chars, " found at $pos\n" if exists $perms{ $pos_chars }; $pos++; }

    I don't know how big your set of needle characters really is. Perhaps instead of three it's really 100 characters, in which case holding onto all the permutations is impractical. But if the number of characters you are permuting is not too big this is an efficient solution.


    Dave

      and see if it matches some permutation

      Of course, we're all kind of guessing here because that isn't exactly a well-written spec... but I think assuming he wants all permutations is a bit of a leap. I think it may also be a leap to assume he only wants to search for strings of a certain length. He just says that they are "user defined."

      I still think this is the right approach though, he just has to track the lengths of his needles and be sure to check substrings of each necessary length (so long as it doesn't go off the end of his haystack.)

      Something like this:

      #!/usr/bin/perl use warnings; use strict; my $haystack = shift; my %needles; undef @needles{@ARGV}; my @len = sort {$a<=>$b} keys %{{ map { (length,0) } keys %needles }}; my $pos = 0; my $hlen = length $haystack; while ($pos + $len[0] <= $hlen) { for my $L (@len) { last if $pos + $L > $hlen; my $substr = substr($haystack, $pos, $L); $needles{$substr}++ if exists $needles{$substr}; } $pos++; } use Data::Dumper; print Dumper \%needles;

      Update:

      $ ./973643.pl 'AAAAA AAACACA CAACAAA' AAA AAC ACA CAA $VAR1 = { 'AAC' => 2, 'ACA' => 3, 'CAA' => 2, 'AAA' => 5 };

      -sauoq
      "My two cents aren't worth a dime.";

        Of course, we're all kind of guessing here because that isn't exactly a well-written spec...

        That's a pretty good assessment, and I appreciate being granted a little latitude in my interpretation thereof. About the best I could do was guess, and then do a better job than the OP of documenting the criteria I came up with. :) When posts like this come up I have to make a decision whether to take a stab at trying to guess at a more refined specification, or to post a node seeking clarification (which often never comes), or to just let the question go and continue about the work that the question was distracting me from in the first place.

        Sometimes what tips the scales for me is if I find a little amusement in the diversion of coming up with a solution to the specification that I invented by venturing a guess. My hope is that it's also helpful.


        Dave

Re: Regex: finding all possible substrings
by kennethk (Abbot) on May 31, 2012 at 23:00 UTC

    What have you tried? What didn't work? We are not a code writing service.

    If you need to generate permutations, you can do that pretty easily with glob. That combines with regular expressions pretty easily - have you read perlretut? Also, for your example, I can't get your count of 5 for AAA -- I see either 4 or 6 depending on how you count overlaps.

    Update: Stupid failing eyes. I counted two AAAs in the third word.

    Node text goes above. Div tags should contain sig only

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      I can't get your count of 5 for AAA

      Assuming his spaces are significant, there are 5.

      But there are 3 "ACA"s. And that's what he said. My bad.

      -sauoq
      "My two cents aren't worth a dime.";
Re: Regex: finding all possible substrings
by AnomalousMonk (Archbishop) on Jun 01, 2012 at 01:10 UTC

    Based on my interpretation of the OPed quasi-spec:

    >perl -wMstrict -le "my $valid_string = qr{ \A [ACGT]+ \z }xms; ;; for my $user_subseq (qw(AAA AAC ACA CAA ABC)) { validate($user_subseq, $valid_string); my $n = 0; for my $string (qw(AAAAA AAACACA CAACAAA)) { $n +=()= $string =~ m{ (?= ($user_subseq)) }xmsg; } print qq{in all strings, '$user_subseq' = $n}; } ;; sub validate { my ($user_supplied_string, $valid) = @_; return if $user_supplied_string =~ $valid; die qq{bad user supplied string '$user_supplied_string'} } " in all strings, 'AAA' = 5 in all strings, 'AAC' = 2 in all strings, 'ACA' = 3 in all strings, 'CAA' = 2 bad user supplied string 'ABC' at -e line 1.

    Update:

    I only need the regular expression...

    TheA trick for matching overlapping patterns with a regex is in the capturing group nested within the positive look-ahead assertion of
     (?= ($user_subseq))

Re: Regex: finding all possible substrings
by sauoq (Abbot) on May 31, 2012 at 23:03 UTC

    A regular expression probably isn't the best way to do what you want... certainly not just one, at any rate.

    You are probably better iterating through the string you are searching, grabbing substrings the length of the strings you are searching for, and looking them up in a hash table.

    Or something similar.

    -sauoq
    "My two cents aren't worth a dime.";
Re: Regex: finding all possible substrings
by Kenosis (Priest) on Jun 01, 2012 at 01:50 UTC

    AnomalousMonk: I enjoy your regexes! Anyway, here's another solution (am reading "all possible substrings" as "all possible substring matches"):

    use Modern::Perl; my @userDefined = qw(AAA AAC ACA CAA); my $string = 'AAAAA AAACACA CAACAAA'; foreach my $element (@userDefined) { my ( $found, $elemLen ) = ( 0, length $element ); for ( my $i = 0 ; $i < length($string) - 2 ; $i++ ) { $found++ if substr( $string, $i, $elemLen ) eq $element; } say "$element: $found"; }

    Output:

    AAA: 5 AAC: 2 ACA: 3 CAA: 2
Re: Regex: finding all possible substrings
by tobyink (Canon) on Jun 01, 2012 at 14:48 UTC
    use strict; use Data::Dumper; use List::MapMulti 'iterator_multi'; my @userDefined = qw(AAA AAC ACA CAA); my $string = 'AAAAA AAACACA CAACAAA'; my %results = map { $_ => 0; } @userDefined; my $iter = iterator_multi [0..length($string)-1], [1..length($string)] +; while (my ($start, $end) = $iter->()) { next unless $end > $start; my $substr = substr $string, $start, ($end - $start); $results{$substr}++ if exists $results{$substr}; } print Dumper \%results;
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Regex: finding all possible substrings
by AnomalousMonk (Archbishop) on Jun 01, 2012 at 16:36 UTC

    Further to Re: Regex: finding all possible substrings: The approach shown in that reply will not handle overlapping matches of sub-strings that themselves contain other sub-strings, e.g., 'AAAA' containing 'AAA'.

    Here's another approach that will handle such cases. It uses the  (*FAIL) verb, one of the Special Backtracking Control Verbs of 5.10+. Whether a pure-regex approach is faster than a (nested) loop/substring comparison approach is another question; Benchmark-ing alone will tell the tale.

    >perl -wMstrict -le "my ($alts) = map qr{ $_ }xms, join q{ | }, qw(AAA AAC AAAAA AAAA ACA CCC CAA) ; print $alts; ;; my @strings = \qw(AAAAAT AAAACACA CAACAAA); my %counts = count_em($alts, @strings); ;; print qq{for count_em}; for my $k (sort keys %counts) { print qq{ in all strings: '$k' = $counts{$k}}; } ;; sub count_em { my $alts = shift; ;; local our %count; use re 'eval'; ${$_} =~ m{ ($alts) (?{ ++$count{$^N} }) (*FAIL) }xmsg for @_; return %count; } " (?^msx: AAA | AAC | AAAAA | AAAA | ACA | CCC | CAA ) for count_em in all strings: 'AAA' = 6 in all strings: 'AAAA' = 3 in all strings: 'AAAAA' = 1 in all strings: 'AAC' = 2 in all strings: 'ACA' = 3 in all strings: 'CAA' = 2