in reply to Regex: finding all possible substrings

If I understand correctly, you want to look at a substring of three characters as position '0' in your haystack, and see if it matches some permutation of the characters in (A,A,C). If so, do something. Then advance your position to 1, and check again. This is possible with a regex, but I think that it reads clearer (and is probably no less efficient) if you use substr to check each position within your haystack against a hash holding all permutations of your "needle" characters:

use Algorithm::Permute; my $needle_chars = 'AAC'; my $haystack = 'AAACACAA'; my $p = Algorithm::Permute->new( [ split //, $needle_chars ], 3 ); my %perms; while( my @perm = $p->next ) { $perms{ join '', @perm }++; } my $pos = 0; while( $pos + 3 < length $haystack ) { my $pos_chars = substr $haystack, $pos, 3; print $pos_chars, " found at $pos\n" if exists $perms{ $pos_chars }; $pos++; }

I don't know how big your set of needle characters really is. Perhaps instead of three it's really 100 characters, in which case holding onto all the permutations is impractical. But if the number of characters you are permuting is not too big this is an efficient solution.


Dave

Replies are listed 'Best First'.
Re^2: Regex: finding all possible substrings
by sauoq (Abbot) on Jun 01, 2012 at 00:01 UTC
    and see if it matches some permutation

    Of course, we're all kind of guessing here because that isn't exactly a well-written spec... but I think assuming he wants all permutations is a bit of a leap. I think it may also be a leap to assume he only wants to search for strings of a certain length. He just says that they are "user defined."

    I still think this is the right approach though, he just has to track the lengths of his needles and be sure to check substrings of each necessary length (so long as it doesn't go off the end of his haystack.)

    Something like this:

    #!/usr/bin/perl use warnings; use strict; my $haystack = shift; my %needles; undef @needles{@ARGV}; my @len = sort {$a<=>$b} keys %{{ map { (length,0) } keys %needles }}; my $pos = 0; my $hlen = length $haystack; while ($pos + $len[0] <= $hlen) { for my $L (@len) { last if $pos + $L > $hlen; my $substr = substr($haystack, $pos, $L); $needles{$substr}++ if exists $needles{$substr}; } $pos++; } use Data::Dumper; print Dumper \%needles;

    Update:

    $ ./973643.pl 'AAAAA AAACACA CAACAAA' AAA AAC ACA CAA $VAR1 = { 'AAC' => 2, 'ACA' => 3, 'CAA' => 2, 'AAA' => 5 };

    -sauoq
    "My two cents aren't worth a dime.";

      Of course, we're all kind of guessing here because that isn't exactly a well-written spec...

      That's a pretty good assessment, and I appreciate being granted a little latitude in my interpretation thereof. About the best I could do was guess, and then do a better job than the OP of documenting the criteria I came up with. :) When posts like this come up I have to make a decision whether to take a stab at trying to guess at a more refined specification, or to post a node seeking clarification (which often never comes), or to just let the question go and continue about the work that the question was distracting me from in the first place.

      Sometimes what tips the scales for me is if I find a little amusement in the diversion of coming up with a solution to the specification that I invented by venturing a guess. My hope is that it's also helpful.


      Dave

        or to just let the question go and continue about the work that the question was distrcting me from in the first place.

        Ha! . . . ++

        -sauoq
        "My two cents aren't worth a dime.";