Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

"Inverse" Pattern Matching

by hdb (Parson)
on Jul 04, 2013 at 20:38 UTC ( #1042519=perlquestion: print w/ replies, xml ) Need Help??
hdb has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for a pattern in a long string where the long string contains some wildcards. Example:

Long string:1234561234123x561234123x61234x34561234 ^ ^ | ^ |\ ^\ Pattern: 123456 123456 123456 123456

The 'x' characters in the long string stands for one or two arbitrary characters in the pattern (ie ..? in regex language. The carets ^ indicate where the pattern would match: in the first place literally, in the second 'x' stands for '4', in the third 'x' is for '45' and in the last 'x' is for '12'.

I could loop over the long string, replace 'x' with '..?' and then match '123456' against the resulting pattern (actually several patterns as the length varies depending on '..?' matching one or two characters) but this would be time consuming for large strings (~millions of characters) and a shame not to fully utilize Perl's matching capabilities.

Is there a way to utilize Perl's regex engine in such a way that the wildcards are in the string rather than the pattern?

Remark: '123456' really are arbitrary characters, no order is implied.

Comment on "Inverse" Pattern Matching
Select or Download Code
Re: "Inverse" Pattern Matching
by Anonymous Monk on Jul 04, 2013 at 20:49 UTC

    How long are the pattern chunks, and could you scan the string piece by piece?

    For example, if you simplify your operation into searching for '123456' in a loop, then you could do a reverse regex and search for the string in the pattern (while changing 'x' to '..?')

      The length of the pattern can be hundreds. The main source of inefficiency when looping is IMHO that every 'x' constitutes a branch in the pattern. If there is only one 'x' it could stand for one character and one has to pick n-1 other characters to create a pattern of the same length as my short string. However, if 'x' stands for two characters one has to pick only n-2 characters. And if every 'x' stands for two choices, the one has 2**k variants if there are k 'x's.

Re: "Inverse" Pattern Matching
by LanX (Canon) on Jul 04, 2013 at 20:54 UTC
    we had a similar question not long ago: Using special characters in left part of a regex match?

    In short: It's not easy!!!

    I'm not sure how efficient it is but I might try to code all possibilities into a long or-clause

    /(12345x|1234x6|123x56|123x6| ...)/

    please note that from 5.10 on such expressions are efficiently trie-optimized.

    Another approach could be to match all "x" and surrounding n bytes to be further investigated...


    something like /(1 [2-5]{0,4} x [2-5]{0,4} 6)/gx in a while loop to narrow the set of possible matches.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Thanks a lot. The long or-clause probably grows exponentially with the length of the pattern but if I combine it with your other proposal it might work.

Re: "Inverse" Pattern Matching
by choroba (Abbot) on Jul 04, 2013 at 22:32 UTC
    What about this? I keep a list of possible positions, I delete non-matching positions recursively character by character. If I make the string much longer (x 100_000), it still runs only for 8 seconds on my box.
    #!/usr/bin/perl use warnings; use strict; use feature qw(say); # 1 2 3 4 # 0123456789012345678901234567890123456789012345 # | | | | | | our $string = '1234561234123x561234123x61234x3456123412345x1'; our $pattern = '123456'; our $pattern_length = length $pattern; our %results; search(0, 0, [0 .. length($string) - 1]); say for sort { $a <=> $b } keys %results; sub search { my ($p_pos, $s_pos, $positions) = @_; return if $p_pos > $pattern_length; return unless @$positions; if ($p_pos == $pattern_length) { undef @results{@$positions}; return; } my $char = substr $pattern, $p_pos, 1; search($p_pos + 1, $s_pos + 1, [ grep { my $ch = substr($string, $s_pos + $_, 1); $char eq $ch or 'x' eq $ch; } @$positions ]); search($p_pos + 2, $s_pos + 1, [ grep substr($string, $s_pos + $_, 1) eq 'x', @$positions ]); }
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thanks a lot. I am still working to get a proper understanding but your proposal has already found a match in my example that I have overlooked!

Re: "Inverse" Pattern Matching
by LanX (Canon) on Jul 05, 2013 at 12:18 UTC
    just another idea:

    try to find all wildcards plus leading and following characters


    so every real match must be of a form preceeded by 1..$1 and $2..6 in this special case.

    Since you have arbitrary characters w/o real order you will need to precompile a look-up hash to find the patterns needed for $1 and $2.

    e.g. 12a34a56 would lead to $pre{a}=qx/12a(34a)?/ and $post{a}=qx/(a34)?a56/ or better $post{a}=qx/(a34|a34a56)/!

    the check of the hashes can be done in a while loop against $PREMATCH and $POSTMATCH or in embeded perlcode with something like (?{..})

    No guaranties, I'm typing mobile and can't test ATM.

    But this should be quite efficient, even for cases where x is a wildcard for more than 2 characters.

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1042519]
Approved by Perlbotics
Front-paged by BrowserUk
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (5)
As of 2014-08-29 11:12 GMT
Find Nodes?
    Voting Booth?

    The best computer themed movie is:

    Results (280 votes), past polls