"Inverse" Pattern Matching

hdb has asked for the wisdom of the Perl Monks concerning the following question:

I am looking for a pattern in a long string where the long string contains some wildcards. Example:

Long string:1234561234123x561234123x61234x34561234
            ^         ^  |      ^  |\    ^\
Pattern:    123456    123456    123456   123456
[download]

The 'x' characters in the long string stands for one or two arbitrary characters in the pattern (ie ..? in regex language. The carets ^ indicate where the pattern would match: in the first place literally, in the second 'x' stands for '4', in the third 'x' is for '45' and in the last 'x' is for '12'.

I could loop over the long string, replace 'x' with '..?' and then match '123456' against the resulting pattern (actually several patterns as the length varies depending on '..?' matching one or two characters) but this would be time consuming for large strings (~millions of characters) and a shame not to fully utilize Perl's matching capabilities.

Is there a way to utilize Perl's regex engine in such a way that the wildcards are in the string rather than the pattern?

Remark: '123456' really are arbitrary characters, no order is implied.

Comment on "Inverse" Pattern Matching Select or Download Code

Replies are listed 'Best First'.
Re: "Inverse" Pattern Matching by LanX (Saint) on Jul 04, 2013 at 20:54 UTC
we had a similar question not long ago: Using special characters in left part of a regex match? In short: It's not easy!!! I'm not sure how efficient it is but I might try to code all possibilities into a long or-clause `/(12345x\|1234x6\|123x56\|123x6\| ...)/` please note that from 5.10 on such expressions are efficiently trie-optimized. Another approach could be to match all "x" and surrounding n bytes to be further investigated... edit something like `/(1 [2-5]{0,4} x [2-5]{0,4} 6)/gx` in a while loop to narrow the set of possible matches. Cheers Rolf ( addicted to the Perl Programming Language)	[reply] [d/l] [select]
Re^2: "Inverse" Pattern Matching by hdb (Monsignor) on Jul 05, 2013 at 12:22 UTC
Thanks a lot. The long or-clause probably grows exponentially with the length of the pattern but if I combine it with your other proposal it might work.	[reply] [d/l]
Re: "Inverse" Pattern Matching by choroba (Cardinal) on Jul 04, 2013 at 22:32 UTC
What about this? I keep a list of possible positions, I delete non-matching positions recursively character by character. If I make the string much longer (`x 100_000`), it still runs only for 8 seconds on my box. #!/usr/bin/perl use warnings; use strict; use feature qw(say); # 1 2 3 4 # 0123456789012345678901234567890123456789012345 # \| \| \| \| \| \| our $string = '1234561234123x561234123x61234x3456123412345x1'; our $pattern = '123456'; our $pattern_length = length $pattern; our %results; search(0, 0, [0 .. length($string) - 1]); say for sort { $a <=> $b } keys %results; sub search { my ($p_pos, $s_pos, $positions) = @_; return if $p_pos > $pattern_length; return unless @$positions; if ($p_pos == $pattern_length) { undef @results{@$positions}; return; } my $char = substr $pattern, $p_pos, 1; search($p_pos + 1, $s_pos + 1, [ grep { my $ch = substr($string, $s_pos + $_, 1); $char eq $ch or 'x' eq $ch; } @$positions ]); search($p_pos + 2, $s_pos + 1, [ grep substr($string, $s_pos + $_, 1) eq 'x', @$positions ]); } [download] لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ	[reply] [d/l] [select]
Re^2: "Inverse" Pattern Matching by hdb (Monsignor) on Jul 05, 2013 at 12:21 UTC
Thanks a lot. I am still working to get a proper understanding but your proposal has already found a match in my example that I have overlooked!	[reply] [d/l]
Re: "Inverse" Pattern Matching by Anonymous Monk on Jul 04, 2013 at 20:49 UTC
How long are the pattern chunks, and could you scan the string piece by piece? For example, if you simplify your operation into searching for '123456' in a loop, then you could do a reverse regex and search for the string in the pattern (while changing 'x' to '..?')	[reply]
Re^2: "Inverse" Pattern Matching by hdb (Monsignor) on Jul 04, 2013 at 20:59 UTC
The length of the pattern can be hundreds. The main source of inefficiency when looping is IMHO that every 'x' constitutes a branch in the pattern. If there is only one 'x' it could stand for one character and one has to pick n-1 other characters to create a pattern of the same length as my short string. However, if 'x' stands for two characters one has to pick only n-2 characters. And if every 'x' stands for two choices, the one has 2**k variants if there are k 'x's.	[reply] [d/l]
Re: "Inverse" Pattern Matching by LanX (Saint) on Jul 05, 2013 at 12:18 UTC
just another idea: try to find all wildcards plus leading and following characters `/(.)x(.)/` so every real match must be of a form preceeded by `1..$1` and `$2..6` in this special case. Since you have arbitrary characters w/o real order you will need to precompile a look-up hash to find the patterns needed for $1 and $2. e.g. `12a34a56` would lead to `$pre{a}=qx/12a(34a)?/` and `$post{a}=qx/(a34)?a56/` or better `$post{a}=qx/(a34\|a34a56)/`! the check of the hashes can be done in a while loop against $PREMATCH and $POSTMATCH or in embeded perlcode with something like `(?{..})` No guaranties, I'm typing mobile and can't test ATM. But this should be quite efficient, even for cases where x is a wildcard for more than 2 characters. Cheers Rolf ( addicted to the Perl Programming Language)	[reply] [d/l] [select]


more useful options
	PerlMonks

"Inverse" Pattern Matching

edit