Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Growing strings in search

by b4swine (Pilgrim)
on Apr 15, 2020 at 13:15 UTC ( #11115569=perlquestion: print w/replies, xml ) Need Help??

b4swine has asked for the wisdom of the Perl Monks concerning the following question:

Dear Wise Ones,

I have a fixed complex regex. And I have hundreds of strings (at any one time) which are growing at the rate of about 50 chars per second. Around a few thousand character size, strings die, but others are reborn. I need to be continuously be checking the strings, all against the same regex, and raise an alarm if the regex succeeds against any string.

The problem is that after adding just one character, I need to start the whole search again. Is there any way to incrementally continue the search from where it last left off? This is a real-time problem, and checking after every 5 characters is not acceptable.

Replies are listed 'Best First'.
Re: Growing strings in search
by choroba (Archbishop) on Apr 15, 2020 at 13:56 UTC
    By "growing", do you mean you have variables that contain the strings and their content gets longer over time? Or do you need to read the strings from outside (a file, socket)?

    The answer also depends on the regex. For a simple regex, just starting where you left off might be enough (e.g. qr/X/), but even for a bit more complex regex qr/abc/, you can't just continue with the newly added characters, as the original string might have end in "ab" and the new character might be "c" that completes the match. For even more complex regexes like qr/a.*b/ you might need to return unknown number of characters to the left (i.e. to the last "a").

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      The strings are growing because data is coming in from the outside.

      The regexes are so complicated that they are software generated.

      The only other possibility seems to be to try to understand the regex and see if it can be expressed as something simpler, but that sounds too horrible to contemplate.

Re: Growing strings in search
by Corion (Pope) on Apr 15, 2020 at 19:31 UTC

    This sounds to me like a network packet scanner.

    If your problem is really like scanning network packets with regular expressions, you might want to take a look at Intels Hyperscan regex engine. It is not as powerful as Perls regular expressions, but it is geared towards matching lots of strings against lots of regular expressions in parallel.

Re: Growing strings in search
by haukex (Bishop) on Apr 15, 2020 at 15:31 UTC

    Is there any chance you can provide an SSCCE? In theory, it's possible to both check and change the regex engine's position via pos, but as choroba said, in lots of cases you'd actually have to backtrack an unknown amount, and that may not be feasible with pos - depending on the specific case.

Re: Growing strings in search
by roboticus (Chancellor) on Apr 15, 2020 at 18:44 UTC

    b4swine:

    You can't do it just using the perl regex engine. You might be able to turn your regex into a state machine, though, and then you could your strings through it character by character to find which one(s) match and which one(s) fail.

    Building a state machine from a regex isn't *terribly* difficult, but there are a couple sticking points. The main problem is that you'll have a tradeoff of speed vs. memory consumption. For example, if your regex has branches and multiple capture groups in it, then you may have a *lot* of bookkeeping to track possible starts/stops of capture groups. You may be able to considerably simplify things by making your state machine simpler so it can recognize when a match happens but without tracking your capture groups, and then you can turn the normal perl regex engine loose on the full string to get your capture groups or do a more refined match.

    I got distracted by your question and put together a simplified demo of the technique for fun. I'm still tuning it a little bit and I'll post it as a secondary reply later when I'm happy enough with it. If you post the regex you're using, I can tweak my demo accordingly.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      A particularly good tool for this is Ragel http://www.colm.net/open-source/ragel/

      It compiles character-fed state machines and is a more modern and flexible design than lex or flex.

      Sounds like you are describing "flex" or "lex". That makes this problem simple, just run a "flex" process against each individual string doing incremental reads -> problem solved :)

        tybalt89:

        Yes, it's very *much* like that.

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        I've never done anything significant with f?lex and it's been a long, long time since I've even read anything about its/their capabilities, but don't they have the same inherent "time-forward", if you will, orientation that regex compilers have?

        IOW, if you had a flex process that parsed backwards, it would be easy to use, but isn't the big trick to design such a process in the first place? Can you expand on the capabilities of flex in this regard?


        Give a man a fish:  <%-{-{-{-<

Re: Growing strings in search
by AnomalousMonk (Bishop) on Apr 15, 2020 at 19:27 UTC

    If one could run the match engine "backwards" and match a string in reverse, i.e., from end to beginning, that might solve your problem. Unfortunately, there's no  m//r modifier that I'm aware of.

    One might do a match against a reversed string:

    c:\@Work\Perl\monks>perl -wMstrict -le "my $s = ''; my $r = ''; my @add = qw(a b c X d e Y f g h); ;; my $xeger = qr{ \A Y .* X }xms; ;; while (@add){ print qq{'$s' }, $r =~ $xeger ? 'MATCH' : \"no match\"; $r = qq{$add[0]$r}; $s .= shift @add; } " '' no match 'a' no match 'ab' no match 'abc' no match 'abcX' no match 'abcXd' no match 'abcXde' no match 'abcXdeY' MATCH 'abcXdeYf' no match 'abcXdeYfg' no match
    Incrementally building and maintaining a reversed string for each "forward" string might not be too expensive even for hundreds of strings of thousands of characters. Unfortunately, you now have the problem of finding some way to persuade the "software" to generate an arbitrary regex backwards, so that, e.g.,  X .* Y becomes  Y .* X (but that's just a small matter of programming, right? :).

    BTW: Is it possible for you to suppress re-compilation of the regex on every match, perhaps with the  m//o modifier? That might speed up matching in general and at least alleviate your problem.


    Give a man a fish:  <%-{-{-{-<

      One precompiles an expression using qr// to avoid repeated compilations later.
      my $RE = qr/foo.+?bar/; foreach (@str) { warn "Huzzah! $str" if m/$RE/; }

      --
      In Bob We Trust, All Others Bring Data.

        What I had in mind with my comment about the  m//o modifier is that b4swine might be receiving a complex regex and then interpolating it a la  m/...$re.../ during actual matching (as, indeed, you show in your example). This might be done if some additional match condition(s) were to be imposed on every received regex, e.g., adding a  \z anchor at the end to force end-of-string matching. Of course, I don't know that anything like this is actually happening; it was just a thought.

        For regexes interpolated into  m// matches, use of the  /o modifier seems beneficial in some cases, apparently depending on Perl version. (I know that work is being done continuously to refine and optimize regex matching.) For instance, under two older Perl versions that I have access to ATM (ActiveState 5.8.9 and Strawberry 5.14.4.1), benchmarking shows a significant benefit in the newer Perl for using  /o when interpolating into an  m// match (assuming this is a valid benchmark, and benchmarks can be tricky :).

        use strict; use warnings; use Benchmark qw(cmpthese); print "perl version $] \n"; my $rx = qr{X}xms; cmpthese(-1, { 'qr_bound' => sub { 'Y' =~ $rx ; }, 'qr_interp' => sub { 'Y' =~ / $rx / ; }, 'qr_interp_o' => sub { 'Y' =~ / $rx /o; }, 'm_empty' => sub { 'Y' =~ //; }, 'm_empty_o' => sub { 'Y' =~ //o; }, });
        Output:
        c:\@Work\Perl\monks\belg4mit>perl cmp_qr_compilation_1.pl perl version 5.008009 Rate qr_interp qr_interp_o qr_bound m_empty + m_empty_o qr_interp 2880477/s -- -9% -43% -55% + -67% qr_interp_o 3181791/s 10% -- -37% -51% + -64% qr_bound 5078627/s 76% 60% -- -21% + -42% m_empty 6452035/s 124% 103% 27% -- + -26% m_empty_o 8775008/s 205% 176% 73% 36% + -- c:\@Work\Perl\monks\belg4mit>perl cmp_qr_compilation_1.pl perl version 5.014004 Rate qr_bound qr_interp m_empty m_empty_o q +r_interp_o qr_bound 1004609/s -- -70% -76% -80% + -92% qr_interp 3378700/s 236% -- -20% -33% + -74% m_empty 4237856/s 322% 25% -- -16% + -68% m_empty_o 5068698/s 405% 50% 20% -- + -61% qr_interp_o 13135439/s 1208% 289% 210% 159% + --
        The  qr// comparisons are between failing matches because that's what I imagine would happen most often in the actual application. I just threw in some successful  // null matches out of curiosity.


        Give a man a fish:  <%-{-{-{-<

Re: Growing strings in search
by roboticus (Chancellor) on Apr 16, 2020 at 19:33 UTC

    b4swine:

    Here's the demo I put together. It still has shortcomings, but I got tired of playing with it, so I thought I'd post it here in case it would be interesting to you or another perlmonk at some time. It has a few shortcomings due to me wanting to keep it a short demo, and I'm not happy about how the abstractions are factored, but I've got other things I want to do, so I'm leaving it as-is for you or anyone to play with. It does, however, demonstrate what I described in my earlier note.

    Code:

    Sample output:

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11115569]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (5)
As of 2020-10-21 21:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (223 votes). Check out past polls.

    Notices?