Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

Re^3: Find Prefix if regex didn't match

by Anonymous Monk
on Oct 31, 2012 at 12:59 UTC ( #1001657=note: print w/replies, xml ) Need Help??

in reply to Re^2: Find Prefix if regex didn't match
in thread Find Prefix if regex didn't match

pos() doesn't work because if the string only contains a prefix of the given expression (what will be the hot case I'm looking for) I will get undef and not the position I need or do I overlook sth. here?

Once again in english, please?

Any alternative suggestion?

Not really. To remove a prefix requires a regex match. And then you do real matching. I doubt there is any savings to be had by matching twice ... or actually cutting the string, even with pos

my $search="AB.*Z"; my $string="WWWA"; my $search_prefix = $1 if $search =~ /^(\w+)/g; warn $search_prefix; my $prefix_offset = index ( $string, $search_prefix ); substr $string, 0, $prefix_offset , ''; warn $string; $string = "WWWADBBBABC"; $prefix_offset = index ( $string, $search_prefix );; warn $string; substr $string, 0, $prefix_offset , ''; warn $string; $string = "WWWADBBBABC"; pos( $string ) = $prefix_offset; warn pos( $string ); ## next match m//g starts at offset __END__ AB at jank line 5. A at jank line 8. WWWADBBBABC at jank line 11. ABC at jank line 13. 8 at jank line 16.

For some idea why I think so , maybe , see Why does global match run faster than none global?, Multiple Regex evaluations or one big one?

Replies are listed 'Best First'.
Re^4: Find Prefix if regex didn't match
by space_monk (Chaplain) on Oct 31, 2012 at 13:27 UTC

    I do not believe you are gaining anything by the strategy you suggest in your problem.

    The only part of the text you can safely throw away is any part which does not match any leading "fixed" characters in the regex, less the length of the "fixed" character string.

    For example, looking for AB.*Z will only be able to (eat) throw away text until it encounters the first AB in the text, as from then on greedy matching means it must acquire all text until it encounters a Z, so even if the next Z is several million characters from the AB, the program must keep all of it, and run the search from that AB.

    In summary, if you are finding searches slow, then you should perhaps be looking at doing the search less often, perhaps as a scheduled task or when the text grows by a set amount.

      Exactly this is what I'm trying to do, "throwing away text until it encounters the first AB...".

      Reason is that I have a timeout up to when the searched string should be found otherwise it's an error. Reaction time should be as short as possible so I have to scan as often as I receive sth. therefore doing the search less often will not work (see the short example I posted I do the search only as often as absolutely necessary but not more) and searching after the text has been grown by a certain amount of characters is also not practicable because it can happen that I receive a short package containing the expression but the necessary receive size has not been reached yet. So another timeout would be necessary for such cases what enlarges the reaction time more than necessary.

      In most cases the search string I'm looking for is only contained once and most of the time the scanned text even doesn't contain a prefix of it but it's possible that it takes two TCP packages to receive the text (e.g. I'm looking for "ABC" and receive "XXA" in the first and "BCYYY" in the second package) and it's necessary that I don't miss any pattern so the only optimization possibility I see is cutting the "head" away. Given benchmark shows me that this works! That is because the whole text can become quite large and can be received within several packages, isn't it`?

Re^4: Find Prefix if regex didn't match
by demoralizer (Sexton) on Oct 31, 2012 at 14:05 UTC
    aah now I got your idea, not bad but that's too much simplified ;)

    Extracting a \w+ prefix from the expression e.g. doesn't work with stuff like this:
    my $search="(AB)+.*Z";

    The problem is that the search string is given and therefore I have no influence on it. Maybe you have been irritated by my ".*ABC" example but what I ment here was that in such a case there is no unmatchable prefix that can be cut away.
      forget about cutting, cutting will never speed anything up
        In my case I see that cutting works...

        May be some more explanation is necessary: I'm reading (non-blocking) to a socket where a process running on another machine sends logging stuff to me. Sometimes I get many single characters sometimes I get large blocks. I don't know when I will receive the next package. A user can give me a regular expression I have to watch for and a timeout value. If I can match I return immediately if not I return after given timeout with an error.

        Using expressions like "ABC" works quite fast and don't cause any problems with the timeout but not expressions like "AB.*Z". They only work as long as I get a few characters wihtin a few packages but not with thousands of them.

        If I check the received string length and cut it after e.g. it gets larger than 20 characters I have no timeout problems any more and eth. works fine.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1001657]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (6)
As of 2018-01-21 08:53 GMT
Find Nodes?
    Voting Booth?
    How did you see in the new year?

    Results (227 votes). Check out past polls.