PerlMonks  

Re^3: Find Prefix if regex didn't match

by Anonymous Monk
on Oct 31, 2012 at 12:59 UTC (#1001657=note)


in reply to Re^2: Find Prefix if regex didn't match
in thread Find Prefix if regex didn't match

pos() doesn't work, because if the string contains only a prefix of the given expression (which is the hot case I'm looking for), I get undef and not the position I need. Or am I overlooking something here?

Once again, in English, please?

Any alternative suggestion?

Not really. Removing a prefix requires a regex match, and after that you still do the real matching. I doubt there are any savings to be had from matching twice, or from actually cutting the string, even with pos.

my $search = "AB.*Z";
my $string = "WWWA";
my $search_prefix = $1 if $search =~ /^(\w+)/g;
warn $search_prefix;
my $prefix_offset = index( $string, $search_prefix );
substr $string, 0, $prefix_offset, '';
warn $string;
$string = "WWWADBBBABC";
$prefix_offset = index( $string, $search_prefix );
warn $string;
substr $string, 0, $prefix_offset, '';
warn $string;
$string = "WWWADBBBABC";
pos( $string ) = $prefix_offset;
warn pos( $string ); ## next match m//g starts at offset
__END__
AB at jank line 5.
A at jank line 8.
WWWADBBBABC at jank line 11.
ABC at jank line 13.
8 at jank line 16.

For some idea of why I think so, see Why does global match run faster than none global? and Multiple Regex evaluations or one big one?


Re^4: Find Prefix if regex didn't match
by space_monk (Chaplain) on Oct 31, 2012 at 13:27 UTC

    I do not believe you are gaining anything by the strategy you suggest in your problem.

    The only part of the text you can safely throw away is any part which does not match any leading "fixed" characters in the regex, less the length of the "fixed" character string.

    For example, a search for AB.*Z can only throw away text up to the first AB it encounters. From then on, greedy matching means it must keep all the text until it encounters a Z, so even if the next Z is several million characters after the AB, the program must keep all of it and run the search from that AB.
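As a rough sketch of the rule above, assuming the regex starts with a fixed string that is known separately (here "AB" for AB.*Z), the buffer could be trimmed as follows. The name trim_buffer is hypothetical and does not come from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: drop the part of $buf that can no longer take part
# in a match. Assumption: every match must begin with the fixed string
# $literal.
sub trim_buffer {
    my ($buf, $literal) = @_;

    my $at = index($buf, $literal);

    # Literal found: everything from that point on has to be kept.
    return substr($buf, $at) if $at >= 0;

    # Literal not found: only the last length($literal) - 1 characters
    # could still be completed into the literal by later input.
    my $keep = length($literal) - 1;
    return '' if $keep <= 0;
    return length($buf) > $keep ? substr($buf, -$keep) : $buf;
}

print trim_buffer("WWWA",        "AB"), "\n";   # prints "A"
print trim_buffer("WWWADBBBABC", "AB"), "\n";   # prints "ABC"
```

The tail that is kept is deliberately conservative: it may hold characters that cannot actually start the literal, but it never discards ones that can.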

    In summary, if you are finding searches slow, then you should perhaps be looking at doing the search less often, perhaps as a scheduled task or when the text grows by a set amount.

      This is exactly what I'm trying to do: "throwing away text until it encounters the first AB...".

      The reason is that I have a timeout within which the searched string must be found; otherwise it's an error. The reaction time should be as short as possible, so I have to scan every time I receive something. Doing the search less often therefore will not work (see the short example I posted: I search only as often as absolutely necessary). Searching only after the text has grown by a certain number of characters is also not practicable, because I might receive a short packet that contains the expression before the required size has been reached. Another timeout would then be needed for such cases, which lengthens the reaction time more than necessary.

      In most cases the string I'm looking for occurs only once, and most of the time the scanned text doesn't even contain a prefix of it. But the text may take two TCP packets to arrive (e.g. I'm looking for "ABC" and receive "XXA" in the first packet and "BCYYY" in the second), and I must not miss any pattern, so the only possible optimization I see is cutting the "head" away. The benchmark I posted shows me that this works! That is because the whole text can become quite large and can arrive across several packets, isn't it?
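The two-packet case described above can be simulated with a small sketch. It assumes the pattern has a known fixed leading string; the helper name feed and its parameters are hypothetical and only illustrate the idea of cutting the head while keeping a possible partial prefix:

```perl
use strict;
use warnings;

# Hypothetical helper: append $chunk to the buffer, try the match, and if
# it fails cut away the head that cannot start a match.
# Assumption: every match of $re must begin with the fixed string $literal.
sub feed {
    my ($buf_ref, $chunk, $re, $literal) = @_;

    $$buf_ref .= $chunk;
    return 1 if $$buf_ref =~ $re;

    my $at = index($$buf_ref, $literal);
    if ($at >= 0) {
        # Keep everything from the first occurrence of the literal on.
        substr($$buf_ref, 0, $at, '');
    }
    else {
        # Keep only a tail that could still grow into the literal.
        my $drop = length($$buf_ref) - (length($literal) - 1);
        substr($$buf_ref, 0, $drop, '') if $drop > 0;
    }
    return 0;
}

my $buf     = '';
my @results = map { feed(\$buf, $_, qr/ABC/, 'ABC') } ('XXA', 'BCYYY');
print "@results\n";   # prints "0 1"
```

After the first chunk the buffer holds "XA", so the "A" that begins the pattern survives until "BCYYY" arrives.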

Re^4: Find Prefix if regex didn't match
by demoralizer (Acolyte) on Oct 31, 2012 at 14:05 UTC
    Ah, now I get your idea. Not bad, but it's too simplified ;)

    For example, extracting a \w+ prefix from the expression doesn't work with something like this:
    my $search="(AB)+.*Z";

    The problem is that the search string is given, so I have no influence on it. Maybe you were confused by my ".*ABC" example, but what I meant there was that in such a case there is no unmatchable prefix that can be cut away.
      Forget about cutting; cutting will never speed anything up.
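To make the limitation concrete, here is a small sketch of the \w+ extraction applied to a few patterns. The function name literal_prefix is hypothetical; it only wraps the /^(\w+)/ match used earlier in the thread:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the /^(\w+)/ idea: return the leading run of
# word characters of a pattern, or undef if the pattern starts otherwise.
sub literal_prefix {
    my ($pattern) = @_;
    return $pattern =~ /^(\w+)/ ? $1 : undef;
}

for my $pattern ('AB.*Z', '(AB)+.*Z', '.*ABC', 'AB*Z') {
    my $prefix = literal_prefix($pattern);
    printf "%-10s => %s\n", $pattern, defined $prefix ? $prefix : '(none)';
}
# 'AB.*Z'    yields "AB"
# '(AB)+.*Z' and '.*ABC' yield no prefix at all
# 'AB*Z'     yields "AB", which is wrong as a required prefix because
#            the B is optional (the pattern also matches "AZ")
```

So a user-supplied pattern would need real analysis of the regex, not just a textual prefix grab, before any head of the buffer could safely be cut.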
        In my case I see that cutting works...

        Maybe some more explanation is necessary: I'm reading (non-blocking) from a socket to which a process running on another machine sends logging output. Sometimes I get many single characters, sometimes I get large blocks. I don't know when I will receive the next packet. A user can give me a regular expression to watch for and a timeout value. If I can match, I return immediately; if not, I return with an error after the given timeout.

        Expressions like "ABC" work quite fast and don't cause any problems with the timeout, but expressions like "AB.*Z" do. They only work as long as I get a few characters within a few packets, not thousands of them.

        If I check the length of the received string and cut it once it grows larger than, e.g., 20 characters, I no longer have any timeout problems and everything works fine.
