PerlMonks  

Re^2: Find Prefix if regex didn't match

by demoralizer (Acolyte)
on Oct 31, 2012 at 12:06 UTC (#1001656)


in reply to Re: Find Prefix if regex didn't match
in thread Find Prefix if regex didn't match

Thanks for your fast answer!

Some good approaches, but none of them hits the mark...

pos() doesn't work here: if the string contains only a prefix of the given expression (which is exactly the case I'm interested in), the match fails and I get undef rather than the position I need. Or am I overlooking something?
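The behaviour described in this paragraph can be checked with a minimal sketch. The pattern and the string are borrowed from the examples used later in this thread; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $string  = "WWWA";                     # holds only "A", a prefix of the literal "AB"
my $matched = $string =~ /AB.*Z/g;        # scalar-context m//g; fails here

# After a failed m//g (without the /c modifier) pos() is reset to undef,
# so it gives no hint about where a partial match might have started.
my $report = defined( pos($string) ) ? "pos = " . pos($string) : "pos is undef";
print "$report\n";
```

So pos() only reports where a successful match ended; it says nothing about a partial match at the end of the string.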

My "window" can become as large as it likes; that doesn't matter. But I have to ensure that no match is overlooked, that's quite important! The point is only to make the search faster: knowing that the first n characters can be thrown away because they can never be part of a match would do the job. Any alternative suggestion?


Re^3: Find Prefix if regex didn't match
by Anonymous Monk on Oct 31, 2012 at 12:59 UTC

    pos() doesn't work because if the string only contains a prefix of the given expression (what will be the hot case I'm looking for) I will get undef and not the position I need or do I overlook sth. here?

    Once again, in English, please?

    Any alternative suggestion?

    Not really. Removing a prefix itself requires a regex match, and then you still do the real matching. I doubt there are any savings to be had by matching twice, or by actually cutting the string, even with pos.

    my $search="AB.*Z";
    my $string="WWWA";

    my $search_prefix = $1 if $search =~ /^(\w+)/g;
    warn $search_prefix;
    my $prefix_offset = index ( $string, $search_prefix );
    substr $string, 0, $prefix_offset , '';
    warn $string;
    $string = "WWWADBBBABC";
    $prefix_offset = index ( $string, $search_prefix );;
    warn $string;
    substr $string, 0, $prefix_offset , '';
    warn $string;
    $string = "WWWADBBBABC";
    pos( $string ) = $prefix_offset;
    warn pos( $string ); ## next match m//g starts at offset
    __END__
    AB at jank line 5.
    A at jank line 8.
    WWWADBBBABC at jank line 11.
    ABC at jank line 13.
    8 at jank line 16.

    For some idea of why I think so, see perhaps Why does global match run faster than none global? and Multiple Regex evaluations or one big one?

      I do not believe you are gaining anything by the strategy you suggest in your problem.

      The only part of the text you can safely throw away is the part that precedes any occurrence of the regex's leading "fixed" characters, and even then you must hold back a tail as long as that "fixed" character string, since it may contain the start of a match that has not fully arrived yet.

      For example, a search for AB.*Z can only throw away text up to the first AB it encounters. From then on, greedy matching means it must consume all text until it finds a Z, so even if the next Z is several million characters after the AB, the program must keep all of it and run the search from that AB.
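The rule above can be sketched roughly as follows. This is only an illustration under the assumption that the pattern starts with a known fixed string; `trim_buffer` is a hypothetical helper name, not something posted in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return $buffer without the head that can never be
# part of a match of a pattern beginning with the fixed string $literal.
sub trim_buffer {
    my ( $buffer, $literal ) = @_;

    # A full occurrence exists: everything before the first one is dead text.
    my $at = index( $buffer, $literal );
    return substr( $buffer, $at ) if $at >= 0;

    # No occurrence yet: keep only a tail that could still hold the
    # beginning of $literal (at most length - 1 characters).
    my $keep = length($literal) - 1;
    return ''      if $keep <= 0;
    return $buffer if length($buffer) <= $keep;
    return substr( $buffer, -$keep );
}

print trim_buffer( "WWWADBBBABC", "AB" ), "\n";    # text from the first "AB" on
print trim_buffer( "WWWA",        "AB" ), "\n";    # only the trailing "A" survives
```

Once an AB has been seen, nothing after it can be dropped until the whole pattern matches, which is the limitation described above.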

      In summary, if you are finding searches slow, then you should perhaps be looking at doing the search less often, perhaps as a scheduled task or when the text grows by a set amount.

        This is exactly what I'm trying to do: "throwing away text until it encounters the first AB...".

        The reason is that I have a timeout by which the searched string must be found; otherwise it's an error. The reaction time should be as short as possible, so I have to scan every time I receive something. Searching less often therefore won't work (see the short example I posted: I search only as often as absolutely necessary, but no more). Searching only after the text has grown by a certain number of characters isn't practicable either, because I may receive a short packet containing the expression before the required size has been reached. Such cases would need another timeout, which would lengthen the reaction time more than necessary.

        In most cases the string I'm looking for occurs only once, and most of the time the scanned text doesn't even contain a prefix of it. But it may take two TCP packets to receive the text (e.g. I'm looking for "ABC" and receive "XXA" in the first packet and "BCYYY" in the second), and I must not miss any match, so the only possible optimization I see is cutting the "head" away. The given benchmark shows me that this works! That is because the whole text can become quite large and can arrive in several packets, isn't it?
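The two-packet case from this post ("XXA" followed by "BCYYY") can be sketched as below. It assumes the search target is the plain string "ABC" rather than an arbitrary regex, and `feed` is a hypothetical receive handler, not the code used in the benchmark mentioned above.

```perl
use strict;
use warnings;

my $literal = "ABC";    # the text being waited for (example from the post)
my $buffer  = '';       # unscanned remainder carried over between packets

# Hypothetical receive handler: append the new chunk, look for the literal,
# and otherwise keep only a tail short enough to hold a split occurrence.
sub feed {
    my ($chunk) = @_;
    $buffer .= $chunk;
    return 1 if index( $buffer, $literal ) >= 0;

    my $keep = length($literal) - 1;
    if    ( $keep <= 0 )              { $buffer = '' }
    elsif ( length($buffer) > $keep ) { $buffer = substr( $buffer, -$keep ) }
    return 0;
}

my @results = map { feed($_) } ( "XXA", "BCYYY" );
print "@results\n";     # one flag per packet: found (1) or not yet (0)
```

The buffer never grows beyond one chunk plus length($literal) - 1 characters while nothing has been found, which is where any saving would come from.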

      Aah, now I get your idea. Not bad, but it's too simplified ;)

      Extracting a \w+ prefix from the expression doesn't work, for example, with patterns like this:
      my $search="(AB)+.*Z";

      The problem is that the search string is given, so I have no influence on it. Maybe my ".*ABC" example confused you, but what I meant there was that in such a case there is no unmatchable prefix that can be cut away.
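A small sketch of why the \w+ extraction is too simple. The first three patterns come from this thread; "AB?C" is an extra example added here as an assumption, and `naive_prefix` is a hypothetical name for the extraction used in the earlier reply.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the /^(\w+)/ extraction from the reply above.
sub naive_prefix {
    my ($search) = @_;
    my ($prefix) = $search =~ /^(\w+)/;
    return $prefix;    # undef when the pattern does not start with word characters
}

for my $search ( "AB.*Z", "(AB)+.*Z", ".*ABC", "AB?C" ) {
    my $prefix = naive_prefix($search);
    printf "%-10s => %s\n", $search,
        defined($prefix) ? "'$prefix'" : "no literal prefix found";
}
```

For "(AB)+.*Z" and ".*ABC" nothing is extracted at all, and for "AB?C" the extracted "AB" is misleading because the B is optional, so "AC" would match without ever containing "AB".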
        forget about cutting, cutting will never speed anything up
