Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Re: In search of an efficient query abstractor

by Corion (Pope)
on Dec 07, 2008 at 15:04 UTC ( #728720=note: print w/replies, xml ) Need Help??

in reply to In search of an efficient query abstractor

I think part of the inefficiency is that you're running multiple regular expressions over your code. You could roll all these regular expressions into one regular expression, at least that's easy as long as the first character of any candidate match is different between all regular expressions:

my $re_num = qr{ [+-]? (?: \d+ (?:[.]\d*)? |[.]\d+ ) (?:e[+-]?\d+)? \b }x; # Float/real into N my $re_hexbin = qr{\b0(?:x[0-9a-f]+|b[01]+)\b}; # Hex/bin into N my $num_combined = qr{ $re_num | $re_hexbin }; $query =~ s/$num_combined/N/g;

If you want to replace both, strings and numbers in one go, you'll have to combine the regular expressions as above but capture into (say) $1 for numbers and into $2 for strings and then replace with S or N whether $1 or $2 is defined.

Replies are listed 'Best First'.
Re^2: In search of an efficient query abstractor
by xaprb (Scribe) on Dec 07, 2008 at 15:38 UTC

    As I understand it, that would still do backtracking, right? Try to match the first one; if it fails, backtrack and try to match the alternative.

    Another challenge here is that the initial characters are optional. Maybe if I write things out fully it becomes easier to optimize. I'll meditate on that. In that case, it might look like

    my $number = qr{ \b\d+ | \b\d+\.\d+\ | \b\.\d+ .... }x;

    But it already looks like I'm introducing backtracking again. My gut feeling is that I can write a state machine for this that doesn't need to do backtracking. Hmmm :-\

      Of course you can just do what lex does, but if you use a good character class as the first character and thus eliminate backtracking back over the first character (for example, eliminate the optional part of [-+]?\d+ by making it into [-+\d]\d*) you can use the atomic match operator ?> to eliminate lots of backtracking, as you know that your SQL is well-formed, or rather, it's of little concern to you if the SQL is not well-formed.

      Ideally, you have a disjunct set of character classes that start the separate matches. Likely, the disjunct set would be [-+\d] for numbers and ' for strings. If you want to be more careful, you can treat 0[bx]\w+ a bit more discerning, but I wouldn't bother and simply assume instead that the SQL is well-formed.

      Can you identify token separators, and break the input up into stuff which isn't a problem, and stuff which might be ?

      Starting by tidying up:

      $query =~ s/\s+/ /g ; # that's the whitespace $query =~ s/\A\s// ; # strip leading $query =~ s/\s\Z// ; # strip trailing $query = lc($query) ; # all lower case $query =~ s/(["'])((?:\\\1|\1\1|.)*?)\1/mash_s($1, $2)/eg ; # Eliminate separators from quoted string +s sub mash_s { my ($q, $s) = @_ ; $s =~ tr/0-9a-z/\\/c ; return $q.$s.$q ; } ;
      which, in particular, leaves all "..." or '...' strings containing only [0-9a-z\\]. Means that can then attack anything between separator characters:
      $query =~ s/([^ !#\$%()*,\/:;<=>?\@[\]^{|}~]+)/mash_l($1)/eg ; sub mash_l { my ($s) = @_ ; return $s if $s =~ /^(?:[a-z]+|\+|\-)$/ ; return 'N' if $s =~ /^[+-]?(?: (?:\d+(?:\.\d*)? | \.\d+) (?:e[+-]\d+ +)? |(?:0(?: x[0-9a-f]+ |b[01]+ ) ) |x'[0-9a-f]+' |b'[01]+' )$/x ; return 'S' if $s =~ /^(["']).*?\1$/ ; return $s ; } ;
      Sadly, what this shows most clearly is that distinguishing unary and binary '+' and '-' is tricky. The above will cope with 12 + -17 and 12*-5, but will fail on 12+13 or 12 +-13 and so on...

      ...using a parser, where somebody else has done all the hard work, looks like a good trick !

        About unary/binary: I had the same thought while sketching out a state machine. Obviously you have to keep some context to know which is which. I'm thinking that brute-forcing and just treating such an expression as a number is acceptable for this log analysis. I mean,

        select 5 + 1; select 6; select 8 + 1+-5;

        From the point of view of log analysis, those statements are all similar. Selecting a number is selecting a number, mush them all together and report on them in aggregate.

        Of course that's not strictly true. You might have a silly application that constantly does "select 5" and not so frequently does "select 5 + 5" and you want to be able to distinguish them so you can find the offending code that's causing the first query. But that's a corner case.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://728720]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (2)
As of 2021-06-13 02:11 GMT
Find Nodes?
    Voting Booth?
    What does the "s" stand for in "perls"? (Whence perls)

    Results (54 votes). Check out past polls.