Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re^2: In search of an efficient query abstractor

by xaprb (Scribe)
on Dec 07, 2008 at 15:38 UTC ( #728730=note: print w/ replies, xml ) Need Help??


in reply to Re: In search of an efficient query abstractor
in thread In search of an efficient query abstractor

As I understand it, that would still do backtracking, right? Try to match the first one; if it fails, backtrack and try to match the alternative.

Another challenge here is that the initial characters are optional. Maybe if I write things out fully it becomes easier to optimize. I'll meditate on that. In that case, it might look like

my $number = qr{ \b\d+ | \b\d+\.\d+\ | \b\.\d+ .... }x;

But it already looks like I'm introducing backtracking again. My gut feeling is that I can write a state machine for this that doesn't need to do backtracking. Hmmm :-\


Comment on Re^2: In search of an efficient query abstractor
Download Code
Re^3: In search of an efficient query abstractor
by Corion (Pope) on Dec 07, 2008 at 15:48 UTC

    Of course you can just do what lex does, but if you use a good character class as the first character and thus eliminate backtracking back over the first character (for example, eliminate the optional part of [-+]?\d+ by making it into [-+\d]\d*) you can use the atomic match operator ?> to eliminate lots of backtracking, as you know that your SQL is well-formed, or rather, it's of little concern to you if the SQL is not well-formed.

    Ideally, you have a disjunct set of character classes that start the separate matches. Likely, the disjunct set would be [-+\d] for numbers and ' for strings. If you want to be more careful, you can treat 0[bx]\w+ a bit more discerning, but I wouldn't bother and simply assume instead that the SQL is well-formed.

Re^3: In search of an efficient query abstractor
by gone2015 (Deacon) on Dec 07, 2008 at 20:44 UTC

    Can you identify token separators, and break the input up into stuff which isn't a problem, and stuff which might be ?

    Starting by tidying up:

    $query =~ s/\s+/ /g ; # that's the whitespace $query =~ s/\A\s// ; # strip leading $query =~ s/\s\Z// ; # strip trailing $query = lc($query) ; # all lower case $query =~ s/(["'])((?:\\\1|\1\1|.)*?)\1/mash_s($1, $2)/eg ; # Eliminate separators from quoted string +s sub mash_s { my ($q, $s) = @_ ; $s =~ tr/0-9a-z/\\/c ; return $q.$s.$q ; } ;
    which, in particular, leaves all "..." or '...' strings containing only [0-9a-z\\]. Means that can then attack anything between separator characters:
    $query =~ s/([^ !#\$%()*,\/:;<=>?\@[\]^{|}~]+)/mash_l($1)/eg ; sub mash_l { my ($s) = @_ ; return $s if $s =~ /^(?:[a-z]+|\+|\-)$/ ; return 'N' if $s =~ /^[+-]?(?: (?:\d+(?:\.\d*)? | \.\d+) (?:e[+-]\d+ +)? |(?:0(?: x[0-9a-f]+ |b[01]+ ) ) |x'[0-9a-f]+' |b'[01]+' )$/x ; return 'S' if $s =~ /^(["']).*?\1$/ ; return $s ; } ;
    Sadly, what this shows most clearly is that distinguishing unary and binary '+' and '-' is tricky. The above will cope with 12 + -17 and 12*-5, but will fail on 12+13 or 12 +-13 and so on...

    ...using a parser, where somebody else has done all the hard work, looks like a good trick !

      About unary/binary: I had the same thought while sketching out a state machine. Obviously you have to keep some context to know which is which. I'm thinking that brute-forcing and just treating such an expression as a number is acceptable for this log analysis. I mean,

      select 5 + 1; select 6; select 8 + 1+-5;

      From the point of view of log analysis, those statements are all similar. Selecting a number is selecting a number, mush them all together and report on them in aggregate.

      Of course that's not strictly true. You might have a silly application that constantly does "select 5" and not so frequently does "select 5 + 5" and you want to be able to distinguish them so you can find the offending code that's causing the first query. But that's a corner case.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://728730]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (7)
As of 2015-07-04 21:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (60 votes), past polls