
Re^3: In search of an efficient query abstractor

by mpeg4codec (Pilgrim)
on Dec 07, 2008 at 21:14 UTC (#728794)


in reply to Re^2: In search of an efficient query abstractor
in thread In search of an efficient query abstractor

The main issue you're running into is that SQL is a context-free language. That is, it belongs to a class of languages whose expressive power is greater than that of regular languages, so it cannot be parsed using regular expressions in the formal sense. (The "regular" in regular expressions refers to regular languages.) To handle SQL statements properly you'll need a real parser, for example a recursive descent parser.
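To make the nesting point concrete, here is a minimal sketch in Python (not Perl, and not from the original post). Parenthesized expressions and subqueries in SQL can nest to any depth, and keeping track of that depth needs unbounded memory such as a counter or a stack, which a finite automaton does not have. The function name `max_paren_depth` is made up for illustration, and the sketch does not special-case parentheses inside quoted strings.

```python
def max_paren_depth(sql: str) -> int:
    """Return the deepest parenthesis nesting level, or raise on imbalance.

    Assumption: parentheses inside quoted strings are not special-cased.
    """
    depth = 0
    deepest = 0
    for ch in sql:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')'")
    if depth != 0:
        raise ValueError("unbalanced '('")
    return deepest
```

The counter is the important part: it can grow without limit, which is exactly what a regular language cannot express.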

The best place to begin is to find a grammar for the specific dialect of SQL you're using. You mention that you're parsing MySQL logs, so you may be in luck. The MySQL docs include a grammar for each SQL statement MySQL understands. For instance, on the doc page for SELECT, the gray box in fixed-width font is the grammar.

I do not know the specifics of RecDescent (never used it), but I think you should be able to give it the grammar and a SQL statement and it will give you a parse tree. That greatly simplifies your job, because now you only have to recognize patterns in the tree. In fact, productions of the grammar using specific rules are probably the exact "prototypes" you're looking for.
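As a rough illustration of the "grammar in, parse tree out" idea, here is a hand-written recursive descent parser in Python for a tiny, made-up subset of SELECT. It is not Parse::RecDescent and not the MySQL grammar; the grammar, the token names, and the tree shape are all assumptions chosen for brevity. Literals are recorded only by type, so queries that differ only in their values produce the same tree.

```python
import re

# Hypothetical toy grammar (one method per rule below):
#   select    := SELECT columns FROM NAME [ WHERE condition ]
#   columns   := "*" | NAME { "," NAME }
#   condition := term { (AND | OR) term }
#   term      := "(" condition ")" | NAME "=" (NUMBER | STRING)
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)
  | (?P<STRING>'[^']*')
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<PUNCT>[(),=*])
""", re.VERBOSE)
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

def tokenize(sql):
    tokens, pos = [], 0
    while pos < len(sql):
        if sql[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "NAME" and text.upper() in KEYWORDS:
            kind, text = "KEYWORD", text.upper()
        tokens.append((kind, text))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ("EOF", "")

    def take(self, kind, text=None):
        tok = self.peek()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, got {tok[1] or 'end of input'}")
        self.i += 1
        return tok[1]

    def select(self):
        self.take("KEYWORD", "SELECT")
        cols = self.columns()
        self.take("KEYWORD", "FROM")
        table = self.take("NAME")
        where = None
        if self.peek() == ("KEYWORD", "WHERE"):
            self.take("KEYWORD", "WHERE")
            where = self.condition()
        self.take("EOF")
        return ("select", cols, table, where)

    def columns(self):
        if self.peek() == ("PUNCT", "*"):
            self.take("PUNCT", "*")
            return ["*"]
        cols = [self.take("NAME")]
        while self.peek() == ("PUNCT", ","):
            self.take("PUNCT", ",")
            cols.append(self.take("NAME"))
        return cols

    def condition(self):
        node = self.term()
        while self.peek() in (("KEYWORD", "AND"), ("KEYWORD", "OR")):
            op = self.take("KEYWORD")
            node = (op.lower(), node, self.term())
        return node

    def term(self):
        if self.peek() == ("PUNCT", "("):
            self.take("PUNCT", "(")
            node = self.condition()  # recursion handles arbitrary nesting
            self.take("PUNCT", ")")
            return node
        name = self.take("NAME")
        self.take("PUNCT", "=")
        kind, _ = self.peek()
        if kind not in ("NUMBER", "STRING"):
            raise SyntaxError("expected a literal")
        self.i += 1
        return ("eq", name, kind.lower())  # keep the literal's type, drop its value

def parse(sql):
    return Parser(tokenize(sql)).select()
```

With this sketch, `parse("SELECT * FROM t WHERE id = 1")` and `parse("SELECT * FROM t WHERE id = 99")` return the same tuple, which is the kind of query "prototype" the thread is after.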

A note on efficiency: a recursive descent parser isn't terribly efficient. General context-free parsing is O(n^3) in the size of the input, and a naive backtracking recursive descent parser can be worse still. If speed's the name of the game and you have a lot of time to spend on this, you can try building an LL(1) grammar for SQL, which can be parsed in O(n) with a custom predictive parser. Not recommended for the faint of heart.

Further note: I'm not the anonymous monk in the gp, just a CS student with a passion for theory.


Re^4: In search of an efficient query abstractor
by xaprb (Scribe) on Dec 07, 2008 at 21:21 UTC

    Understood. I'm an ex-CS student whose theory is now a distant memory :) But I don't think I need to actually parse SQL to accomplish this. Abstracting away strings and numbers is a much easier problem than parsing a language, and I'm pretty sure it's going to be faster (perhaps not in Perl, though).
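The string-and-number abstraction described above can be sketched with a few regular expressions. This is a Python illustration, not code from the thread; the function name `abstract_query` and the `?` placeholder are assumptions, and the sketch knowingly ignores SQL comments, hex literals, and exponent notation such as `1e5`.

```python
import re

# One pattern for both quote styles, so whichever quote opens first wins
# and an apostrophe inside a double-quoted string is not misread.
QUOTED = re.compile(
    r"""
      '(?:[^'\\]|\\.|'')*'    # single-quoted; backslash or doubled-quote escapes
    | "(?:[^"\\]|\\.|"")*"    # double-quoted; same escape rules
    """,
    re.VERBOSE | re.DOTALL,
)
# Word boundaries keep digits that are part of identifiers (t1, col2) intact.
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
WHITESPACE = re.compile(r"\s+")

def abstract_query(sql: str) -> str:
    """Replace string and numeric literals with '?' and collapse whitespace."""
    sql = QUOTED.sub("?", sql)   # strings first, so digits inside them vanish
    sql = NUMBER.sub("?", sql)
    return WHITESPACE.sub(" ", sql).strip()
```

For example, `abstract_query("SELECT * FROM users WHERE id = 42 AND name = 'O''Brien'")` returns `SELECT * FROM users WHERE id = ? AND name = ?`.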

    Alas, the MySQL grammar is not actually the same thing as the gray box on the manual pages. The real grammar is in sql_yacc.yy, which is something out of a horror film.

      yacc requires the grammar to be LALR(1), which I agree belongs in the ninth circle of hell. The grammar on the MySQL pages is an unspecified sort of context-free grammar, and I believe RecDescent supports that.

      This is pretty similar to the HTML parsing debate (which never seems to end). You want to get some data out of an HTML page? Go with regex. Want to do anything related to the structure of the HTML and actually parse it? Definitely go with one of the parser modules.

      Analogously, since you're trying to poke around the structure of SQL statements, my recommendation still stands. OTOH, I can understand resistance with regard to picking up RecDescent for a relatively straightforward task such as this one.

      Best of luck!

        I would strongly recommend against Parse::RecDescent for this. It was written before the /g modifier existed in Perl, so every time it matches a token it makes a copy of everything that comes after the token. On even a fairly small data set this can take a prohibitive amount of time and memory.

        Changing that would entail rewriting the whole module. TheDamian had plans to do this, but I don't know if it ever happened. He did tell me that said rewrite was going to have to be incompatible with the original in some ways.
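The cost of copying the remaining input after every token can be illustrated with a small Python sketch. This is not Parse::RecDescent's actual code, and both function names are made up. The first tokenizer slices off the rest of the string after each match, which copies it and makes total work grow quadratically with input size. The second leaves the string alone and advances an integer offset, similar in spirit to resuming a Perl match from `pos()`.

```python
import re

# A deliberately crude token pattern: a word, or any single non-space character.
TOKEN = re.compile(r"\s*(\w+|\S)")

def tokens_by_copying(text):
    """Consume input by slicing off the remainder after every token.

    Each slice copies the rest of the string, so n tokens cost O(n^2) overall.
    """
    out = []
    while True:
        m = TOKEN.match(text)
        if not m:
            break
        out.append(m.group(1))
        text = text[m.end():]  # copies everything after the token
    return out

def tokens_by_offset(text):
    """Keep the string intact and advance an integer position instead."""
    out = []
    pos = 0
    while True:
        m = TOKEN.match(text, pos)  # match starting at pos, no copy made
        if not m:
            break
        out.append(m.group(1))
        pos = m.end()
    return out
```

Both functions return the same token list; only the amount of copying differs, which matters once the input is a large query log rather than a single statement.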
