
The main issue you're running into is that SQL is a context-free language, not a regular one. That is, it belongs to a class of languages whose expressive power is greater than that of regular languages (think of arbitrarily nested subqueries and parentheses), and thus it cannot, in general, be parsed using regular expressions. (The "regular" in regular expressions refers to regular languages.) You'll need a real parser, such as a recursive descent parser, to properly handle SQL statements.
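To make the nesting point concrete, here is a minimal sketch in Python (not Perl, purely for illustration). The function name `max_paren_depth` is hypothetical, and the code deliberately ignores parentheses inside string literals and comments. What it shows is that tracking nesting needs a counter that can grow without bound, which is exactly what a finite automaton, and hence a regular expression in the formal sense, does not have.

```python
def max_paren_depth(sql: str) -> int:
    """Return the deepest parenthesis nesting in sql.

    Raises ValueError if the parentheses are unbalanced.
    Assumption: no parentheses occur inside string literals or comments.
    """
    depth = 0
    deepest = 0
    for ch in sql:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')'")
    if depth != 0:
        raise ValueError("unbalanced '('")
    return deepest


query = "SELECT a FROM t WHERE b IN (SELECT c FROM u WHERE d IN (SELECT e FROM v))"
print(max_paren_depth(query))
```

Any fixed pattern can only be written for a bounded depth; a query nested one level deeper than the pattern anticipated will slip through.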

The best place to begin is to find a grammar for the specific dialect of SQL you're using. You mention that you're parsing MySQL logs, so you may be in luck. The MySQL docs include a syntax summary, which is effectively a grammar, for each SQL statement MySQL understands. For instance, on the doc page for SELECT, the gray box in fixed-width font is the grammar.

I do not know the specifics of RecDescent (never used it), but I think you should be able to give it the grammar and a SQL statement and it will give you a parse tree. That greatly simplifies your job, because now you only have to recognize patterns in the tree. In fact, productions of the grammar using specific rules are probably the exact "prototypes" you're looking for.
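As a rough illustration of the idea (and not of RecDescent itself, which I haven't used), here is a hand-written recursive descent parser in Python for a tiny, made-up subset of SELECT. The grammar, the tree shape, and every name in it (`tokenize`, `Parser`, `parse`, `abstract`) are assumptions for the sketch and are not taken from the MySQL docs. Each grammar rule becomes one method, the result is a parse tree of nested tuples, and the "prototype" of a query falls out of walking the tree and replacing literals with `?`.

```python
import re

# One alternative per token kind; the group name becomes the token kind.
TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<number>\d+(?:\.\d+)?)
      | (?P<string>'(?:[^']|'')*')
      | (?P<word>[A-Za-z_][A-Za-z_0-9]*)
      | (?P<op><=|>=|<>|!=|[=<>])
      | (?P<punct>[(),*])
    )""", re.VERBOSE)

KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "IN"}


def tokenize(sql):
    sql = sql.rstrip()
    tokens, pos = [], 0
    while pos < len(sql):
        m = TOKEN_RE.match(sql, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {sql[pos]!r}")
        kind = m.lastgroup
        text = m.group(kind)
        if kind == "word" and text.upper() in KEYWORDS:
            kind, text = "keyword", text.upper()
        tokens.append((kind, text))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("eof", "")

    def accept(self, kind, text=None):
        tok = self.peek()
        if tok[0] == kind and (text is None or tok[1] == text):
            self.pos += 1
            return tok
        return None

    def expect(self, kind, text=None):
        tok = self.accept(kind, text)
        if tok is None:
            raise SyntaxError(f"expected {text or kind}, got {self.peek()[1]!r}")
        return tok

    # select := SELECT columns FROM word [WHERE condition]
    def select(self):
        self.expect("keyword", "SELECT")
        cols = self.columns()
        self.expect("keyword", "FROM")
        table = self.expect("word")[1]
        where = self.condition() if self.accept("keyword", "WHERE") else None
        return ("select", cols, table, where)

    # columns := '*' | word (',' word)*
    def columns(self):
        if self.accept("punct", "*"):
            return ["*"]
        cols = [self.expect("word")[1]]
        while self.accept("punct", ","):
            cols.append(self.expect("word")[1])
        return cols

    # condition := comparison (AND comparison)*
    def condition(self):
        node = self.comparison()
        while self.accept("keyword", "AND"):
            node = ("and", node, self.comparison())
        return node

    # comparison := word (op | IN) value
    def comparison(self):
        left = self.expect("word")[1]
        tok = self.accept("op") or self.expect("keyword", "IN")
        return ("cmp", tok[1], left, self.value())

    # value := number | string | '(' select ')'   (the recursive case)
    def value(self):
        if self.accept("punct", "("):
            sub = self.select()
            self.expect("punct", ")")
            return ("subquery", sub)
        tok = self.accept("number") or self.expect("string")
        return ("literal", tok[1])


def parse(sql):
    parser = Parser(tokenize(sql))
    tree = parser.select()
    parser.expect("eof")  # reject trailing junk
    return tree


def abstract(node):
    """Render a parse tree back to SQL with every literal replaced by '?'."""
    kind = node[0]
    if kind == "select":
        _, cols, table, where = node
        text = f"SELECT {', '.join(cols)} FROM {table}"
        return text + (f" WHERE {abstract(where)}" if where else "")
    if kind == "and":
        return f"{abstract(node[1])} AND {abstract(node[2])}"
    if kind == "cmp":
        return f"{node[2]} {node[1]} {abstract(node[3])}"
    if kind == "subquery":
        return f"({abstract(node[1])})"
    return "?"  # literal


print(abstract(parse("select id, name from users where age > 30 and city = 'Oslo'")))
# prints: SELECT id, name FROM users WHERE age > ? AND city = ?
```

Two queries that differ only in their literals produce the same abstracted string, so that string can serve as the key when grouping log entries. A real grammar for MySQL's SELECT is far larger, but the structure (one method per rule, recursion for subqueries) stays the same.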

A note on efficiency: a recursive descent parser isn't terribly efficient. General context-free parsing is O(n^3) in the size of the input, and a naive backtracking recursive descent parser can do even worse on some grammars. If speed's the name of the game and you have a lot of time to spend on this, you can try building an LL(1) grammar for SQL, which can be parsed in O(n) using a custom predictive parser. Not recommended for the faint of heart.
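For a feel of what the LL(1) route looks like, here is a sketch of a table-driven predictive parser in Python for a toy grammar (a SELECT with a column list and a table name, nothing more). The grammar, the table, and the names `TABLE`, `classify`, and `ll1_accepts` are all made up for illustration; a usable LL(1) grammar for real SQL would be much bigger and much harder to write. The point is that one token of lookahead always picks the production, so there is no backtracking and each token is consumed exactly once.

```python
# Toy LL(1) grammar:
#   query -> SELECT cols FROM id
#   cols  -> id more
#   more  -> , id more | (empty)
#
# Parse table: (nonterminal, lookahead) -> right-hand side to expand.
TABLE = {
    ("query", "SELECT"): ["SELECT", "cols", "FROM", "id"],
    ("cols", "id"): ["id", "more"],
    ("more", ","): [",", "id", "more"],
    ("more", "FROM"): [],  # the empty production
}
NONTERMINALS = {"query", "cols", "more"}


def classify(word):
    """Map a raw word to a terminal symbol of the toy grammar."""
    upper = word.upper()
    if upper in ("SELECT", "FROM"):
        return upper
    return "," if word == "," else "id"


def ll1_accepts(sql):
    """Return True if sql is a sentence of the toy grammar."""
    tokens = [classify(w) for w in sql.replace(",", " , ").split()] + ["$"]
    stack = ["$", "query"]  # "$" marks the bottom of the stack and end of input
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, look))
            if rhs is None:
                return False  # no production fits this lookahead
            stack.extend(reversed(rhs))
        elif top == look:
            pos += 1  # terminal matched, consume the token
        else:
            return False
    return pos == len(tokens)


print(ll1_accepts("SELECT a, b FROM t"))
print(ll1_accepts("SELECT a b FROM t"))
```

The hard part is not the loop, it's massaging the grammar until every table cell holds at most one production.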

Further note: I'm not the anonymous monk in the gp, just a CS student with a passion for theory.

In reply to Re^3: In search of an efficient query abstractor by mpeg4codec
in thread In search of an efficient query abstractor by xaprb
