Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Regex: Identifying comments

by zuma53 (Beadle)
on Aug 29, 2012 at 14:20 UTC ( #990479=perlquestion: print w/ replies, xml ) Need Help??
zuma53 has asked for the wisdom of the Perl Monks concerning the following question:

Hi--

I am working on some SQL files where rest-of-line comments are denoted by '--'. Everything to the right of this tag is considered a comment. However, all this gets gummed up by the presence of single quotes. Embedded '--''s in strings are not comment anchors.

Example:

select 'text' from foo --This is a comment select '--Not a valid comment' from foo --But this is select '--This is not a valid comment' from foo select '--Not this' + '--either' from foo

What I'd like is to strip out the comments. I can do this programmatically, though tedious, by counting quotations, but is there a way to do this in a regex?

Thanks!

Comment on Regex: Identifying comments
Download Code
Re: Regex: Identifying comments
by choroba (Abbot) on Aug 29, 2012 at 14:33 UTC
    This should work for simple cases (no newlines inside single quotes, no single quotes in comments):
    #!/usr/bin/perl use warnings; use strict; while (<DATA>) { my $last = (split /'/)[-1]; print $1 if $last =~ /(--.*)/; } __DATA__ select 'text' from foo --This is a comment select '--Not a valid comment' from foo --But this is select q from z -- as is this select '--This is not a valid comment' from foo select '--Not this' + '--either' from foo
    Updated.
    To handle single quotes in comments, you might need to change the script to the following:
    #!/usr/bin/perl use warnings; use strict; while (<DATA>) { my @items = split /'/; until (not @items or $items[0] =~ s/.*?--/--/) { shift @items for 1, 2; # Remove the quoted part, too. } print join "'", @items; } __DATA__ select 'text' from foo --This is a comment select '--Not a valid comment' from foo --But this is select q from z -- as is this select '--This is not a valid comment' from foo select '--Not this' + '--either' from foo select 'qaws' + make from "a" -- comment with 'a' quote select 'a' from 'b' with 'c' -- comment with 'a --' comment
    To get the code instead of the comments, just invert the logic:
    while (<DATA>) { chomp; my @code; my @items = split /'/; until (not @items or $items[0] =~ s/--.*//) { @items and push @code, shift @items for 1, 2; } print join("'", @code), @items ? (@code ? "'" : q() ) . "$items[0] +" : q(), "\n"; }
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Regex: Identifying comments
by Athanasius (Monsignor) on Aug 29, 2012 at 14:40 UTC

    Here is another approach:

    #! perl use strict; use warnings; my @selects = ("select 'text' from foo --This is a comment", "select '--Not a valid comment' from foo --But t +his is", "select '--This is not a valid comment' from foo", "select '--Not this' + '--either' from foo"); for my $select (@selects) { my ($comment) = $select =~ / ( -- [^']+ ) $ /x; print $comment, "\n" if $comment; }

    Output:

    --This is a comment --But this is

    Update: The above code does not handle single quotes in comments. The following is one way to do this (indebted to choroba’s solution, above):

    #! perl use strict; use warnings; my @selects = ("select 'text' from foo --This is a comment", "select '--Not a valid comment' from foo --But t +his is", "select '--This is not a valid comment' from foo", "select '--Not this' + '--either' from foo", "select stuff from that -- a 'useful comment' goes here +?", "select 'qaws' + make from \"a\" -- comment with 'a' qu +ote", "select 'a' from 'b' with 'c' -- comment with 'a --' co +mment"); for my $select (@selects) { my $stripped = $select =~ s/ ( ' .*? ' ) /'_' x length $1/egrx; if ($stripped =~ /--/) { my $i = length($select) - (index reverse($stripped), '--') - 2 +; print substr($select, $i), "\n"; } }

    Output:

    --This is a comment --But this is -- a 'useful comment' goes here? -- comment with 'a' quote -- comment with 'a --' comment

    Hope that helps,

    Athanasius <°(((><contra mundum

      I like the look of this approach, but is this a valid comment ?

       select stuff from that -- a 'useful comment' goes here?
        Yes, as the quote comes after a free-and-clear '--'
Re: Regex: Identifying comments
by pvaldes (Chaplain) on Aug 30, 2012 at 10:06 UTC

    The comments in a sql file will be probably either at the start of the line with or without white spaces "^\s*--" or after a semicolon "\;\s*--". If I'm not wrong, this is not a valid sql sentence in most databases, thus you shouldn't have it:

    select '--foo' from mytable --bar

    (And you shouldn't have tables with name fields beginning by "--")

    Something like this should be enough to strip all comments from your file:

    open my ORIGINALFILE, '<', 'file.sql' or die... etc open my CLEANEDFILE, '>', outfile... etc, as usual while (<ORIGINALFILE>){ next if /^\s*--.*$/ ; s/\;\s*--.*$/;/ ; print CLEANEDFILE $_ ; }
Re: Regex: Identifying comments
by remiah (Hermit) on Aug 30, 2012 at 13:53 UTC

    How about this? I am going to match every sql with $2 and print. Ignore $1.

    use strict; use warnings; while (<DATA>) { chomp; while (m/(--.*?$) #that is comment | ((?:'[^']*'|[^'-]*)) #quated part of sql or ordinal part /xg) { #global, extended print $2 if $2; } print "\n"; } __DATA__ select 'text' from foo --This is a comment select '--Not a valid comment' from foo --But this is select q from z -- as is this select '--This is not a valid comment' from foo select '--Not this' + '--either' from foo select 'qaws' + make from "a" -- comment with 'a' quote select 'a' from 'b' with 'c' -- comment with 'a --' comment -- test comment (add1) select 'text\'s' from foo --escaped ... (add2) select 'text\'s' from foo --escaped' ... (add3)

    I am not so confident with regex and add escaped test.
    Then, add2 and add3 is not working well... sigh.

      IMHO the basic problem with your sql files is that don't look really like sql files, thus anything can be a comment and anything can be wrong, even if perl is acting fine.

      I don't know all databases in the whole world so don't take this advice as immutable, of course, but sql sentences normally aren't complete (and don't run) until you find a semicolon. Is also surprising that you could select a field named '--foo' when you can't even create a table like this:

      create table (--foo int); # error create table ('--foo' int); # error create table ("--foo" int); # error create table (\-\-foo int); # big error

      So you are probably trying to solve a nonexistent problem. Comments are always after a semicolon or in your own line. And it you have i.e this:

      -- ; select * from mytable where foo = "--34"; -- hi ', hi'. "hi'

      The complete line IS a comment here. I'm quoting lines that I don't want to run (temporally) all the time in sql files, it's not unusual to see something like this.

      So you should be pointing your arrows to ";" instead to "'". You probably should check and sanitize first what's in your files, before to pass this files to perl

        Thanks for reply, pvaldes.

        I noticed I misunderstood how to escape single quotes in sql. I thought of the case like this

        insert into foo (bar,boo) values('''a', 'b');
        And this is nothing surprise.
        select '--foo';
        I changed test data with real SQL with your advice. SQLite and Postgres accept these commands.
        drop table foo; create table foo( --tabel foo bar text, --fld name boo text --again ); --populate insert into foo (bar,boo) values('''a', 'b'); -- insert quote's sake insert into foo (bar,boo) values('c', 'd'); -- another line select 'test' || 'abc' as a ; -- maybe you use dual wi +th oracle? select 'test''s' || 'abc' as a ; -- maybe you use dual wi +th oracle? select '--foo';
        So far, so good.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://990479]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (4)
As of 2014-07-31 01:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (244 votes), past polls