dumb regex question

by linuxfan (Beadle)
on Apr 06, 2009 at 23:57 UTC (#755895)

linuxfan has asked for the wisdom of the Perl Monks concerning the following question:


I've not been able to figure out this simple regex problem. I need to match strings in one of the two formats below (after reading from a file):

"/moreIters 10"
"/bootMe any text here"

If the string begins with a quote, I need all the data enclosed within the quotes; otherwise I need the entire string that begins with /.

To begin with, I wrote the following simplistic regex:

$str =~ m,"?(/.*)"?,
which means: match zero or one double quote, followed by /, followed by any number of characters, followed by an optional double quote. However, this doesn't work because .* is greedy and matches the closing quote as well, and I get the following:

    my $str = qq( "/extend 100" );
    $str =~ m,"?(/.*)"?,;
    if (defined $1) {
        print "matched=$1\n";
    }

Output:

    matched=/extend 100"
The above works for strings such as /justThis:

    my $str = qq(/extendMe );
    $str =~ m,"?(/.*)"?,;
    if (defined $1) {
        print "matched=$1\n";
    }

Output:

    matched=/extendMe

How can I refine this regex to get the desired result?


Replies are listed 'Best First'.
Re: dumb regex question
by Nkuvu (Priest) on Apr 07, 2009 at 00:10 UTC

    I'd change the regex to exclude quotes, rather than match everything: m,"?(/[^"]*)"?,

    Test script (including some lines where I tried to break the match):

        #!/usr/bin/perl
        use strict;
        use warnings;

        while (my $line = <DATA>) {
            chomp $line;
            if ($line =~ m,"?(/[^"]*)"?,) {
                print "Line matched: $line ($1)\n";
            }
            else {
                print "Line didn't match: $line\n";
            }
        }

        __DATA__
        "/moreIters 10"
        "/bootMe any text here"
        /fewIter
        /some stuff here
        "/albatross" foo bar baz
        monkeys
        leprechauns /not monkeys
        /gnomes "not leprechauns though"


    Output:

        Line matched: "/moreIters 10" (/moreIters 10)
        Line matched: "/bootMe any text here" (/bootMe any text here)
        Line matched: /fewIter (/fewIter)
        Line matched: /some stuff here (/some stuff here)
        Line matched: "/albatross" foo bar baz (/albatross)
        Line didn't match: monkeys
        Line matched: leprechauns /not monkeys (/not monkeys)
        Line matched: /gnomes "not leprechauns though" (/gnomes )

      I just noticed that this regex fails for the following input:

          /gnomes more data here

      My expected string is only /gnomes, whereas it matches everything up to the end of the line. Any idea on how to fix this?


            if (m{"(/[^"]+)"|(/\S+)}) {
                my $match = defined $1 ? $1 : $2;
                ...
            }

        Or whatever's appropriate instead of \S.

        Update: Fixed slashes
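
        The alternation above can be tried out with a minimal sketch like the one below. The helper name extract_cmd and the sample lines are assumptions made for illustration; only the regex itself comes from the reply.

        ```perl
        use strict;
        use warnings;

        # extract_cmd is a hypothetical helper name, not from the thread.
        # It returns the quoted /command (without the quotes) if the first
        # alternative matches, otherwise the first run of non-whitespace
        # characters that starts with "/". It returns undef on no match.
        sub extract_cmd {
            my ($line) = @_;
            if ($line =~ m{"(/[^"]+)"|(/\S+)}) {
                return defined $1 ? $1 : $2;
            }
            return;
        }

        for my $line ('"/bootMe any text here"', '/gnomes more data here', 'monkeys') {
            my $cmd = extract_cmd($line);
            print defined $cmd ? "$line => $cmd\n" : "$line => (no match)\n";
        }
        ```

        Note that the regex engine takes the leftmost match, so on a line such as /gnomes "/other thing" the unquoted /gnomes is returned, not the quoted string.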

        With that additional qualification, it will get a bit more tricky. My first thought was to add a space to the character class: m,"?(/[^" ]*)"?,

        But that doesn't work: the character class can't tell whether a space is inside or outside the quotes, so the match stops at the first space either way. Meaning it would capture just "/bootMe" from the line "/bootMe any text here".
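
        A short sketch of that failure, for anyone who wants to see it. The helper name capture_with_space_class is made up for this example; the regex is the one with the space added to the character class.

        ```perl
        use strict;
        use warnings;

        # capture_with_space_class is a hypothetical name for illustration.
        # It applies the regex whose character class excludes both the
        # double quote and the space, and returns the capture (or undef).
        sub capture_with_space_class {
            my ($line) = @_;
            return $line =~ m,"?(/[^" ]*)"?, ? $1 : undef;
        }

        for my $line ('"/bootMe any text here"', '/gnomes more data here') {
            my $got = capture_with_space_class($line);
            print "$line => ($got)\n";
        }
        ```

        The unquoted line yields /gnomes as wanted, but the quoted line yields only /bootMe, because the match stops at the first space whether or not it is inside quotes.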

        I'd suggest looking into a module like Text::xSV or Text::CSV_XS and setting the delimiter to spaces. Then reject any entry that doesn't have a leading slash. This means dropping the regex entirely.

        Something like:

            #!/usr/bin/perl
            use strict;
            use warnings;
            use Text::CSV_XS;

            my $csv = Text::CSV_XS->new ({sep_char => ' '});

            while (my $line = <DATA>) {
                chomp $line;
                # See perldoc Text::CSV_XS for warnings
                # about this approach with possible embedded
                # newlines:
                my $status = $csv->parse($line);
                my @fields;
                if ($status) {
                    @fields = $csv->fields();
                }
                else {
                    warn "Problem parsing $line\n";
                }
                for my $field (@fields) {
                    print "Captured ($field) from $line\n" if $field =~ m!^/!;
                }
            }

            __DATA__
            "/moreIters 10"
            "/bootMe any text here"
            /fewIter
            /some stuff here
            "/albatross" foo bar baz
            monkeys
            leprechauns /not monkeys
            /gnomes "not leprechauns though"
            /gnomes more data here

        Which gives the output:

            Captured (/moreIters 10) from "/moreIters 10"
            Captured (/bootMe any text here) from "/bootMe any text here"
            Captured (/fewIter) from /fewIter
            Captured (/some) from /some stuff here
            Captured (/albatross) from "/albatross" foo bar baz
            Captured (/not) from leprechauns /not monkeys
            Captured (/gnomes) from /gnomes "not leprechauns though"
            Captured (/gnomes) from /gnomes more data here

      Thank you so much. This is exactly what I wanted.
Re: dumb regex question
by ELISHEVA (Prior) on Apr 07, 2009 at 10:08 UTC

    ikegami's solution will work nicely if you only have one such pattern per line. If you have more than one such pattern, you might want to try using the g modifier with your regex. This lets you repeatedly search for a pattern in a line, like this:

        use strict;
        use warnings;

        my $FORMAT="%-48s match=%s\n";

        while (my $line = <DATA>) {
            chomp $line;
            my $bFound=0;
            while ($line =~ m{"/([^"]+)"|/(\S+)}g) {
                my $sMatch = defined($1) ? $1 : $2;
                print sprintf($FORMAT, "<$line>", "<$sMatch>");
                $bFound=1;
            }
            print sprintf($FORMAT, "<$line>", "---") unless $bFound;
        }

        __DATA__
        "/moreIters 10"
        "/blah X Y Z"
        /fewIter
        /some stuff here
        "/albatross" foo bar baz
        monkeys
        leprechauns /not monkeys
        /gnomes "not leprechauns though"
        /grumpy gnomes "/silly elves" "/funny" fairies

    which outputs

        <"/moreIters 10">                                match=<moreIters 10>
        <"/blah X Y Z">                                  match=<blah X Y Z>
        </fewIter>                                       match=<fewIter>
        </some stuff here>                               match=<some>
        <"/albatross" foo bar baz>                       match=<albatross>
        <monkeys>                                        match=---
        <leprechauns /not monkeys>                       match=<not>
        </gnomes "not leprechauns though">               match=<gnomes>
        </grumpy gnomes "/silly elves" "/funny" fairies> match=<grumpy>
        </grumpy gnomes "/silly elves" "/funny" fairies> match=<silly elves>
        </grumpy gnomes "/silly elves" "/funny" fairies> match=<funny>

    Best, beth
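
    As a possible variation on the g-modifier loop above (not something posted in the thread): assuming Perl 5.10 or later, the branch-reset group (?|...) numbers both alternatives as group 1, so a list-context match with /g returns one capture per match with no undef placeholders. The helper name all_cmds is hypothetical.

    ```perl
    use strict;
    use warnings;
    use 5.010;    # the branch-reset group (?|...) needs Perl 5.10 or later

    # all_cmds is a hypothetical helper name, not from the thread.
    # It returns every command on the line, with the leading slash
    # (and any surrounding quotes) left out, as in the loop version.
    sub all_cmds {
        my ($line) = @_;
        my @found = $line =~ m{(?|"/([^"]+)"|/(\S+))}g;
        return @found;
    }

    my $line = '/grumpy gnomes "/silly elves" "/funny" fairies';
    print join(', ', map { "<$_>" } all_cmds($line)), "\n";
    ```

    For the sample line this collects grumpy, silly elves and funny. Without the branch reset, each match would contribute two list elements, one of them undef.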

      Thanks for all replies, holy monks! You are the best :)
