PerlMonks  

dumb regex question

by linuxfan (Beadle)
on Apr 06, 2009 at 23:57 UTC ( #755895=perlquestion )
linuxfan has asked for the wisdom of the Perl Monks concerning the following question:

Monks,


I haven't been able to figure out this simple regex problem. I need to match strings in one of the two formats below, quoted or unquoted (after reading from a file):

"/moreIters 10"
"/bootMe any text here"
/fewIter

If the string begins with a quote, I need all data enclosed within the quotes, else the entire string that begins with /

To begin with, I wrote the following simplistic regex:

$str =~ m,"?(/.*)"?,
which means: match zero or one double quote, followed by /, followed by any number of characters, followed by an optional double quote. However, this doesn't work because .* matches the quotes as well, and I get the following:

my $str = qq( "/extend 100" );
$str =~ m,"?(/.*)"?,;
if (defined $1) {
    print "matched=$1\n";
}

$ perl regex.pl
matched=/extend 100"
The above works for strings such as /justThis:

my $str = qq(/extendMe );
$str =~ m,"?(/.*)"?,;
if (defined $1) {
    print "matched=$1\n";
}

$ perl regex.pl
matched=/extendMe

How can I refine this regex to get the desired result?

Thanks!

Re: dumb regex question
by Nkuvu (Priest) on Apr 07, 2009 at 00:10 UTC

    I'd change the regex to exclude quotes, rather than match everything: m,"?(/[^"]*)"?,

    Test script (including some lines where I tried to break the match):

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (my $line = <DATA>) {
        chomp $line;
        if ($line =~ m,"?(/[^"]*)"?,) {
            print "Line matched: $line ($1)\n";
        }
        else {
            print "Line didn't match: $line\n";
        }
    }

    __DATA__
    "/moreIters 10"
    "/bootMe any text here"
    /fewIter
    /some stuff here
    "/albatross" foo bar baz
    monkeys
    leprechauns /not monkeys
    /gnomes "not leprechauns though"

    Output:

    Line matched: "/moreIters 10" (/moreIters 10)
    Line matched: "/bootMe any text here" (/bootMe any text here)
    Line matched: /fewIter (/fewIter)
    Line matched: /some stuff here (/some stuff here)
    Line matched: "/albatross" foo bar baz (/albatross)
    Line didn't match: monkeys
    Line matched: leprechauns /not monkeys (/not monkeys)
    Line matched: /gnomes "not leprechauns though" (/gnomes )

      Thank you so much. This is exactly what I wanted.
      I just noticed that this regex fails for the following input:
      /gnomes more data here
      My expected string is only /gnomes, whereas it matches everything up to the end of the line. Any idea how to fix this?

      thanks

        With that additional qualification, it will get a bit more tricky. My first thought was to add a space to the character class: m,"?(/[^" ]*)"?,

        But that doesn't work because the regex doesn't care whether the space it finds is inside or outside a quoted string; it stops at the first space either way. That means it would capture just "/bootMe" from the line "/bootMe any text here".
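A minimal sketch of that failure, for anyone who wants to see it run. The helper name capture_no_space is made up for this illustration; the regex is the one from the paragraph above with a space added to the character class.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): applies the regex with
# a space added to the negated character class.
sub capture_no_space {
    my ($line) = @_;
    my ($captured) = $line =~ m,"?(/[^" ]*)"?,;
    return $captured;
}

for my $line ('"/bootMe any text here"', '/gnomes more data here') {
    print "$line => ", capture_no_space($line), "\n";
}
```

The unquoted line yields /gnomes as wanted, but the quoted line is cut to /bootMe, because the class cannot tell a space inside quotes from one outside.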

        I'd suggest looking into a module like Text::xSV or Text::CSV_XS and setting the delimiter to spaces. Then reject any entry that doesn't have a leading slash. This means dropping the regex entirely.

        Something like:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Text::CSV_XS;

        my $csv = Text::CSV_XS->new ({sep_char => ' '});

        while (my $line = <DATA>) {
            chomp $line;
            # See perldoc Text::CSV_XS for warnings
            # about this approach with possible embedded
            # newlines:
            my $status = $csv->parse($line);
            my @fields;
            if ($status) {
                @fields = $csv->fields();
            }
            else {
                warn "Problem parsing $line\n";
            }
            for my $field (@fields) {
                print "Captured ($field) from $line\n" if $field =~ m!^/!;
            }
        }

        __DATA__
        "/moreIters 10"
        "/bootMe any text here"
        /fewIter
        /some stuff here
        "/albatross" foo bar baz
        monkeys
        leprechauns /not monkeys
        /gnomes "not leprechauns though"
        /gnomes more data here

        Which gives the output:

        Captured (/moreIters 10) from "/moreIters 10"
        Captured (/bootMe any text here) from "/bootMe any text here"
        Captured (/fewIter) from /fewIter
        Captured (/some) from /some stuff here
        Captured (/albatross) from "/albatross" foo bar baz
        Captured (/not) from leprechauns /not monkeys
        Captured (/gnomes) from /gnomes "not leprechauns though"
        Captured (/gnomes) from /gnomes more data here

        if (m{"(/[^"]+)"|(/\S+)}) {
            my $match = defined $1 ? $1 : $2;
            ...
        }
        Or whatever's appropriate instead of \S.

        Update: Fixed slashes
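As a rough sketch, the alternation above can be wrapped in a small helper and run against the inputs from this thread. The name extract_command is an assumption made for illustration, not something from the original post; the regex itself is unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread). Tries the quoted
# form first, then falls back to a bare /word. Returns undef when
# neither alternative matches.
sub extract_command {
    my ($line) = @_;
    if ($line =~ m{"(/[^"]+)"|(/\S+)}) {
        # Only one of the two capture groups takes part in a match.
        return defined $1 ? $1 : $2;
    }
    return;
}

for my $line ('"/bootMe any text here"', '/gnomes more data here', 'monkeys') {
    my $match = extract_command($line);
    print "$line => ", (defined $match ? $match : '(no match)'), "\n";
}
```

The quoted branch keeps embedded spaces, while the unquoted branch stops at the first whitespace, which is the behaviour asked for with /gnomes more data here.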

Re: dumb regex question
by ELISHEVA (Prior) on Apr 07, 2009 at 10:08 UTC

    ikegami's solution will work nicely if you only have one such pattern per line. If you have more than one such pattern, you might want to try using the g modifier with your regex. This lets you repeatedly search for a pattern in a line, like this:

    use strict;
    use warnings;

    my $FORMAT="%-48s match=%s\n";

    while (my $line = <DATA>) {
        chomp $line;
        my $bFound=0;
        while ($line =~ m{"/([^"]+)"|/(\S+)}g) {
            my $sMatch = defined($1) ? $1 : $2;
            print sprintf($FORMAT, "<$line>", "<$sMatch>");
            $bFound=1;
        }
        print sprintf($FORMAT, "<$line>", "---") unless $bFound;
    }

    __DATA__
    "/moreIters 10"
    "/blah X Y Z"
    /fewIter
    /some stuff here
    "/albatross" foo bar baz
    monkeys
    leprechauns /not monkeys
    /gnomes "not leprechauns though"
    /grumpy gnomes "/silly elves" "/funny" fairies

    which outputs

    <"/moreIters 10">                                match=<moreIters 10>
    <"/blah X Y Z">                                  match=<blah X Y Z>
    </fewIter>                                       match=<fewIter>
    </some stuff here>                               match=<some>
    <"/albatross" foo bar baz>                       match=<albatross>
    <monkeys>                                        match=---
    <leprechauns /not monkeys>                       match=<not>
    </gnomes "not leprechauns though">               match=<gnomes>
    </grumpy gnomes "/silly elves" "/funny" fairies> match=<grumpy>
    </grumpy gnomes "/silly elves" "/funny" fairies> match=<silly elves>
    </grumpy gnomes "/silly elves" "/funny" fairies> match=<funny>
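If the leading slash should stay in the result, as in the original question, one possible variation is to move it inside the capture groups and collect every hit into a list. This is only a sketch: the helper name extract_all is hypothetical, and the pattern is ikegami's alternation combined with the g modifier shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread). Walks the line with
# the g modifier and returns every quoted or bare /command found,
# leading slash included.
sub extract_all {
    my ($line) = @_;
    my @found;
    while ($line =~ m{"(/[^"]+)"|(/\S+)}g) {
        push @found, defined $1 ? $1 : $2;
    }
    return @found;
}

my @cmds = extract_all('/grumpy gnomes "/silly elves" "/funny" fairies');
print join(' | ', @cmds), "\n";
```

Each pass of the while loop resumes where the previous match ended, so quoted and unquoted commands on the same line are picked up in order.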

    Best, beth

      Thanks for all replies, holy monks! You are the best :)
