Understanding regular expressions: why do I have to use map to clear up undefs in regex output?

by corenth (Monk)
on Jun 19, 2009 at 21:43 UTC (#773137=perlquestion)
corenth has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse a simple string. In doing so, I want to get rid of quotes. The regex as it stands doesn't reflect the possibility of escaped quotes.

Let's take this string: '1,2,3,4,"fine","day","today",'

If I could remove all the quotes (or didn't care about them) I could use:
    @a = $string =~ /
        (                   #capture matches and put them into @a
            (?:\d+)
            |
            (?:"(?:.+?)")
        )
    /gx;
    print Dumper \@a;
to give me :
$VAR1 = [ '1', '2', '3', '4', '"fine"', '"day"', '"today"' ];
I don't want those quotes, so I tried:
    @b = $string =~ /
        (?:                 #don't capture matches
            (\d+)
            |
            (?:"(.+?)")     #capture numbers and anything inside of quotes
        )
    /gx;
    print Dumper \@b;
which gives me:
$VAR1 = [ '1', undef, '2', undef, '3', undef, '4', undef, undef, 'fine', undef, 'day', undef, 'today' ];
Oh NO!!! all those undefs and I might want to use a hash instead of an array!! Oh, what can I do?
    @c = map {/.+/g}        #if it's an undef, it's NOT getting into that array ;)
        $string =~ /
            (?:
                (\d+)
                |
                (?:"(.+?)")
            )
        /gx;
    print Dumper \@c;
to get:
$VAR1 = [ '1', '2', '3', '4', 'fine', 'day', 'today' ];
LOVELY!!

But why? Granted, this exercise has helped me to understand map() a lot better, but I would like to know what it is about my pattern that creates all those undefs, and whether there is a way to do this without producing them in the first place.

I'm happy with my solution, but I'm unhappy with the fact that I don't understand the problem. If you have an idea about how to help me understand, I'd be very thankful.

---

($state{tired})?(sleep($wink * 40)): (eat($food));

Re: Understanding regular expressions: why do I have to use map to clear up undefs in regex output?
by ikegami (Pope) on Jun 19, 2009 at 22:04 UTC

    I would like to know what it is about my pattern that creates all those undefs

    The match operator returns the value captured by each capture. The following has two captures:

    /(?: (\d+) | (?:"(.+?)") ) /gx
         ^           ^
         |           |

    so two values are returned for each match. Given the pattern, one will always be undef since one will always be outside the path that matched.

    It's not evident from your code that it would be a problem to only return one value because you treat both values equally. In real life, you almost always want to treat the two kinds of matches differently. For example,

    sub dequote {
        my ($s) = @_;
        $s =~ s/\\(.)/$1/g;
        return $s;
    }

    push @matches, defined($1) ? $1 : dequote($2)
        while /(?: (\d+) | (?:"(.+?)") ) /gx;

    If the pattern only returned one value, you wouldn't be able to tell which part of the pattern matched, so you couldn't take decisions based on that (such as whether to call dequote or not).
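The pattern in this reply still uses "(.+?)", which stops at the first quote it sees, escaped or not. As a rough sketch only, the quoted alternative could be widened to accept backslash-escaped characters before handing the text to dequote. The input string and the widened pattern here are illustrative assumptions, not code from the thread.

```perl
use strict;
use warnings;

# Undo backslash escapes inside a quoted field: \" becomes ", \\ becomes \.
sub dequote {
    my ($s) = @_;
    $s =~ s/\\(.)/$1/g;
    return $s;
}

# Hypothetical input with an escaped quote inside a quoted field.
my $input = '1,"say \"hi\"",22,"day"';

my @matches;
while ($input =~ / (\d+) | "( (?: [^"\\] | \\. )* )" /gx) {
    # Group 1 is a bare number; group 2 is the inside of a quoted field.
    push @matches, defined($1) ? $1 : dequote($2);
}

print join('|', @matches), "\n";
```

Because the loop checks which group is defined, only quoted fields go through dequote; numbers are pushed unchanged.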

Re: Understanding regular expressions: why do I have to use map to clear up undefs in regex output?
by Transient (Hermit) on Jun 19, 2009 at 22:07 UTC
    From perlop:
    The /g modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
    Your first regex worked as planned because it has only one set of capturing parentheses, wrapped around the whole alternation. Each match therefore returned exactly one value: the text that matched.

    In the second regex, you have two sets of capturing parens. If you look at the pattern of undefs in @b, you can see that each match returns undef for whichever half of the alternation did not match.

    I'm not exactly sure how you could match that the way you want using one regex alone and still retain the matching capabilities.
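One possible single-regex answer, assuming Perl 5.10 or later, is the branch-reset group (?|...). Inside it, capture numbering restarts in each alternative, so both halves share group 1 and every match contributes one value. This is a minimal sketch against the string from the question, not something proposed by the posters.

```perl
use strict;
use warnings;

my $string = '1,2,3,4,"fine","day","today",';

# (?| ... ) is a branch-reset group: capture numbering restarts in each
# alternative, so both (\d+) and (.+?) are group 1.
my @d = $string =~ / (?| (\d+) | "(.+?)" ) /gx;

print join(',', @d), "\n";
```

The trade-off is the one raised earlier in the thread: with a single shared group you can no longer tell which alternative produced a given value.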
Re: Understanding regular expressions: why do I have to use map to clear up undefs in regex output?
by jwkrahn (Monsignor) on Jun 19, 2009 at 22:15 UTC
    I would like to know what it is about my pattern that creates all those undefs

    Your pattern   (\d+) | (?:"(.+?)")   has two groups of capturing parentheses so when one group matches the other returns undef and vice-versa.
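To make the pairing visible, the flat list returned by the match can be walked two elements at a time. This is an illustrative sketch using the string from the question; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $string = '1,2,3,4,"fine","day","today",';

# Two capture groups, so every successful match adds two list elements.
my @flat = $string =~ / (?: (\d+) | "(.+?)" ) /gx;

# Walk the flat list in pairs: slot 0 is group 1, slot 1 is group 2.
my @report;
for (my $i = 0; $i < @flat; $i += 2) {
    my ($num, $quoted) = @flat[$i, $i + 1];
    push @report, defined $num
        ? "group 1 matched: $num"
        : "group 2 matched: $quoted";
}

print "$_\n" for @report;
print scalar(@flat), " values from ", scalar(@report), " matches\n";
```

Seven matches times two groups gives the fourteen elements seen in @b, and in each pair exactly one slot is defined.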

Re: Understanding regular expressions: why do I have to use map to clear up undefs in regex output?
by AnomalousMonk (Monsignor) on Jun 19, 2009 at 23:14 UTC
    BTW, the use of the statement  map { /.+/g } to filter out undefined values works only if the warning about uninitialized values in pattern matches is turned off, and has the additional shortcoming of filtering out empty strings.

    Much better, IMO, to grep for only defined values (single-quotes used in example instead of double-quotes to avoid lots of escaping on XP command line):

    >perl -wMstrict -MData::Dumper -le
      "my $str = q{1,2,3,4,'fine', '', 'day','today',};
       my @f = grep defined, $str =~ m{ (\d+) | ' ([^']*) ' }xmsg ;
       print Dumper \@f;
      "
    $VAR1 = [ '1', '2', '3', '4', 'fine', '', 'day', 'today' ];
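The difference between the two filters can be seen side by side on a small hand-made list. This is a sketch with invented data; the warning category is silenced only around the map so the comparison runs cleanly under warnings.

```perl
use strict;
use warnings;

# Hypothetical match results: real values, undefs, and one empty string.
my @raw = ('1', undef, undef, '', undef, 'day');

# map { /.+/g } keeps only values with at least one character, so it drops
# the empty string as well as the undefs (and warns on undef unless silenced).
my @via_map = do {
    no warnings 'uninitialized';
    map { /.+/g } @raw;
};

# grep { defined } removes the undefs and nothing else.
my @via_grep = grep { defined } @raw;

print scalar(@via_map),  " kept by map\n";
print scalar(@via_grep), " kept by grep\n";
```

Here map keeps two values and grep keeps three, the extra one being the empty string.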
Re: Understanding regular expressions: why do I have to use map to clear up undefs in regex output?
by Marshall (Prior) on Jun 20, 2009 at 16:52 UTC
    Another method is to use s/// to clean up the string and then just use split.
    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my $s = '1,2,3,4,"fine",,,"day","today",,,';
    $s =~ s/['"]//g;    #no quotes
    my @s = split(/,/,$s);
    print Dumper \@s;
    __END__
    $VAR1 = [ '1', '2', '3', '4', 'fine', '', '', 'day', 'today' ];
    Update: I personally prefer the {} syntax of grep. For example, to get rid of the blank tokens above, use @s = grep { /\S/ } @s; and to get rid of undef values, use @s = grep { defined $_ } @s;. Perl's grep is a filtering operation; map is a transformation operation. Use grep when you just want a subset of the data and aren't changing it. Use map when you are transforming the input into something else.
Re: Understanding regular expressions: why do I have to use map to clear up undefs in regex output?
by ack (Deacon) on Jun 22, 2009 at 04:06 UTC

    I know that you're asking about matching regexes, but if all you want to do is remove the double quotes, I think you could use a substitution instead, as in the code below.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $string = '1,2,3,4,"fine","day","today"';
    print "\$string = [ $string ]\n";
    print "\n";
    $string =~ s/\"//gx;
    print "\$string = [ $string ]\n";
    exit(0);

    I tried several variations using the match regex approach and couldn't get one to work. I'm sure there is a way; I just couldn't easily construct one. I'm barely regex-literate, though, so the various responses are interesting to me and have already taught me a lot that I didn't know.

    ack Albuquerque, NM
Re: Understanding regular expressions: why do I have to use map to clear up undefs in regex output?
by eric256 (Parson) on Jun 22, 2009 at 16:57 UTC

    Yet another solution ;) Just remove the quotes after you do your splitting.

    @b = map { s/\"//g; $_ } @a;
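One thing to be aware of: inside map, $_ is an alias for each element of @a, so s/\"//g above strips the quotes from @a as well as from @b. If the original list should stay intact, a possible alternative, assuming Perl 5.14 or later, is the /r modifier, which returns the modified copy instead of changing $_. The sample data is invented for illustration.

```perl
use strict;
use warnings;

my @a = ('1', '"fine"', '"day"');

# s///r (Perl 5.14+) returns the modified copy and leaves $_ alone,
# so @a keeps its quotes while @b gets the stripped values.
my @b = map { s/"//gr } @a;

print "a: @a\n";
print "b: @b\n";
```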


    ___________
    Eric Hodges
Re: Understanding regular expressions: why do I have to use map to clear up undefs in regex output?
by corenth (Monk) on Jun 26, 2009 at 23:51 UTC
    Too many good responses. I'm partial to the grep{} ones myself.

    While the code I posted doesn't show it, I am paying attention to the likelihood of escaped quotes, so I'm keeping that in mind while I play with this.

    Thank you! I'm getting a lot out of this.

Node Type: perlquestion [id://773137]
Front-paged by Arunbear