
Group matching - Extracting what matches

by madbee (Acolyte)
on Jul 02, 2013 at 00:43 UTC ( #1041936=perlquestion )
madbee has asked for the wisdom of the Perl Monks concerning the following question:


I'm trying to parse a string and extract the word that comes before any one of a group of terms. Below is my code:

    $str = "This is a string which has bi-weekly data";
    if ($str =~ /(\w+)[- ](daily|month|days|weekly|week)/) {
        $found = $1;
    }
    print $found;

This works well and returns "bi", which is what I want. However, I also want to know which of the terms matched. Is there a way I can print that as well? My final output should be "bi, weekly" or "12, month".

Thanks a lot in advance


Replies are listed 'Best First'.
Re: Group matching - Extracting what matches
by toolic (Bishop) on Jul 02, 2013 at 00:55 UTC
    You are already capturing it in $2. You just need to print it:
      use warnings;
      use strict;

      my $str = "This is a string which has bi-weekly data";
      if ($str =~ /(\w+)[- ](daily|month|days|weekly|week)/) {
          my $found = $1;
          print "$found $2\n";
      }

      __END__
      bi weekly
Re: Group matching - Extracting what matches
by rjt (Deacon) on Jul 02, 2013 at 01:54 UTC

    You already have captured the (daily|month|days|...) group as $2, so you can just do something like:

      my ($found, $term) = ($1, $2);
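A match in list context returns its captured groups directly, so the assignment can also be folded into the match itself. A minimal sketch reusing the pattern from the question; the sub name `split_term` and the second input string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (word_before, term),
# or an empty list when the pattern does not match.
sub split_term {
    my ($str) = @_;
    # In list context, a successful match returns ($1, $2).
    return $str =~ /(\w+)[- ](daily|month|days|weekly|week)/;
}

for my $str ("This is a string which has bi-weekly data",
             "Billed every 12 month") {
    # The list assignment counts the returned elements: 0 means no match.
    my ($found, $term) = split_term($str) or next;
    print "$found, $term\n";
}
```

This prints "bi, weekly" and then "12, month", which is the output format asked for in the question.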

    However, I question whether your regexp is going to be adequate. Usually "daily" doesn't have anything relevant in front of it, so you'd be capturing, for example, "Here is (my) (daily) string!" (captured groups shown in parens). And do you really want "(bi)-(month)ly" captured as such? "Every (other) (week)"? "Biweekly" (no hyphen) and "semiweekly" are usually acceptable in English as well, and your regexp will not match them at all.
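To see these cases concretely, here is a small sketch that runs the pattern from the question, unchanged, against the example phrases. The helper name `show_capture` and the exact test sentences are only illustrative:

```perl
use strict;
use warnings;

# The pattern from the question, unchanged.
my $re = qr/(\w+)[- ](daily|month|days|weekly|week)/;

# Hypothetical helper: format the two captures, or report that nothing matched.
sub show_capture {
    my ($str) = @_;
    my ($word, $term) = $str =~ $re or return "no match";
    return "($word) ($term)";
}

printf "%-28s => %s\n", $_, show_capture($_) for (
    "Here is my daily string!",   # (my) (daily): irrelevant word captured
    "A bi-monthly newsletter",    # (bi) (month): the "ly" is dropped
    "Every other week",           # (other) (week)
    "A biweekly newsletter",      # no match: there is no hyphen or space
);
```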

    What I'm trying to say is that parsing recurring time periods in English is hard enough, and pulling them out of sentences is harder still. My suggestion would be to match the entire period expression precisely, so you don't pull in extraneous information. Your regexp will match less often, but some false negatives are likely preferable to inaccurate parsing.

    I found a periodic frequency list for you that contains additional terms (some of them archaic and likely not applicable). I suggest more research. Then, I'd start building a regexp something like what I've started below.

    Note, of course, that this is only a suggestion of a starting point. What I've come up with is certainly incomplete and needs to be expanded and tested rigorously with a sizable corpus of input strings.

      #!/usr/bin/env perl
      use 5.010;
      use warnings;
      use strict;

      my $NUMBER = qr/(?i:three|four|five|six|seven|eight|nine|ten|\d+)/;
      my $PERIOD = qr/(?i:day|week|month|quarter|year)/;

      for ( map { chomp; $_ } <DATA> ) {
          say "`$_' contains `$1'" if /\b
              (
                  (?:bi|semi)? [-]? (?:weekly|monthly)
                | (?:(?:every\sother|twice)\s)? (?:daily|monthly|quarterly|annually)
                | (?:once|twice|$NUMBER\stimes)\s (?:a|per)\s $PERIOD
                | (?:every\s(?:(?:other|twice)\s)?)? $PERIOD
                | (?:se|bi)?mestral
              )
          \b /xi;
      }

      __DATA__
      Here is the weekly TPS report.
      I go for a walk semimonthly.
      How often do you clean this toilet? Quarterly?!
      The sun comes up seven times per week.
      I get older every year.
      Not many people say "bimestral" anymore.


      `Here is the weekly TPS report.' contains `weekly'
      `I go for a walk semimonthly.' contains `semimonthly'
      `How often do you clean this toilet? Quarterly?!' contains `Quarterly'
      `The sun comes up seven times per week.' contains `seven times per week'
      `I get older every year.' contains `every year'
      `Not many people say "bimestral" anymore.' contains `bimestral'

    Only once you are able to extract the entire period would I suggest you then attempt to parse it. (i.e., once you have "bimonthly", further parse or interpret that as you see fit).
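As a rough sketch of that second step, the snippet below splits an already-extracted period into an optional prefix and a base unit. Everything here is an assumption for illustration: the helper name `parse_period`, the `%UNIT` mapping, and the set of adjectives it accepts. What "bi" should mean (twice per period, or every two periods) is deliberately left to the caller, since English usage is ambiguous.

```perl
use 5.010;
use strict;
use warnings;

# Assumed mapping from adjective to base unit; extend as needed.
my %UNIT = (
    daily     => 'day',
    weekly    => 'week',
    monthly   => 'month',
    quarterly => 'quarter',
    annually  => 'year',
);

# Hypothetical helper: takes a period that was already extracted
# (e.g. "bi-weekly") and returns (prefix => ..., unit => ...),
# or an empty list if the period is not recognized.
sub parse_period {
    my ($period) = @_;
    my ($prefix, $adj) =
        lc($period) =~ /\A(bi|semi)?-?(daily|weekly|monthly|quarterly|annually)\z/
        or return;
    return (prefix => $prefix // '', unit => $UNIT{$adj});
}

for my $p (qw(bimonthly semi-weekly Quarterly)) {
    my %info = parse_period($p) or next;
    say "$p: prefix='$info{prefix}' unit=$info{unit}";
}
```

Anchoring with `\A` and `\z` keeps this step strict: it only accepts a whole period, so anything the extraction regexp let through by mistake is rejected rather than half-parsed.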

Node Type: perlquestion [id://1041936]
Approved by davido
Front-paged by rjt