PerlMonks  

how to extract string by possible groupings?

by adrive (Scribe)
on Jun 02, 2014 at 14:15 UTC ( #1088265=perlquestion )

adrive has asked for the wisdom of the Perl Monks concerning the following question:

I'm not really good at regex... but this is something that is making me pull my hair out. Basically, I'm trying to read from a result file that looks something like this:
    Title        Percent2        Percent3
    test1.cpp    0.00% of 21     0.00% of 16
    test2.c      None            16.53% of 484
    test3.h      0.00% of 138    None
I'm trying to extract Title, Percent2, and Percent3 into individual variables. Percent2 and Percent3 can be percentage or 'None'. Here's what I did :
    #!perl
    local @match;
    open(FILE_EXPECTED_RESULT, "< gcov_report.txt");
    while(<FILE_EXPECTED_RESULT>)
    {
        chomp($_);
        if ($_ ne "")
        {
            print $_ . "\n";
            (@match) =($_ =~ /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of)\s+\d+\s)|(\bNone\b)/g);
            print "title : ".$match[0]."\n";
            print "percent2 : ".$match[1]."\n";
            print "percent3 : ".$match[2]."\n";
        }
    }
    close (FILE_EXPECTED_RESULT);
it is returning weird stuff :
    Title Percent2 Percent3
    title :
    percent2 :
    percent3 :
    test1.cpp 0.00% of 21 0.00% of 16
    title : test1.cpp
    percent2 :
    percent3 :
    test2.c None 16.53% of 484
    title : test2.c
    percent2 : test2.c
    percent3 :
    test3.h 0.00% of 138 None
    title : test3.h
    percent2 :
    percent3 : test3.h

Replies are listed 'Best First'.
Re: how to extract string by possible groupings?
by LanX (Archbishop) on Jun 02, 2014 at 14:40 UTC
    I think you are confused about how groupings work

    /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
    #01         2         3            4   5        6              7

    each opening bracket starts a grouping. Groupings that don't match will be undef !

    You can use the extended pattern (?:PATTERN) for clustering without capturing, so that the group does not take up an index.

    update

    ... or even avoid (...) where you don't need any clustering at all (like in your or-branches).
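To make the index skipping concrete, here is a small sketch comparing a pattern with capturing parens in every or-branch against one that clusters with (?:...). The sample string and the variable names are my own, not from the original post.

```perl
use strict;
use warnings;

my $name = 'test1.cpp ';

# Every "(" opens a capture group, so this returns four values.
# Two of them are undef, because the .c and .h branches did not match.
my @capturing = $name =~ /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))/;

# (?:...) clusters the alternatives without capturing, so only
# the single outer group produces a value.
my @clustering = $name =~ /(.*\.(?:cpp|c|h))\s/;

print "capturing returns  ", scalar(@capturing),  " values\n";
print "clustering returns ", scalar(@clustering), " value\n";
print "title: '$clustering[0]'\n";
```

With the clustered version the title is always at index 0, whichever extension matched.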

    Cheers Rolf

    (addicted to the Perl Programming Language)

      It's possible that a capture group was missed in your explanation:

      /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
      #01         2         3            4   5        6              7

      If you lay that out using the /x modifier it becomes more obvious:

      /
          (                       # 1
              (.*\.c\s)           # 2
              |
              (.*\.h\s)           # 3
              |
              (.*\.cpp\s)         # 4
          )
          |
          (\s+                    # 5
              (.*)\%\s+           # 6
              (of+)\s+\d+\s       # 7
          )
          |
          (\bNone\b)              # 8
      /gx

      My preference would be to first reduce the capturing to just those parts that are needed. For example, it's unlikely that one would want both "1" and "2", "3", and "4". Likewise, it's unlikely that someone would care about "5" while also caring about "6", and "7".

      Second, resort to named captures: (?<somename>...). And third, to look at breaking it up into smaller problems with /g and \G
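As a rough sketch of the "/g and \G" idea (my own illustration, assuming the percent fields look exactly like the sample data), each match below picks up where the previous one stopped, so the line is consumed one field at a time:

```perl
use strict;
use warnings;

my $line = 'test2.c    None    16.53% of 484';
my @fields;

# First token: the title. \G matches at position 0 on a fresh string.
push @fields, $1 if $line =~ /\G(\S+)/gc;

# Remaining tokens: each match resumes where the previous one ended.
# The /c flag keeps the position when a match finally fails.
while ($line =~ /\G\s+(None|\d+\.\d+% of \d+)/gc) {
    push @fields, $1;
}

print join(' | ', @fields), "\n";
```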

      I think, in particular, that named captures and (?:...) grouping where capturing isn't needed would make this easier to use.
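A minimal sketch of what that could look like, assuming Perl 5.10 or later (where named captures and %+ are available). The capture names and the $field helper are hypothetical choices of mine:

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need Perl 5.10 or later

# One sub-pattern for either kind of percent field.
my $field = qr/None|\d+\.\d+% of \d+/;

my $line = 'test3.h    0.00% of 138    None';
my %row;

if ($line =~ /^(?<title>\S+\.(?:cpp|c|h))\s+(?<percent2>$field)\s+(?<percent3>$field)\s*$/) {
    %row = %+;    # copy the named captures into an ordinary hash
    say "$_ => $row{$_}" for sort keys %row;
}
```

The fields are then looked up by name, so adding or removing a group elsewhere in the pattern does not shift any indexes.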


      Dave

        ... named captures ...

        I think I would opt for a different course. Elaborating (well, second-guessing, really) on the example below, once you have validated a line, and given that the fields are completely mutually exclusive, the fields just pop out and go down as smoothly as oysters, with no capturing at all (update: no capturing to capture groups, that is).

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use Regexp::Common;
         ;;
         my @lines = (
           'test1.cpp 0.00% of 21 0.00% of 16',
           'test2.c None 16.53% of 484',
           'test3.h 0.00% of 138 None',
           '/x/y/foo.c 0.00% of 1 None',
           );
         ;;
         my $title   = qr{ \w+ (?: [.] \w+)* }xms;
         my $percent = qr{ $RE{num}{real} % \s+ of \s+ \d+ }xms;
         my $none    = qr{ None }xms;
         ;;
         for my $line (@lines) {
           print qq{line '$line'};
           die qq{ BAD LINE: '$line'} unless $line =~
               m{ \A $title (?: \s+ (?: $percent | $none)){2} \s* \z }xms;
           my ($t, $p1, $p2) = $line =~ m{ \A $title | $percent | $none }xmsg;
           print qq{ title: '$t' pcent1: '$p1' pcent2: '$p2'};
           }
        "
        line 'test1.cpp 0.00% of 21 0.00% of 16'
         title: 'test1.cpp' pcent1: '0.00% of 21' pcent2: '0.00% of 16'
        line 'test2.c None 16.53% of 484'
         title: 'test2.c' pcent1: 'None' pcent2: '16.53% of 484'
        line 'test3.h 0.00% of 138 None'
         title: 'test3.h' pcent1: '0.00% of 138' pcent2: 'None'
        line '/x/y/foo.c 0.00% of 1 None'
         BAD LINE: '/x/y/foo.c 0.00% of 1 None' at -e line 1.

        Updates:

        1. Actually removed capturing groups from validation regex.
        2. It turns out the fields are not "completely mutually exclusive" as I originally claimed, so I had to change the extraction regex from
              m{ $title | $percent | $none }xmsg
          to
              m{ \A $title | $percent | $none }xmsg
          This somewhat vitiates the intended thrust of this post, but I think the main point stands. Oh, well...

        > It's possibly that a capture group was missed in your explanation:

        no, I started counting with 0 and you with 1.

        see Re^3: how to extract string by possible groupings? for why I did what I did! :)

        Cheers Rolf

        (addicted to the Perl Programming Language)


      /((.*\.c\s)|(.*\.h\s)|(.*\.cpp\s))|(\s+(.*)\%\s+(of+)\s+\d+\s)|(\bNone\b)/g
      #01         2         3            4   5        6              7

      Capture group numbering begins at 1, not 0, so the capture group variables corresponding to the capturing groups in the example would be $1 .. $8. In the @- and @+ arrays, the offsets of the entire match are held at index 0. Otherwise, $0 holds the script name. See Variables related to regular expressions and perlvar in general.

      Update: The [originally posted] question was for a match in list context ... which returns the matches as a list into an array. Quite right; my mistake.
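The two numbering schemes can be seen side by side in a short sketch (the sample string and variable names are my own): $1 and $2 are one-based, index 0 of @- and @+ describes the whole match, and a list-context match returns the same captures as a zero-based list.

```perl
use strict;
use warnings;

my $str = 'see test1.cpp';
my ($first, $second, $whole_start, $whole_end);

if ($str =~ /(\w+)\.(\w+)/) {
    # Numbered capture variables start at $1.
    ($first, $second) = ($1, $2);
    # Index 0 of @- and @+ holds the offsets of the entire match.
    ($whole_start, $whole_end) = ($-[0], $+[0]);
    print "first: $first, second: $second\n";
    print "whole match spans offsets $whole_start to $whole_end\n";
}

# In list context the same captures come back as a list,
# so what was $1 ends up in element 0.
my @list = $str =~ /(\w+)\.(\w+)/;
print "list element 0: $list[0]\n";
```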

        See OP

        The question was for a match in list context

        (@match) = ( $_ =~ /.../g )

        which returns the matches as a list into an array.

        i.e. $match[0]=$1 ( see perlop ¹ )

        I didn't want to confuse with more details than necessary...

        Cheers Rolf

        (addicted to the Perl Programming Language)

        update

        well actually the /g modifier isn't necessary and might produce too many matches...

        DB<110> @matches = ('abcd' =~ /(.)(.)/)
         => ("a", "b")
        DB<111> @matches = ('abcd' =~ /(.)(.)/g)
         => ("a", "b", "c", "d")
        DB<112> $matches[0]
         => "a"

        ¹) perlop#Regexp-Quote-Like-Operators

        * Matching in list context

        If the "/g" option is not used, "m//" in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3...).

      I tried to come up with a similar illustration of how grouping works but gave up after 10 minutes of coming up with nothing comprehensible. I think you managed to do it quite elegantly, for which ++.

        Thanks, but we have answered this question so many times already that I even doubt this visualization was originally my idea! :)

        Cheers Rolf

        (addicted to the Perl Programming Language)

Re: how to extract string by possible groupings?
by muba (Priest) on Jun 02, 2014 at 15:42 UTC

    There are six things obviously wrong with your regex:

    1. \s matches a single whitespace character, but as far as I can tell from your sample input, there could be multiple spaces between the columns. \s should be written \s+.
    2. You have included the \s+ inside the parens, meaning that the white spaces separating the columns are part of the data you're trying to capture (in other words, $match[0] won't be "test1.cpp", it will actually be "test1.cpp     ", and likewise $match[1] will have trailing spaces).
    3. A percent sign doesn't carry any special meaning inside regular expressions, and thus it doesn't need to be escaped.
    4. You use the /g modifier even though you don't need it.
    5. Your grouping and capturing is a little off, and way too complex.
    6. A good practice is DRY, or Don't Repeat Yourself. A good way to adhere to the DRY principle is to generalize stuff as much as possible. You violate this principle, though.
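    Points 1 and 2 are easy to see in isolation. A minimal sketch, with a sample line and variable names of my own choosing:

```perl
use strict;
use warnings;

my $line = "test1.cpp     0.00% of 21";

# Whitespace inside the parens becomes part of the captured value.
my ($with_space) = $line =~ /(.*\.cpp\s+)/;

# Whitespace outside the parens is matched but not captured.
my ($clean) = $line =~ /(.*\.cpp)\s+/;

print "[$with_space]\n";
print "[$clean]\n";
```

    The brackets in the output make the trailing spaces in the first capture visible.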

    Regarding grouping and capturing, remember that every pair of plain parens inside a regex creates a capturing group (only the (?:...) form does not), and captured substrings are returned in the order of their opening parens (added: as LanX++ beautifully illustrated). Consider the following snippet:

    $string = "foo bar";
    @match = $string =~ m/(f(oo)) (b(ar))/;
    print "$match[0]\n";   # prints "foo" (captured by /(f(oo))/)
    print "$match[1]\n";   # prints "oo"  (captured by /(oo)/)
    print "$match[2]\n";   # prints "bar" (captured by /(b(ar))/)
    print "$match[3]\n";   # prints "ar"  (captured by /(ar)/)

    Likewise, you seem to think that your @match variable will contain three elements, but as a matter of fact it will contain 8 (eight!): one for every pair of parens in your regex, some of which only surround non-data such as the word "of" or just whitespace \s+.

    Don't believe me? Do me a favour and run this snippet (in which I only fixed the \s vs \s+ issue)

    use Data::Dumper;

    while (chomp(my $line = <DATA>)) {
        @match = $line =~ m/((.*\.c\s+)|(.*\.h\s+)|(.*\.cpp\s+))|(\s+(.*)\%\s+(of)\s+\d+\s)|(\bNone\b)/;
        print "$line\n";
        print Dumper \@match;
    }

    __DATA__
    Title Percent2 Percent3
    test1.cpp 0.00% of 21 0.00% of 16
    test2.c None 16.53% of 484
    test3.h 0.00% of 138 None

    The output I get:

    [... snip ...]
    test1.cpp 0.00% of 21 0.00% of 16
    $VAR1 = [
              'test1.cpp ',
              undef,
              undef,
              'test1.cpp ',
              undef,
              undef,
              undef,
              undef
    [... snip ...]

    This neatly demonstrates at least three things:

    1. You've captured the filename twice (once because of the outer group, once because of the extension-specific group for .cpp).
    2. The matched file name includes the trailing white space, which I don't think is part of the filename anyway.
    3. Your @match array contains way more elements than you think it does - nearly three times as many!

    As for the DRY principle, you violate this for example in the chunk of the regex where you try to capture the file names. What you have written is: "match any number of characters, a literal period, a literal 'c', white space; OR match any number of characters, a literal period, a literal 'cpp', white space; OR match any (...)" I'm sure you get the pattern.

    The way I would have written it, would read as: "match any number of characters, a literal period, one of these literal strings ('c', 'cpp', 'h'), whitespace."

    /(.*\.(?:c|cpp|h))\s+/    # Use (?:...) to create a non-capturing group.

    The readability of your script could use some work too. Here's how I would've written it:

    # I always start my script with these two lines.
    # They prevent you from making various mistakes
    # and make debugging a whole lot easier.
    use strict;
    use warnings;

    # Regular expressions have the tendency to become long
    # strings of near-undecipherable line noise. To avoid
    # that, I usually like to split them up in smaller
    # logical chunks.
    # In this case, I'd write one regex to capture the
    # file names and one regex to capture percentages.
    my $title_re   = qr/.*\.(?:c|cpp|h)/;
    my $percent_re = qr/(?:\d+\.\d+% of \d+|None)/;

    # Next thing is to combine them into a single
    # regex to match the input against.
    # I use the /x modifier so that I can use
    # white space and comments inside the regex.
    my $line_re    = qr/
        ($title_re)   \s+   # Match and capture file names, match whitespace
        ($percent_re) \s+   # Match and capture Percent2, match non-data
        ($percent_re)       # Match and capture Percent3
    /x;

    <DATA>;  # Read and discard the first line, as this contains non-data.

    # Read input line by line, cut off newline
    # characters from the end.
    while (my $line = <DATA>) {
        chomp $line;

        # Match input against the regex, capture
        # the stuff into separate variables.
        # I mean, I find a "$title" much more
        # comprehensible than "$match[0]".
        my ($title, $percent2, $percent3) = $line =~ $line_re;

        print "$line\n";
        print "Title: $title\n";
        print "Percent2: $percent2\n";
        print "Percent3: $percent3\n";
        print "\n";
    }

    __DATA__
    Title Percent2 Percent3
    test1.cpp 0.00% of 21 0.00% of 16
    test2.c None 16.53% of 484
    test3.h 0.00% of 138 None

    C:\Users\Lona\Desktop>perl x.pl
    test1.cpp 0.00% of 21 0.00% of 16
    Title: test1.cpp
    Percent2: 0.00% of 21
    Percent3: 0.00% of 16

    test2.c None 16.53% of 484
    Title: test2.c
    Percent2: None
    Percent3: 16.53% of 484

    test3.h 0.00% of 138 None
    Title: test3.h
    Percent2: 0.00% of 138
    Percent3: None

      I wish I could upvote more than once such a useful, detailed and complete post.

        As much as those warm words are appreciated, I do think I could've been even more complete by including links to relevant sections of the documentation, but I didn't feel like it ;)

      thanks! this is really clear and easy to understand. although, what does the symbol ":?" mean? also..i didn't even know qr can prepare regex pattern.. I guess I'm too rusty in perl!!
        > what does the symbol ":?" mean

        it's (?:...), not :?

        see (like already mentioned) perlre#Extended-Patterns

        Cheers Rolf

        (addicted to the Perl Programming Language)

        (?:...) is used for non capturing parentheses. This is useful when you need to regroup a subpattern (for example for an alternation or a quantification), but are not interested in capturing the content in $1, $2, etc.
Re: how to extract string by possible groupings?
by no_slogan (Deacon) on Jun 02, 2014 at 14:42 UTC

    Can you maybe use something like:

    @match = split /\s{2,}/, $_;

    or

    @match = split /\t/, $_;
      oh man..........this is the simplest and it is applicable to my case, since the columns are only ever separated by more than 1 space. thanks a bunch
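A minimal sketch of the split approach on the sample data. It assumes, as stated above, that columns are separated by two or more spaces while a field itself contains only single spaces. The variable names are mine.

```perl
use strict;
use warnings;

my @lines = (
    'test1.cpp    0.00% of 21     0.00% of 16',
    'test2.c      None            16.53% of 484',
    'test3.h      0.00% of 138    None',
);

my @rows;
for my $line (@lines) {
    # Two or more whitespace characters mark a column boundary;
    # the single spaces inside "0.00% of 21" are left alone.
    my ($title, $percent2, $percent3) = split /\s{2,}/, $line;
    push @rows, [$title, $percent2, $percent3];
    print "title: $title, percent2: $percent2, percent3: $percent3\n";
}
```

If a report ever separates two columns with a single space, this breaks, and one of the regex-based answers would be the safer choice.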
Re: how to extract string by possible groupings?
by AnomalousMonk (Chancellor) on Jun 02, 2014 at 16:31 UTC

    The approach of factoring regex sub-expressions can also be helpful. $RE{num}{real} is from Regexp::Common. The $title regex won't properly match something like '/foo/bar/test.c' so this regex (and the others) may need to be refined; this is easier to do if regexes have been factored into individual components.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common;
     ;;
     my @lines = (
       'test1.cpp 0.00% of 21 0.00% of 16',
       'test2.c None 16.53% of 484',
       'test3.h 0.00% of 138 None',
       '/x/y/foo.c 0.00% of 1 None',
       );
     ;;
     my $title   = qr{ \w+ (?: [.] \w+)* }xms;
     my $percent = qr{ $RE{num}{real} % \s+ of \s+ \d+ }xms;
     my $none    = qr{ None }xms;
     ;;
     for my $line (@lines) {
       print qq{line '$line'};
       die qq{bad line: '$line'} unless my ($t, $p1, $p2) = $line =~
           m{ \A ($title) \s+ ($percent | $none) \s+ ($percent | $none) \s* \z }xms;
       print qq{ title: '$t' pcent1: '$p1' pcent2: '$p2'};
       }
    "
    line 'test1.cpp 0.00% of 21 0.00% of 16'
     title: 'test1.cpp' pcent1: '0.00% of 21' pcent2: '0.00% of 16'
    line 'test2.c None 16.53% of 484'
     title: 'test2.c' pcent1: 'None' pcent2: '16.53% of 484'
    line 'test3.h 0.00% of 138 None'
     title: 'test3.h' pcent1: '0.00% of 138' pcent2: 'None'
    line '/x/y/foo.c 0.00% of 1 None'
    bad line: '/x/y/foo.c 0.00% of 1 None' at -e line 1.

    Update: Changed code example to better demonstrate error handling.

Re: how to extract string by possible groupings?
by BillKSmith (Parson) on Jun 03, 2014 at 04:35 UTC
    I prefer to match each field separately.
    #!perl
    use strict;
    use warnings;
    *FILE_EXPECTED_RESULT = *DATA;
    while (<FILE_EXPECTED_RESULT>) {
        next if /^\s*$/;
        chomp;
        print "\n", $_ , "\n";
        my (@match) = /
            ( \w* \. (?: c|cpp|h ) )                  # File Name
            \s*
            ( None | \d{1,2}\.\d\d%\sof\s\d{1,3} )    # Percent 2
            \s*
            ( None | \d{1,2}\.\d\d%\sof\s\d{1,3} )    # Percent 3
        /xms;
        print "title : " . $match[0] . "\n";
        print "percent2 : " . $match[1] . "\n";
        print "percent3 : " . $match[2] . "\n";
    }
    close(FILE_EXPECTED_RESULT);
    __DATA__
    Title Percent2 Percent3
    test1.cpp 0.00% of 21 0.00% of 16
    test2.c None 16.53% of 484
    test3.h 0.00% of 138 None
    Bill
