
regular expressions. help

by apocalyptica (Acolyte)
on Jun 29, 2004 at 19:05 UTC ( #370582=perlquestion )

apocalyptica has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I have a large text file that I am trying to get some data out of. Each line that has the particular type of data I need starts with HISTOGRAM OF, and they all look something like this:
HISTOGRAM OF * gpa      * ( 226)    GROUPED BY * deprint   *
What I need to get out of there is the gpa (or whatever the particular value is in its place). To do so, I have tried this:
#!/usr/local/bin/perl
$evalme = q[
while(<>) {
    s/^ //;
    if(/^HISTOGRAM OF(\w+)$/) {
        print "in loop\n";
    }
];
With the simple idea of having it print "in loop" when it matches this. (Obviously, this is just a test scenario. At least, I hope it's obvious.) It's not matching anything, though. Presumably, this is because I am terrible at regular expressions. I have tried countless variations on this theme, to no avail. Could someone please point me in the right direction here? Thanks!

Replies are listed 'Best First'.
Re: regular expressions. help
by gaal (Parson) on Jun 29, 2004 at 19:35 UTC

    First of all, is that assignment to $evalme on purpose? Are you in fact evaling the variable after this?

    Second, you don't need the s/^ //; line. What it does is remove a single space from the beginning of the line. But if you want to ignore one leading space, you may as well ignore any leading whitespace:

    if (/^\s*HISTOGRAM of re.../)

    In fact, it may (or may not, depending on how well-formed your data is) be reasonable to just drop the ^ anchor.

    Finally, for the reason why this regexp fails. You have "OF(\w+)$", which reads "the word characters that continuously occupy from immediately after the letters "OF" to the end of the line". This must fail because you have non-word characters in the rest of your line, indeed you have a space *immediately* after "OF"!

    I couldn't tell whether you're looking for the word that comes between the next two asterisks ("gpa" in this example) or whether the next text inside parentheses (" 226" in this example) is what you want to capture. If the former, the following should work. I'm using extended regexp syntax for added readability:

    m{
        HISTOGRAM \s+ OF \s+
        \* \s*        # a literal "*", escaped because * is a metacharacter
        ([^*]+?)      # (capture) anything that isn't a "*", nongreedy
        \s* \*
    }x;

    The group "([^*]+?)" is "nongreedy", which means that (since it is followed by \s*, whitespace) it will not include any trailing whitespace between the word and the following asterisk.
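
    As a quick check, the extended pattern above can be run against the sample line from the question. This is only a sketch: the variable names $line and $name are made up for illustration, and it assumes the real data has the same shape as the one sample line.

```perl
use strict;
use warnings;

# Sample line copied from the question.
my $line = 'HISTOGRAM OF * gpa      * ( 226)    GROUPED BY * deprint   *';

my $name;   # hypothetical variable name, not from the original program
if ($line =~ m{
        HISTOGRAM \s+ OF \s+
        \* \s*        # a literal asterisk, escaped because it is a metacharacter
        ([^*]+?)      # capture anything that is not an asterisk, non-greedy
        \s* \*
    }x) {
    $name = $1;
    print "captured: [$name]\n";
}
```

    The brackets in the output make it easy to see that no trailing whitespace ended up in the capture.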

Re: regular expressions. help
by matija (Priest) on Jun 29, 2004 at 19:26 UTC
    Of course it's not matching: there's an asterisk in the way. You should change it to: /^HISTOGRAM OF\s*\*\s*(\w+)$/
      Hmmm... That looks like it should be correct, yes (like I said, I'm no good at regular expressions, but it looks right to me), but it's still not working. Let me just post the whole stupid program to give you an idea what I am trying to do:
      #!/usr/local/bin/perl
      $fl = '-?\d+\.\d+';
      $evalme = q[
      while(<>) {
          s/^ //;
          if(^HISTOGRAM OF\s*\*\s*(\w+)$/) {
              printf ("In loop.\n"); #just here for testing purposes.
              write if $header;
              undef($cache);
              $header=$1;
              $varnum=$2;
          }
          if($header) {
      ];
      eval <<EOM;
      $evalme
              (\$meanH, \$usersH) = (\$1, \$2) if /^GROUP\\s+(\\S+)\\s+(\\S+)/;
              (\$mean, \$users) = (\$1, \$2) if /^MEAN\\s+(${fl})\\s+(${fl})/;
              \$levene = \$1 if /\\s+VARIABILITY\\s+${fl}\\s+(${fl})/;
              \$pooled = \$1 if /\\s+POOLED T\\s+${fl}\\s+(${fl})/;
              \$separate = \$1 if /\\s+SEPARATE T\\s+${fl}\\s+(${fl})/;
              \$mann = \$1 if /\\s+MANN-WHIT.\\s+${fl}\\s+(${fl})/;
          }
      }
      EOM
      write STDOUT;

      format STDOUT_TOP =
      | @|||| | @|||| | Levene-P | Pooled-P | Mann-P | Separate
      $meanH, $usersH
      ----------+----------+----------+----------+----------+----------+----------
      .
      format STDOUT =
      @<<<<<<<< | @##.#### | @##.#### | @##.#### | @##.#### | @##.#### | @##.####
      $header, $mean, $users, $levene, $pooled, $mann, $separate
      ----------+----------+----------+----------+----------+----------+----------
      .
      It reads through the input file until it finds HISTOGRAM OF and then begins pulling out the data as per above. Does any of it work? Well, I don't know, I still can't get this one stupid thing to work.
        The (\w+)$ is killing you again. You match 'HISTOGRAM OF', whitespace, asterisk, whitespace, but the rest of your string is not all \w (word chars), and since you added the '$' to match until the end, the \w+ fails to match when it hits whitespace again.

        I cannot stress enough to regex learners that whitespace NEEDS to be treated like all other characters.
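
        To make the point concrete, here is a small sketch that runs three patterns against the sample line from the question: the original one, the one with the asterisk added, and a variant with the trailing '$' dropped. The third pattern and the labels are illustrative additions, not something posted in this thread, and the sketch assumes the real lines look like the sample.

```perl
use strict;
use warnings;

my $line = 'HISTOGRAM OF * gpa      * ( 226)    GROUPED BY * deprint   *';

# Two patterns from the thread, plus one variant without the end anchor.
my @patterns = (
    [ 'original',       qr/^HISTOGRAM OF(\w+)$/ ],
    [ 'asterisk added', qr/^HISTOGRAM OF\s*\*\s*(\w+)$/ ],
    [ 'no end anchor',  qr/^HISTOGRAM OF\s*\*\s*(\w+)/ ],
);

my %result;   # label => captured text, or undef when the pattern fails
for my $p (@patterns) {
    my ($label, $re) = @$p;
    $result{$label} = ($line =~ $re) ? $1 : undef;
    printf "%-15s %s\n", $label,
        defined($result{$label}) ? "matched, \$1 = $result{$label}" : 'no match';
}
```

        Only the last pattern matches: once \w+ is no longer forced to run all the way to the end of the line, it can stop at the whitespace after the word.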
Re: regular expressions. help
by pzbagel (Chaplain) on Jun 29, 2004 at 19:42 UTC
    If it's always the 4th field, use split.
    my $data=(split())[3];
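
    A minimal sketch of the split approach, assuming the value is always the fourth whitespace-separated field. The second test line, with one leading space, is an assumption based on the s/^ //; in the original program; the array names are made up for illustration.

```perl
use strict;
use warnings;

my @lines = (
    'HISTOGRAM OF * gpa      * ( 226)    GROUPED BY * deprint   *',
    ' HISTOGRAM OF * gpa      * ( 226)    GROUPED BY * deprint   *',  # one leading space (assumed)
);

my @fields;   # hypothetical name: the fourth field of each line
for (@lines) {
    # split with no arguments splits $_ on whitespace and skips leading whitespace
    my $data = (split)[3];
    push @fields, $data;
    print "$data\n";
}
```

    Because split with no arguments ignores leading whitespace, the field index stays the same whether or not the line starts with a space.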
Re: regular expressions. help
by shemp (Deacon) on Jun 29, 2004 at 19:29 UTC
    Your regex in the if() has problems. \w represents 'word characters': letters, numbers, and underscore. Your regex is looking for a \w+ immediately after the 'OF'. You need to account for the whitespace in your regex. I'm not exactly sure what all of your data will look like, but if the value you are looking for is the only thing in parentheses, you could:
    if ( /\(([^)]+)\)/ ) { $thing = $1; }
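
    Here is a sketch of that idea run against the sample line. It differs slightly from the one-liner above: the added \s* on each side and the non-greedy quantifier are an illustrative tweak to keep the padding inside the parentheses out of the capture.

```perl
use strict;
use warnings;

my $line = 'HISTOGRAM OF * gpa      * ( 226)    GROUPED BY * deprint   *';

my $thing;
# Literal "(", optional padding, the captured text, optional padding, literal ")".
if ($line =~ /\(\s*([^)]+?)\s*\)/) {
    $thing = $1;
}
print "[$thing]\n" if defined $thing;
```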
Re: regular expressions. help
by NovMonk (Chaplain) on Jun 29, 2004 at 19:27 UTC
    Let me see if I understand what you want-- if the line begins with "HISTOGRAM OF" you want to print the words "in loop"? Why would you need anything more than this for the match:

     if (/^HISTOGRAM OF/){...etc}?

    As to what you're doing with it, you could get at the gpa value by splitting the data on the white space and/or the asterisk. I'm not sure what you're after, but you could make an array of the gpa values that way and use them.

    That's how I'd start anyway. Hope this is helpful. Good luck.
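
    Something along these lines might do as a starting point. It is only a sketch: the first input line is from the question, the other two are invented for illustration, and it assumes the value is always the third field once whitespace and asterisks are treated as separators.

```perl
use strict;
use warnings;

# Hypothetical input: only the first line comes from the question.
my @input = (
    'HISTOGRAM OF * gpa      * ( 226)    GROUPED BY * deprint   *',
    'MEAN  3.1000  2.9000',
    'HISTOGRAM OF * satv     * ( 226)    GROUPED BY * deprint   *',
);

my @names;
for my $line (@input) {
    # Drop leading whitespace so the first field is never empty.
    (my $trimmed = $line) =~ s/^\s+//;
    next unless $trimmed =~ /^HISTOGRAM OF/;

    # Split on runs of whitespace and/or asterisks.
    my @fields = split /[\s*]+/, $trimmed;
    push @names, $fields[2];
}
print "@names\n";
```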



      Well, what I'm doing is taking the value of the data in the fourth field (in this example, it is gpa, but it could be a whole host of random letters strung together) and putting it into a variable. But, that isn't the problem I'm having right now -- the problem is getting the blasted thing to match and acknowledge that there is anything there.

        When trying to construct an re to match something, perl -de 0 can be very helpful. Set $_ to your sample data and then you can iteratively construct your re with x /blah/ (seeing if it matches and what matches at each step).
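
        If an interactive debugger session is not convenient, the same step-by-step idea can be approximated in a small script. This is a rough sketch, not what the debugger does: the way the pattern is cut into pieces here is an arbitrary choice for illustration, and the loop stops at the first piece that breaks the match.

```perl
use strict;
use warnings;

my $sample = 'HISTOGRAM OF * gpa      * ( 226)    GROUPED BY * deprint   *';

# Grow the pattern one piece at a time and report after each step.
my @pieces = ('^HISTOGRAM', '\s+OF', '\s*\*', '\s*(\w+)');

my $pattern = '';
my @report;   # one [pattern, matched text, capture] entry per step
for my $piece (@pieces) {
    $pattern .= $piece;
    if ($sample =~ /$pattern/) {
        my $capture = defined($1) ? $1 : '';
        push @report, [$pattern, $&, $capture];
        print "ok:   $pattern\n      matched '$&'",
            (length($capture) ? ", \$1 = '$capture'" : ''), "\n";
    } else {
        push @report, [$pattern, undef, undef];
        print "FAIL: $pattern\n";
        last;
    }
}
```

        Each step shows exactly how much of the line the pattern consumed, which makes it easier to spot the piece that stops matching.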

Re: regular expressions. help
by ercparker (Hermit) on Jun 29, 2004 at 23:07 UTC
    this worked for me: /^HISTOGRAM OF\s+?\*\s+(\w+)\s+/

Node Type: perlquestion [id://370582]
Approved by NovMonk