PerlMonks  

Extracting string from a file

by Bindo (Acolyte)
on Nov 11, 2013 at 12:39 UTC (#1061986, perlquestion)
Bindo has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, good people. I have a log file with more than 10,000 lines, from which I want to extract every line starting with a certain word into an array and then print the values.

The following is the string I want to search for:

~|TOTAL 24.1% 0.4%

Kindly note that I want the word "Total" and those two numerical values stored in one element of the array as a single entry. (There are many entries starting with Total and I want all of them stored.) I'm finding it hard to come up with the regexes :( Anyway, the following is what I have tried. Pardon me, gentlemen, I'm a learner, so please try to help out.

my $SYS_HOME = "/x01/home";
my $LOG_FILE = $SYS_HOME."/ABC.log";

open (FH, '<', $LOG_FILE) || die "Cant open : $!";

while (<FH>) {
    chomp;
    my @FIGURE =~ /~|TOTAL|([\d]+/;
    foreach (@FIGURE) {
        print "$_\n";
    }
}

Many thanks in advance.

Re: Extracting string from a file
by RMGir (Prior) on Nov 11, 2013 at 13:02 UTC
    As it stands, I don't think your code even compiles, since you've got an unbalanced '(' in your regular expression. I'd start by fixing that.

    Second, '|' is a special character in regular expressions, meaning "this OR that". Since you want a literal '|', you need to precede it with a '\'.

    Next, you're using '|' again between TOTAL and your digits, while your example has spaces there. Which is it?

    '\d' is a character class by itself - it doesn't go in square brackets. But since you also need to match '.', use '[0-9.]'. Do you also need to match negative numbers? If so, that would be '[-0-9.]'.

    Finally, you're only capturing the first number, and you said you need both.

    Putting all of that together, please try this regular expression:

    /\|TOTAL\s+([0-9.]+)%\s+([0-9.]+)/
    I think that might work better for you....

    Test case:

    perl -e'$_="~|TOTAL 24.1% 0.4%"; /\|TOTAL\s+([0-9.]+)%\s+([0-9.]+)/ && print "Matched, $1 $2\n"'
    Matched, 24.1 0.4

    Mike
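
To illustrate the point above about '|', here is a small sketch comparing the unescaped and the escaped pattern. The sample lines are made up for illustration and are not taken from the real log file. Unescaped, /~|TOTAL/ reads as "a tilde OR the word TOTAL", so it also accepts lines that merely contain a tilde.

```perl
use strict;
use warnings;

# Hypothetical sample lines, not taken from the real log file.
my @lines = ('~|TOTAL 24.1% 0.4%', '~|not a total 11%', 'FOOBAR');

# Unescaped '|' is alternation: matches a '~' OR the text 'TOTAL'.
my @loose = grep { /~|TOTAL/ } @lines;

# Escaped '\|' is a literal pipe: matches only the text '~|TOTAL'.
my @exact = grep { /~\|TOTAL/ } @lines;

print "unescaped: ", scalar(@loose), " line(s) matched\n";
print "escaped:   ", scalar(@exact), " line(s) matched\n";
```

With these three sample lines the unescaped pattern accepts two of them, while the escaped one accepts only the genuine TOTAL line.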
      Not to quarrel, because your explanation of the regex problems is exemplary, but OP is clearly dealing with a multi-line logfile, in which some lines begin with ~|TOTAL. Hence, an array better matches the SOPW spec than the string in your "Test Case."

      Aside to Bindo: Your spec comes up a little short of perfection because (very strictly speaking and very nitpicky) there's no requirement -- merely a single illustration -- that what's captured be numeric followed by a percent sign. What if the notation were hex, binary or some sort of non-Arabic numbers? In any case, I've treated your spec as "any line that begins with tilde, pipe, 'TOTAL' followed by anything", which is the only reason my regex differs from RMGir's:

      #!/usr/bin/perl
      use 5.016;
      use warnings;

      #1061986

      my @logfile = ("~|TOTAL 24.1% 0.4%",
                     "~|not a total 11%",
                     "~|TOTAL 21.0% 0.7%",
                     "FOOBAR",
                     "~|TOTAL 13.7% 10.2%",
                     "~|TOTAL last5 6",
                    );
      my @FIGURE;

      for my $logentry(@logfile) {
          if ($logentry =~ /~\|(TOTAL.*)/ ) {
              push @FIGURE, $1;
          } else {
              say "\t \$logentry, $logentry, does not match pattern";
          }
      }

      for (@FIGURE) {
          say $_;
      }

      =head execution:
      C:\>1061986.pl
               $logentry, ~|not a total 11%, does not match pattern
               $logentry, FOOBAR, does not match pattern
      TOTAL 24.1% 0.4%
      TOTAL 21.0% 0.7%
      TOTAL 13.7% 10.2%
      TOTAL last5 6
      =cut

        Sir, can you please go through my latest reply at the end of the thread and advise? For some reason no one is replying. Thanks.

      A nitpick. The \d is a character class, and it MAY go into brackets, to be combined with the rest.
      The following are equivalent (for ASCII digits): /[0-9]/, /[[:digit:]]/, /[\d]/, /\d/.

      Your regex could also be written as: /\|TOTAL\s+([\d.]+)%\s+([\d.]+)/.
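
As a quick check of that equivalence, the sketch below runs three spellings of the character class against one sample value and prints what each one captures. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample value, chosen for illustration.
my $sample = '24.1';

# Three spellings of "a run of digits and dots".
my @patterns = (
    qr/([0-9.]+)/,
    qr/([[:digit:].]+)/,
    qr/([\d.]+)/,
);

# Apply each pattern and keep its first capture.
my @captures = map { $sample =~ $_ ? $1 : 'no match' } @patterns;
print join(', ', @captures), "\n";
```

All three patterns capture the same text for plain ASCII input like this. On Unicode strings, \d and [[:digit:]] can also match digits from other scripts, whereas [0-9] cannot.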

      Sir, can you please go through my latest reply at the end of the thread and advise? For some reason no one is replying. Thanks.

Re: Extracting string from a file
by builat (Scribe) on Nov 11, 2013 at 16:35 UTC
    One more remark.
    #!/usr/bin/perl
    use strict;
    use warnings;
    my @dump;
    my $false_count = 0;
    open (FH, "<file_name") || die "Cant open : $!";
    while (<FH>){
        if (/^.\|TOTAL.*$/i){
            my @tmp = $_ =~ /([0-9\.\-]+)/g;
            push @dump, "@tmp";
        }else{
            $false_count++;
        }
    }
    print 'Counted matches-> '.$#dump."\tUnmatched lines-> ".$false_count."\n";
    foreach (@dump){print $_."\n";}

      builat: Close but 'no cigar.' No downvote, but pls test your code before posting and thereby implying that it constitutes a correct answer.

      Sorry, but numerous minor problems, including unnecessary complication of your code and (not exactly minor) your use -- in your Ln 6, open (FH, "<file_name")... -- of data not shown or referenced in your post. I realize it may be the same as OP's, or mine, but if you don't show it or otherwise make it unambiguous, future readers can't be sure.

      Then there's an actual code problem: $#array does NOT count the elements in the array; it returns the last element's index. Since array indices start with 0, $#array is 1 less than the count of elements (or count of indices, if you prefer to think of it that way).
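
A minimal sketch of the difference, using made-up entries in place of real matched log lines:

```perl
use strict;
use warnings;

# Hypothetical entries standing in for matched log lines.
my @dump = ('TOTAL 24.1% 0.4%', 'TOTAL 21.0% 0.7%', 'TOTAL 13.7% 10.2%');

my $last_index = $#dump;         # index of the last element
my $count      = scalar @dump;   # number of elements

print "last index: $last_index, count: $count\n";

# For an empty array the last index is -1, while the count is 0.
my @empty;
my $empty_last  = $#empty;
my $empty_count = scalar @empty;
print "empty array: last index $empty_last, count $empty_count\n";
```

Using scalar(@dump) for the count avoids the off-by-one and also reads correctly for an empty array.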

      #!/usr/bin/perl
      use 5.016;
      use warnings;

      # 1062018 builat in same thread as #1061986

      my @dump;
      my $false_count = 0;

      while (<DATA>){
          chomp ($_);
          if ($_ =~ /~\|(TOTAL.*)/ ) {
              my $tmp = $1;
              push @dump, $tmp;
          } else {
              say "False: |\"$_\"| does not match pattern";
              $false_count++;
          }
      }
      say "\n\t DEBUG \$#dump: $#dump";
      say "\t NB: last index of the array and thus 1 less than the count of array elements!\n";

      say 'Counted matches-> '. ($#dump + 1) . "\tUnmatched lines-> " . $false_count;
      for (@dump){
          say $_."\n";
      }

      =head execution
      C:\> 1062018.pl
      False: |"~|first"| does not match pattern
      False: |"~|not a total 11%, "| does not match pattern
      False: |"FOOBAR, "| does not match pattern

               DEBUG $#dump: 3
               NB: last index of the array and thus 1 less than the count of array elements!

      Counted matches-> 4     Unmatched lines-> 3
      TOTAL 24.1% 0.4%,

      TOTAL 21.0% 0.7%,

      TOTAL 13.7% 10.2%,

      TOTAL last5 6

      =cut

      __DATA__
      ~|first
      ~|TOTAL 24.1% 0.4%,
      ~|not a total 11%,
      ~|TOTAL 21.0% 0.7%,
      FOOBAR,
      ~|TOTAL 13.7% 10.2%,
      ~|TOTAL last5 6

      I have a problem with your regex /^.\|TOTAL.*$/i, more specifically with the .*$ part of it.

      That part actually says "anything (even nothing) until the end of the string" and is therefore superfluous.

      Worse is that /^.\|TOTAL.*$/i will let through a line without any digits in it. Such a line pushes nothing useful (only an empty string) onto the @dump array, but the $false_count variable is not incremented either. Of course it is entirely possible that the file will always have digits on its "TOTAL" lines, but IMHO that is a dangerous assumption to make.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
        You are absolutely right. And yes I really started from the premise that any string beginning with ~ | TOTAL further contains a set of numbers. It would be better to add a check for the presence of numbers on the right side of expression. Thank you.
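
One way such a check could look is sketched below. It is only an illustration: the sample lines are invented, and the number pattern is an assumption that is stricter than the original /([0-9\.\-]+)/ (it requires at least one digit). A line counts as a match only when it has the TOTAL prefix and at least one number.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the second TOTAL line carries no digits.
my @lines = ('~|TOTAL 24.1% 0.4%', '~|TOTAL n/a', 'FOOBAR');

my @dump;
my $false_count = 0;

for my $line (@lines) {
    my @numbers;
    if ($line =~ /^~\|TOTAL/) {
        # Assumed number format: digits with an optional decimal part.
        @numbers = $line =~ /(\d+(?:\.\d+)?)/g;
    }
    if (@numbers) {
        push @dump, "@numbers";
    } else {
        $false_count++;    # no prefix, or a prefix without any numbers
    }
}

print 'Counted matches-> ', scalar(@dump),
      "\tUnmatched lines-> $false_count\n";
```

Here the digit-free TOTAL line is counted as unmatched instead of silently adding an empty entry.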
Re: Extracting string from a file
by sundialsvc4 (Monsignor) on Nov 12, 2013 at 16:23 UTC

    There is, of course, “more than one way to do it,™” but I think that the way that I would do it is to use the /g modifier as discussed in perldoc perlretut.

    Something like ... (caution... extemporaneous code; your syntax may vary)

    while (my $line = <FH>) {
        next unless $line =~ /^\~\|TOTAL/;
        my @percents = ( $line =~ /([\d\.]+)/g );
        .. do something with @percents ..
    }

    First, we ignore any lines outright which do not begin with the proper string ... notice the use of the "^" symbol to anchor to start-of-line, and the backslash-escaping of special symbols that otherwise would be taken as part of (ill-formed) regular expression syntax.

    Then, “the interesting bits” in the string are groupings of digits and decimal points, so we gather up as many of them as are present anywhere in the line. In list context (often loosely called “array context”), a /g match returns a list of all the values found, without needing an explicit loop, although we could certainly have looped instead by using the match in scalar context. Notice the use of parentheses to indicate a substring that we wish to extract.
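
The two styles can be compared side by side in a short sketch. The sample line follows the format shown in the question and is otherwise an assumption.

```perl
use strict;
use warnings;

# Hypothetical sample line in the format shown in the question.
my $line = '~|TOTAL 24.1% 0.4%';

# List context: a single statement returns every captured value.
my @percents = ( $line =~ /([\d.]+)/g );
print "list context:   @percents\n";

# Scalar context: each successful match resumes where the last one ended.
my @collected;
while ( $line =~ /([\d.]+)/g ) {
    push @collected, $1;
}
print "scalar context: @collected\n";
```

Both approaches collect the same values. The loop form is mainly useful when each match needs extra processing as it is found.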

Re: Extracting string from a file
by Bindo (Acolyte) on Nov 19, 2013 at 02:11 UTC

    Thank you very much for all the good advice, gentlemen. I guess I owe you all a big apology since I have failed to reply. I was in the hospital due to a small accident and was only discharged last night. Now I'm back on my feet :)

    I tried the following code, but the program gives neither output nor errors. Can one of you please correct this code for me? Please, gentlemen, I'm a beginner who is trying to understand the whole concept of regexes, more specifically with files, so be gentle.

    my $SYS_HOME = $ENV{'SYSTEM_HOME'};
    my $GD_FILE = $SYS_HOME."/GD.log";
    my $FH;
    my @DUMP;

    open ($FH, '<', $GD_FILE) || die "Cant open : $!";

    while (my $line = $FH) {
        if ($line =~ /~\|(TOTAL.*)/){
            #my $tmp = $1;
            push @DUMP, $1;
            foreach (@DUMP) {
                print "$_\n";
            }
        }
    }

    Many thanks in advance! /Bindo

      It's a simple bug - you're copying the file handle rather than reading from it.
      while (my $line = $FH) {
      should instead be
      while (my $line = <$FH>) {
      The "<>" around $FH reads from the filehandle (a line at a time, in this context).
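
For reference, a self-contained sketch of the corrected loop follows. To keep it runnable without GD.log, it reads from an in-memory filehandle holding made-up log contents, which is an assumption for illustration only. It also moves the printing out of the loop, since printing inside the loop would repeat earlier entries each time a new match is found.

```perl
use strict;
use warnings;

# Hypothetical log contents; an in-memory filehandle stands in for GD.log.
my $log = "~|TOTAL 24.1% 0.4%\nFOOBAR\n~|TOTAL 21.0% 0.7%\n";
open(my $fh, '<', \$log) or die "Can't open in-memory file: $!";

my @dump;
while (my $line = <$fh>) {    # <$fh> reads one line per iteration
    chomp $line;
    push @dump, $1 if $line =~ /~\|(TOTAL.*)/;
}
close $fh;

# Print once, after the loop, so each entry appears a single time.
print "$_\n" for @dump;
```

Without the angle brackets, $line is assigned the filehandle itself on every pass, which is always true and never matches the pattern, so the original loop runs forever without printing anything.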

      Mike

        Thank you very much Mike. It worked. :) Really appreciate it.

Re: Extracting string from a file
by pvaldes (Chaplain) on Nov 23, 2013 at 02:00 UTC

    I have a log file with more than 10000 lines

    That is not too many lines, but your goal should probably be to discard the unnecessary lines as soon as you can. You are running the loop "check every line against every regex, and discard it only if all of them fail". You could consider this instead: "skip to the next line unless the first character is '\|' or whatever I'm expecting", and only then take a closer look at the rest of the line. If you are looking for exactly "horse in a meadow" and the first letter is a "p", you don't need to look any further. Next line.

    You can also weed out your file with grep first. Handle the most common group of lines first (whether they are positives or negatives for your match), and then use regexes for the difficult and rare cases. Complicated regexes are expensive.
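
A rough sketch of that "reject early" idea, assuming made-up sample lines and the two-percentage format from the question: a fixed-string test with index() discards lines without the prefix, and only the survivors reach the capturing regex. Whether this is measurably faster than a single anchored regex is an assumption worth benchmarking; for a file of this size the difference is likely to be small.

```perl
use strict;
use warnings;

# Hypothetical sample lines.
my @lines = ('~|TOTAL 24.1% 0.4%', 'FOOBAR', '~|not a total 11%');

my @totals;
for my $line (@lines) {
    # Cheap fixed-string test first: skip lines without the prefix.
    next unless index($line, '~|TOTAL') == 0;

    # Only the surviving lines pay for the capturing regex.
    push @totals, "TOTAL $1 $2"
        if $line =~ /^~\|TOTAL\s+([0-9.]+)%\s+([0-9.]+)%/;
}

print "$_\n" for @totals;
```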
