PerlMonks  

perl regex extraction from a large input file

by Anonymous Monk
on Sep 13, 2013 at 21:55 UTC (#1054023)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a large input file in the form shown below. I am developing a framework that extracts the number of passed and failed test cases. If a test case is marked "***Passed***", its Case-Url and Req-URL should be listed with the passed test cases; similarly, each failed test case should be listed with its own Case-Url and Req-URL. Please help me work out how to extract that.

Sample input file:

    Execution start time 09/13/2013 02:43:55 pm
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Passed***
    __________________________________________________________
    [Case-Url] - www.yahoo.com
    [Req-URL ] - www.msn.com
    ***Passed***
    ___________________________________________________________
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Failed***

Re: perl regex extraction from a large input file
by toolic (Chancellor) on Sep 13, 2013 at 23:47 UTC
    • Read perlintro.
    • Write some code.
    • Post again when you have more specific questions. Make sure to post your code with your expected output.
      What else do you expect ?

        What else do you expect ?

        Maybe he expects you to put in some effort and give it a try, because PerlMonks is not a code-writing service.

Re: perl regex extraction from a large input file
by smls (Friar) on Sep 14, 2013 at 01:57 UTC

    This is how I would do it:

    1. Write a regex that matches every possible "test-case" block, i.e. a block of the form:
      [Case-Url] - www.google.com [Req-URL ] - www.qtp.com ***Passed***

    2. Add parentheses to the regex to capture the parts that should be extracted (as described in the Extracting matches section of perlretut).
      (In this example block, the parts to extract would be "www.google.com", "www.qtp.com" and "Passed")

    3. Apply the regex to the input string repeatedly, using a while loop and the /.../g regex construct (as described in the Global matching section of perlretut).
      while ($input =~ /YOUR_REGEX_GOES_HERE/gs) {
          print "Extracted values $1, $2, $3\n";
      }

    4. Inside the while loop, do whatever it is you want to do with the extracted values.
      If you want to collect a list of all successful tests and another list of all failed tests, define two corresponding arrays above the while loop, and then inside the loop add an entry to the right array on each iteration.
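
The four steps above can be sketched as follows. This is only one possible reading of them, not smls's own code: the helper name extract_cases, the exact regex, and the case/req hash keys are assumptions made for illustration, and the regex assumes each block looks like the sample in the question.

```perl
use strict;
use warnings;

# Steps 1 and 2: a regex for one test-case block, with three capture groups.
# Assumption: whitespace inside the brackets and around the dash may vary.
my $block_re = qr{
    \[Case-Url\s*\] \s* - \s* (\S+) \s+      # capture 1: the case URL
    \[Req-URL\s*\]  \s* - \s* (\S+) \s+      # capture 2: the request URL
    \*\*\* (Passed|Failed) \*\*\*            # capture 3: the status
}x;

# Hypothetical helper: returns two array refs (passed, failed).
# Each entry is a hash ref holding the case URL and the request URL.
sub extract_cases {
    my ($input) = @_;
    my (@passed, @failed);

    # Step 3: apply the regex repeatedly with /g in a while loop.
    while ($input =~ /$block_re/g) {
        my ($case, $req, $status) = ($1, $2, $3);

        # Step 4: add the entry to the array that matches the status.
        my $entry = { case => $case, req => $req };
        if ($status eq 'Passed') { push @passed, $entry }
        else                     { push @failed, $entry }
    }
    return (\@passed, \@failed);
}

my $sample = <<'END';
Execution start time 09/13/2013 02:43:55 pm
[Case-Url] - www.google.com
[Req-URL ] - www.qtp.com
***Passed***
__________________________________________________________
[Case-Url] - www.yahoo.com
[Req-URL ] - www.msn.com
***Passed***
___________________________________________________________
[Case-Url] - www.google.com
[Req-URL ] - www.qtp.com
***Failed***
END

my ($passed, $failed) = extract_cases($sample);
printf "Passed: %d, Failed: %d\n", scalar @$passed, scalar @$failed;
print "Passed: $_->{case} / $_->{req}\n" for @$passed;
print "Failed: $_->{case} / $_->{req}\n" for @$failed;
```

For the sample input this reports two passed cases and one failed case. Note that this approach needs the whole input in one string, which may not suit a very large file.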

    If you run into further problems, report back with the regex/code you've written so far.

Re: perl regex extraction from a large input file
by CountZero (Bishop) on Sep 14, 2013 at 09:20 UTC
    Although PerlMonks is not a code-writing service, sometimes it is just easier to explain what you have to do by simply writing the code, especially when your requirements are not entirely clear.

    As I understand it, you have one large file with data in a certain format, which you have to parse to get some kind of summarized results, i.e. the number of passed and failed "Case-URL" and "Req-URL" items.

    Every time you hear "large file", you should think of reading/parsing/handling the file on a record-by-record basis. That will minimize your memory requirements.

    It also means that you have to determine the record format and, in particular, the record delimiter. Sometimes the record delimiter can be as simple as a CR/LF, sometimes it is longer. In this case it is "__________________________________________________________\n".

    What you have to do is read the file record by record. Fortunately, Perl can do that easily: all you have to do is assign the record delimiter to the $/ variable.

    Then you can read the file a record at a time and through the magic of regular expressions extract the data you need and update the variables with the count of the data found.
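
To make the mechanics of $/ concrete, here is a minimal sketch that is separate from the full solution. The helper name read_records and the in-memory filehandle are assumptions for illustration, and the separator length (58 underscores) should be adjusted to whatever the real file uses. It relies on two documented behaviours: a read with <$fh> returns text up to and including the current $/ value, and chomp removes a trailing $/ value.

```perl
use strict;
use warnings;

# Hypothetical helper: returns all records from a filehandle, where records
# are separated by the given delimiter string.
sub read_records {
    my ($fh, $delimiter) = @_;
    local $/ = $delimiter;      # change the record separator for this scope only
    my @records;
    while (my $record = <$fh>) {
        chomp $record;          # strips the trailing delimiter, if present
        push @records, $record;
    }
    return @records;
}

# An in-memory filehandle stands in for the large file here.
my $separator = ('_' x 58) . "\n";    # assumed separator length
my $text      = "first\nrecord\n" . $separator . "second\nrecord\n";

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @records = read_records($fh, $separator);
close $fh;

print scalar(@records), " records read\n";
```

Because only one record is held in memory at a time inside the loop, the same loop body can process each record directly instead of collecting them all in an array.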

    The following is one of the ways to do this:

    use Modern::Perl;
    use Data::Dump qw/dump/;

    local $/ = "__________________________________________________________\n";
    my %results;
    while ( my $record = <DATA> ) {
        my $pass = $record =~ m/\Q***Passed***\E/ ? 'Passed' : 'Failed';
        for my $line ( split /\n/, $record ) {
            next unless $line =~ /^\[/;
            my ( $case_req, $url ) = split /\s+-\s+/, $line;
            $results{$pass}{$case_req}{$url}++;
        }
    }
    say dump(%results);

    __DATA__
    Execution start time 09/13/2013 02:43:55 pm
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Passed***
    __________________________________________________________
    [Case-Url] - www.yahoo.com
    [Req-URL ] - www.msn.com
    ***Passed***
    __________________________________________________________
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Failed***

    Output:
    (
      "Passed",
      {
        "[Case-Url]" => { "www.google.com" => 1, "www.yahoo.com" => 1 },
        "[Req-URL ]" => { "www.msn.com" => 1, "www.qtp.com" => 1 },
      },
      "Failed",
      {
        "[Case-Url]" => { "www.google.com" => 1 },
        "[Req-URL ]" => { "www.qtp.com" => 1 },
      },
    )

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      The way you store the "Case-Url" and the "Req-URL" as hashes causes the loss of the link between the two. I don't know whether that is important or not. A rather simple approach should do in this case:

      while (<DATA>) {
          print "$1 - " if /\[.*\] - (.*)/i;
          print "$1\n" if /[*]+([^*]+)/;
      }
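
If the link between the two URLs does matter, one way to keep it while still grouping by result is sketched below. This is an illustrative variant, not code from the thread: the helper name group_pairs and the [case, req] pair layout are assumptions, and it assumes the status line always follows its Case-Url and Req-URL lines, as in the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: reads a filehandle line by line and returns a hash ref
# mapping each status ("Passed"/"Failed") to a list of [case, req] pairs.
sub group_pairs {
    my ($fh) = @_;
    my (%by_status, $case, $req);
    while (my $line = <$fh>) {
        if    ($line =~ /^\s*\[Case-Url\s*\]\s*-\s*(\S+)/) { $case = $1 }
        elsif ($line =~ /^\s*\[Req-URL\s*\]\s*-\s*(\S+)/)  { $req  = $1 }
        elsif ($line =~ /^\s*\*\*\*(Passed|Failed)\*\*\*/) {
            # The status line closes a block: store both URLs together.
            push @{ $by_status{$1} }, [ $case, $req ];
            ($case, $req) = (undef, undef);    # reset for the next block
        }
    }
    return \%by_status;
}

my $sample = <<'END';
Execution start time 09/13/2013 02:43:55 pm
[Case-Url] - www.google.com
[Req-URL ] - www.qtp.com
***Passed***
__________________________________________________________
[Case-Url] - www.yahoo.com
[Req-URL ] - www.msn.com
***Passed***
___________________________________________________________
[Case-Url] - www.google.com
[Req-URL ] - www.qtp.com
***Failed***
END

# An in-memory filehandle stands in for the real input file.
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my $groups = group_pairs($in);
close $in;

for my $status (sort keys %$groups) {
    my @pairs = @{ $groups->{$status} };
    printf "%s: %d\n", $status, scalar @pairs;
    printf "  case %s, req %s\n", @$_ for @pairs;
}
```

Since it reads one line at a time, this also keeps memory use low for a large file; only the extracted URL pairs are retained.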
        Yes indeed. That is the problem with "fuzzy" requirements: there may (or may not) be certain data elements that need to be taken into account.

        CountZero

Re: perl regex extraction from a large input file
by TJPride (Pilgrim) on Sep 14, 2013 at 19:05 UTC
    I have no idea what structure you're trying to achieve here. Assuming you want counts of passed/failed case and req URLs:
    use Data::Dumper;
    use strict;
    use warnings;

    my (%counts, $case, $req);

    while (<DATA>) {
        chomp;
        if (m/\Q[Case-Url] - \E(.*)/) {
            $case = $1;
        }
        elsif (m/\Q[Req-URL ] - \E(.*)/) {
            $req = $1;
        }
        if (m/\*\*\*(Passed|Failed)\*\*\*/) {
            $counts{$1}{'case'}{$case}++;
            $counts{$1}{'req'}{$req}++;
        }
    }

    print Dumper(\%counts);

    __DATA__
    Execution start time 09/13/2013 02:43:55 pm
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Passed***
    __________________________________________________________
    [Case-Url] - www.yahoo.com
    [Req-URL ] - www.msn.com
    ***Passed***
    ___________________________________________________________
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Failed***

    Output:
    $VAR1 = {
        'Passed' => {
            'req' => {
                'www.qtp.com' => 1,
                'www.msn.com' => 1
            },
            'case' => {
                'www.google.com' => 1,
                'www.yahoo.com' => 1
            }
        },
        'Failed' => {
            'req' => {
                'www.qtp.com' => 1
            },
            'case' => {
                'www.google.com' => 1
            }
        }
    };
