PerlMonks  

perl regex extraction from a large input file

by Anonymous Monk
on Sep 13, 2013 at 21:55 UTC [id://1054023]

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a large input file of the form shown below. I am developing a framework that should extract the number of Passed and Failed test cases. If a test case is "***Passed***", it should report the Case-Url and Req-URL of that passed test case, and likewise each failed test case should be reported with its own Case-Url and Req-URL. Please help me figure out how to extract that.

Sample input file:

    Execution start time 09/13/2013 02:43:55 pm
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Passed***
    __________________________________________________________
    [Case-Url] - www.yahoo.com
    [Req-URL ] - www.msn.com
    ***Passed***
    ___________________________________________________________
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Failed***

Replies are listed 'Best First'.
Re: perl regex extraction from a large input file
by toolic (Bishop) on Sep 13, 2013 at 23:47 UTC
    • Read perlintro.
    • Write some code.
    • Post again when you have more specific questions. Make sure to post your code with your expected output.
      What else do you expect ?

        What else do you expect ?

        Maybe he expects you to put forth some effort and give it a try, because PerlMonks is not a code-writing service.

Re: perl regex extraction from a large input file
by CountZero (Bishop) on Sep 14, 2013 at 09:20 UTC
    Although Perlmonks is not a code writing service, sometimes it is just easier to explain what you have to do by simply writing the code, especially when your requirements are not entirely clear.

    I understand that you have one large file with data in a certain format, which you have to parse in order to get some kind of summarized result, i.e. the number of passed and failed "Case-Url" and "Req-URL" items.

    Every time you hear "large file" you should think of reading/parsing/handling the file on a record-by-record basis. That will minimize your memory requirements.

    It also means that you have to determine the record format, in particular the record delimiter. Sometimes the record delimiter is as simple as a CR/LF; sometimes it is longer. In this case it is "__________________________________________________________\n"

    What you have to do is read the file record-by-record. Fortunately Perl can do that easily: all you have to do is tell Perl what the record delimiter is by assigning it to the $/ variable.

    Then you can read the file a record at a time and through the magic of regular expressions extract the data you need and update the variables with the count of the data found.

    The following is one of the ways to do this:

    use Modern::Perl;
    use Data::Dump qw/dump/;

    local $/ = "__________________________________________________________\n";

    my %results;
    while ( my $record = <DATA> ) {
        my $pass = $record =~ m/\Q***Passed***\E/ ? 'Passed' : 'Failed';
        for my $line ( split /\n/, $record ) {
            next unless $line =~ /^\[/;
            my ( $case_req, $url ) = split /\s+-\s+/, $line;
            $results{$pass}{$case_req}{$url}++;
        }
    }
    say dump(%results);

    __DATA__
    Execution start time 09/13/2013 02:43:55 pm
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Passed***
    __________________________________________________________
    [Case-Url] - www.yahoo.com
    [Req-URL ] - www.msn.com
    ***Passed***
    __________________________________________________________
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Failed***
    Output:
    (
      "Passed",
      {
        "[Case-Url]" => { "www.google.com" => 1, "www.yahoo.com" => 1 },
        "[Req-URL ]" => { "www.msn.com" => 1, "www.qtp.com" => 1 },
      },
      "Failed",
      {
        "[Case-Url]" => { "www.google.com" => 1 },
        "[Req-URL ]" => { "www.qtp.com" => 1 },
      },
    )

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      The way you store the "Case-Url" and the "Req-URL" as hashes causes the loss of the link between the two. I don't know whether that is important or not. A rather simple approach should do in this case:

      while (<DATA>) {
          print "$1 - " if /\[.*\] - (.*)/i;
          print "$1\n"  if /[*]+([^*]+)/;
      }
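A sketch of one way to keep each record's Case-Url and Req-URL linked: store one small hash per test case in a per-status array. The data structure is my own choice, not taken from any reply here; the sample input is built in memory and read through a scalar filehandle so the record-separator ($/) approach from earlier in the thread still applies.

```perl
use strict;
use warnings;

# Build the OP's sample input in memory; the separator is generated
# programmatically so it is guaranteed to match $/ below.
my $sep    = ( '_' x 58 ) . "\n";
my @blocks = (
    "[Case-Url] - www.google.com\n[Req-URL ] - www.qtp.com\n***Passed***\n",
    "[Case-Url] - www.yahoo.com\n[Req-URL ] - www.msn.com\n***Passed***\n",
    "[Case-Url] - www.google.com\n[Req-URL ] - www.qtp.com\n***Failed***\n",
);
my $input = "Execution start time 09/13/2013 02:43:55 pm\n" . join $sep, @blocks;

# Read the string record-by-record via a scalar filehandle.
open my $fh, '<', \$input or die $!;
local $/ = $sep;

my %results;
while ( my $record = <$fh> ) {
    my ($case)   = $record =~ /\[Case-Url\]\s*-\s*(\S+)/;
    my ($req)    = $record =~ /\[Req-URL\s*\]\s*-\s*(\S+)/;
    my ($status) = $record =~ /\*{3}(Passed|Failed)\*{3}/;
    next unless defined $status;

    # One hash per test case keeps the Case-Url/Req-URL pair together.
    push @{ $results{$status} }, { case => $case, req => $req };
}

printf "Passed: %d, Failed: %d\n",
    scalar @{ $results{Passed} || [] },
    scalar @{ $results{Failed} || [] };
```

On the sample data this prints "Passed: 2, Failed: 1", and each entry in %results still knows which Req-URL belonged to which Case-Url.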
        Yes indeed. That is the problem with "fuzzy" requirements: there may (or may not) be certain data elements that need to be taken into account.

        CountZero

Re: perl regex extraction from a large input file
by smls (Friar) on Sep 14, 2013 at 01:57 UTC

    This is how I would do it:

    1. Write a regex that matches every possible "test-case" block, i.e. a block of the form:
      [Case-Url] - www.google.com
      [Req-URL ] - www.qtp.com
      ***Passed***

    2. Add parentheses to the regex to capture the parts that should be extracted (as described in the Extracting matches section of perlretut).
      (In this example block, the parts to extract would be "www.google.com", "www.qtp.com" and "Passed")

    3. Apply the regex to the input string repeatedly, using a while loop and the /.../g regex construct (as described in the Global matching section of perlretut).
      while ($input =~ /YOUR_REGEX_GOES_HERE/gs) {
          print "Extracted values $1, $2, $3\n";
      }

    4. Inside the while loop, do whatever it is you want to do with the extracted values.
      If you want to collect a list of all successful tests and another list of all failed tests, define two corresponding arrays above the while loop, and then inside the loop add an entry to the right array on each iteration.
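Putting the four steps together, a minimal sketch: the exact regex and the two-array bookkeeping are one possible reading of the requirements, not the only one, and the input is inlined as a string for demonstration.

```perl
use strict;
use warnings;

# The OP's sample input, inlined for demonstration.
my $input = <<'END';
Execution start time 09/13/2013 02:43:55 pm
[Case-Url] - www.google.com
[Req-URL ] - www.qtp.com
***Passed***
__________________________________________________________
[Case-Url] - www.yahoo.com
[Req-URL ] - www.msn.com
***Passed***
__________________________________________________________
[Case-Url] - www.google.com
[Req-URL ] - www.qtp.com
***Failed***
END

my ( @passed, @failed );

# Steps 1-3: one regex per test-case block, with three capture groups,
# applied repeatedly via /g (/s lets the match span newlines).
while ( $input =~
    /\[Case-Url\]\s*-\s*(\S+)\s*\[Req-URL\s*\]\s*-\s*(\S+)\s*\*{3}(Passed|Failed)\*{3}/gs )
{
    # Step 4: sort each extracted (case, req) pair into the right array.
    my ( $case, $req, $status ) = ( $1, $2, $3 );
    push @{ $status eq 'Passed' ? \@passed : \@failed }, [ $case, $req ];
}

printf "%d passed, %d failed\n", scalar @passed, scalar @failed;
```

On the sample data this prints "2 passed, 1 failed".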

    If you run into further problems, report back with the regex/code you've written so far.

Re: perl regex extraction from a large input file
by TJPride (Pilgrim) on Sep 14, 2013 at 19:05 UTC
    I have no idea what structure you're trying to achieve here. Assuming you want counts of passed/failed Case and Req URLs:
    use Data::Dumper;
    use strict;
    use warnings;

    my (%counts, $case, $req);
    while (<DATA>) {
        chomp;
        if (m/\Q[Case-Url] - \E(.*)/) {
            $case = $1;
        }
        elsif (m/\Q[Req-URL ] - \E(.*)/) {
            $req = $1;
        }
        if (m/\*\*\*(Passed|Failed)\*\*\*/) {
            $counts{$1}{'case'}{$case}++;
            $counts{$1}{'req'}{$req}++;
        }
    }
    print Dumper(\%counts);
    __DATA__
    Execution start time 09/13/2013 02:43:55 pm
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Passed***
    __________________________________________________________
    [Case-Url] - www.yahoo.com
    [Req-URL ] - www.msn.com
    ***Passed***
    ___________________________________________________________
    [Case-Url] - www.google.com
    [Req-URL ] - www.qtp.com
    ***Failed***
    Output:
    $VAR1 = {
      'Passed' => {
        'req'  => { 'www.qtp.com' => 1, 'www.msn.com' => 1 },
        'case' => { 'www.google.com' => 1, 'www.yahoo.com' => 1 }
      },
      'Failed' => {
        'req'  => { 'www.qtp.com' => 1 },
        'case' => { 'www.google.com' => 1 }
      }
    };

Node Type: perlquestion [id://1054023]
Approved by toolic