Text Extraction

JonDepp has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Text Extraction by toolic (Bishop) on Feb 05, 2010 at 15:31 UTC
There is probably a clever way to do it with grep and Range Operators, but here is a way using state variables. You should change the flags to more meaningful names for your application: `use strict; use warnings; my $flag1 = 0; my $flag2 = 0; my @lines; while (<DATA>) { $flag2 = 1 if $flag1 and /A1/; $flag1 = 1 if /SUBSCRIBER/; push @lines, $_ if $flag1; if (/NATIONAL/) { print @lines if $flag2; $flag1 = 0; $flag2 = 0; @lines = (); } } __DATA__ foo bar SUBSCRIBER goo hoo nada NATIONAL SUBSCRIBER goo A1 NATIONAL junk junk junk` [download] Prints: `SUBSCRIBER goo A1 NATIONAL` [download] Update: Ok, here's my solution with Range Operators: `use strict; use warnings; my $flag = 0; my @lines; while (<DATA>) { if (/SUBSCRIBER/ .. /NATIONAL/) { push @lines, $_; $flag = 1 if /A1/; if (/NATIONAL/) { print @lines if $flag; $flag = 0; @lines = (); } } }` [download]	[reply] [d/l] [select]
Re^2: Text Extraction by JonDepp (Novice) on Feb 08, 2010 at 19:16 UTC
Here is an example of my input file and these follow the same structure over and over: SUBSCRIBER DEMOGRAPHIC INFORMATION BIRTHDATE GENDER MEMBER IDENTIFICATION NUMBER NAME XXXXXXXXX TRACE NUMBER: XXXXXXXX CLAIM CLAIM PAYORS CLAIM NUMBER: XXXXXXXXXXXXXXXXXXXX PERIOD BEG PERIOD END MEDICAL RECORD NUMBER: 01/14/2010 01/14/2010 BILLING TYPE: EFFECTIVE ADJUDICATION PAYMENT CHARGE PAYMENT CHECK STATUS DATE PAYMENT DATE METHOD AMOUNT AMOUNT CHECK DATE NUMBER 02/01/2010 XX.XX 0.00 CLAIM LEVEL STATUS CATEGORY: A1 STATUS: 19 MODIFIER: PR PAGE: 11 CLINIC # XXXXXX (C980 ) XXXXXX REPORT NO: CPR601.01 SOMEINSURANCE HEALTH CARE CLAIM STATUS NOTIFICATION ISA CONTROL NO: XXXXXXXXXX ISA PROCESS DATE: 10/02/02 ISA PROCESS TIME: 04:52 GROUP CONTROL NO: XXXXXX ST CONTROL NO: XXXXXXXXX BHT REFERENCE ID: XXXXXXXXX BHT DATE: 02/02/2010 PAYOR NAME: SOMEINSURANCE ID: XXXXX PROVIDER NAME: XXXXXXXXXXX XXXXXXXXX XXXXX XX NATIONAL PROVIDER ID: XXXXXXXXXX I need everythin between SUBSCRIBER DEMOGRAPHIC - NATIONAL PROVIDER ID only if the CLAIM STATUS CATEGORY CODE is other than A1 (A3, A4, F2...there's a bunch). Here is the code I have so far. `use strict; use warnings; open TEST, "tests.txt" or die $!; open OUTPUT, "> output1.txt" or die$!; my @data; my $data; while (<TEST>) { if (/SUBSCRIBER DEMOGRAPHIC/../CLAIM LEVEL STATUS CATEGORY/) { @data = $_; next; foreach ( $data, @data) { if ($data =~ /A1/) { print OUTPUT @data; } } } } close TEST; close OUTPUT;` [download] This code gets me no errors in syntax when I run it but I get 0 KB output file. Please Help!!	[reply] [d/l]
Re^2: Text Extraction by JonDepp (Novice) on Feb 05, 2010 at 16:44 UTC
That worked a lot better. I just realized that the text file I'm parsing has those regular expressions occurring all over the place so I have to refine the ones in my code. This is a great start and I'm sure I'll be back once I refine those expressions. Thanks for all the help!!	[reply]
Re^3: Text Extraction by planetscape (Chancellor) on Feb 06, 2010 at 15:04 UTC
You might benefit from something in My Favourite Regex Tools. HTH, planetscape	[reply]
Re: Text Extraction by holli (Abbot) on Feb 05, 2010 at 15:01 UTC
You are outputting the wrong thing. Try `print OUTPUT $line;` [download] holli You can lead your users to water, but alas, you cannot drown them.	[reply] [d/l]
Re^2: Text Extraction by JonDepp (Novice) on Feb 05, 2010 at 15:14 UTC
That does print out the one line to output, but if the A1 code matches I want all the text between SUBSCRIBER and NATIONAL to print to output.	[reply]
Re: Text Extraction by jwkrahn (Abbot) on Feb 05, 2010 at 17:06 UTC
This should be close to what you require: `#!/usr/bin/perl use strict; use warnings; open TEST, '<', 'tests.txt' or die "Cannot open 'tests.txt' $!"; open OUTPUT, '>', 'output1.txt' or die "Cannot open 'output1.txt' $!"; my $data; while ( <TEST> ) { if ( /SUBSCRIBER/ .. /NATIONAL/ ) { $data .= $_; next; } if ( length $data && $data =~ /A1/ ) { print OUTPUT $data; $data = ''; } } close TEST; close OUTPUT;` [download]	[reply] [d/l]
Re^2: Text Extraction by JonDepp (Novice) on Feb 05, 2010 at 21:17 UTC
This works great for the tests.txt file, but whenever I change the input file to something else I get an unitialized value $data in pattern match line 16 error, and I don't know why. Any help is greatly appreciated!!	[reply]


Just another Perl shrine
	PerlMonks