Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Text Extraction

by JonDepp (Novice)
on Feb 05, 2010 at 14:47 UTC ( [id://821575]=perlquestion: print w/replies, xml ) Need Help??

JonDepp has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, I am new to perl and am writing a text parser to wittle a large text file down and extract peices I need to an output file. I have code written that extracts the part I need but it is also extracting everything else in the file. It also seems to be running through the file a few times as information is repeated in my output file. Here is my code:

use strict; use warnings; open TEST, "tests.txt" or die $!; open OUTPUT, "> output1.txt" or die$!; my @file = grep/SUBSCRIBER/../NATIONAL/, <TEST>; my $line; foreach $line(@file) { if ($line=~m/A1/) { print OUTPUT @file; } } close TEST; close OUTPUT;

I want the output file to just have the text located between SUBSCRIBER and NATIONAL if there is an A1 code on one of the lines.

Thanks!

Replies are listed 'Best First'.
Re: Text Extraction
by toolic (Bishop) on Feb 05, 2010 at 15:31 UTC
    There is probably a clever way to do it with grep and Range Operators, but here is a way using state variables. You should change the flags to more meaningful names for your application:
    use strict; use warnings; my $flag1 = 0; my $flag2 = 0; my @lines; while (<DATA>) { $flag2 = 1 if $flag1 and /A1/; $flag1 = 1 if /SUBSCRIBER/; push @lines, $_ if $flag1; if (/NATIONAL/) { print @lines if $flag2; $flag1 = 0; $flag2 = 0; @lines = (); } } __DATA__ foo bar SUBSCRIBER goo hoo nada NATIONAL SUBSCRIBER goo A1 NATIONAL junk junk junk
    Prints:
    SUBSCRIBER goo A1 NATIONAL

    Update: Ok, here's my solution with Range Operators:

    use strict; use warnings; my $flag = 0; my @lines; while (<DATA>) { if (/SUBSCRIBER/ .. /NATIONAL/) { push @lines, $_; $flag = 1 if /A1/; if (/NATIONAL/) { print @lines if $flag; $flag = 0; @lines = (); } } }

      Here is an example of my input file and these follow the same structure over and over:

      SUBSCRIBER DEMOGRAPHIC INFORMATION

      BIRTHDATE GENDER MEMBER IDENTIFICATION NUMBER

      NAME

      XXXXXXXXX

      TRACE NUMBER: XXXXXXXX

      CLAIM CLAIM PAYORS CLAIM NUMBER: XXXXXXXXXXXXXXXXXXXX

      PERIOD BEG PERIOD END MEDICAL RECORD NUMBER: 01/14/2010 01/14/2010 BILLING TYPE:

      EFFECTIVE ADJUDICATION PAYMENT CHARGE PAYMENT CHECK STATUS DATE PAYMENT DATE METHOD AMOUNT AMOUNT CHECK DATE NUMBER 02/01/2010 XX.XX 0.00

      CLAIM LEVEL STATUS CATEGORY: A1 STATUS: 19

      MODIFIER: PR PAGE: 11 CLINIC # XXXXXX (C980 ) XXXXXX REPORT NO: CPR601.01 SOMEINSURANCE HEALTH CARE CLAIM STATUS NOTIFICATION ISA CONTROL NO: XXXXXXXXXX ISA PROCESS DATE: 10/02/02 ISA PROCESS TIME: 04:52 GROUP CONTROL NO: XXXXXX ST CONTROL NO: XXXXXXXXX BHT REFERENCE ID: XXXXXXXXX BHT DATE: 02/02/2010 PAYOR NAME: SOMEINSURANCE ID: XXXXX PROVIDER NAME: XXXXXXXXXXX XXXXXXXXX XXXXX XX

      NATIONAL PROVIDER ID: XXXXXXXXXX

      I need everythin between SUBSCRIBER DEMOGRAPHIC - NATIONAL PROVIDER ID only if the CLAIM STATUS CATEGORY CODE is other than A1 (A3, A4, F2...there's a bunch).

      Here is the code I have so far.

      use strict; use warnings; open TEST, "tests.txt" or die $!; open OUTPUT, "> output1.txt" or die$!; my @data; my $data; while (<TEST>) { if (/SUBSCRIBER DEMOGRAPHIC/../CLAIM LEVEL STATUS CATEGORY/) { @data = $_; next; foreach ( $data, @data) { if ($data =~ /A1/) { print OUTPUT @data; } } } } close TEST; close OUTPUT;

      This code gets me no errors in syntax when I run it but I get 0 KB output file. Please Help!!

      That worked a lot better. I just realized that the text file I'm parsing has those regular expressions occurring all over the place so I have to refine the ones in my code. This is a great start and I'm sure I'll be back once I refine those expressions. Thanks for all the help!!

Re: Text Extraction
by holli (Abbot) on Feb 05, 2010 at 15:01 UTC
    You are outputting the wrong thing. Try
    print OUTPUT $line;


    holli

    You can lead your users to water, but alas, you cannot drown them.

      That does print out the one line to output, but if the A1 code matches I want all the text between SUBSCRIBER and NATIONAL to print to output.

Re: Text Extraction
by jwkrahn (Abbot) on Feb 05, 2010 at 17:06 UTC

    This should be close to what you require:

    #!/usr/bin/perl use strict; use warnings; open TEST, '<', 'tests.txt' or die "Cannot open 'tests.txt' $!"; open OUTPUT, '>', 'output1.txt' or die "Cannot open 'output1.txt' $!"; my $data; while ( <TEST> ) { if ( /SUBSCRIBER/ .. /NATIONAL/ ) { $data .= $_; next; } if ( length $data && $data =~ /A1/ ) { print OUTPUT $data; $data = ''; } } close TEST; close OUTPUT;

      This works great for the tests.txt file, but whenever I change the input file to something else I get an unitialized value $data in pattern match line 16 error, and I don't know why. Any help is greatly appreciated!!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://821575]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (2)
As of 2024-04-19 20:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found