http://www.perlmonks.org?node_id=1001906

bobdabuilda has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys. I've been mulling this one over for a while now, trying to work out how best to handle it.

One of my users has a report she would like to run which results in over 1000 pages of rather poorly presented text. Needless to say, she would like it if I could arrange for this to be trimmed considerably. My intention is to build this up in stages: first get the data extraction right, and then use Spreadsheet::WriteExcel to put it into a much friendlier format for her.

See below for an example of the kind of text I'm working with - note this is just part of one "order", and these are repeated numerous times throughout the file. All the dates etc. can vary, so each "order" and each "distribution" within an order needs to be extracted and handled individually.

List of Distributions
Produced Tuesday, 9 October, 2012 at 1:38 PM

  Order ID:PO-9999   fiscal cycle:21112   Vendor ID:VEND99   order type:SUBSCRIPT
  15) requisition number:          copies:9
      call number:XX(9999999.999)   ISBN/ISSN:9999-999X
      Title:Item title here.
      ISSN:9999-999X
      Publication info:More text here about stuff
      Distribution--
          packing list:STUFF-I-DONT-NEED-999   holding code:CODEINFO1
          copies:1   date received:27/6/2012   date loaded:27/6/2012
      Distribution--
          packing list:STUFF-I-DONT-NEED-999   holding code:CODEINFO3
          copies:2   date received:27/9/2012   date loaded:27/6/2012
      Distribution--
          packing list:STUFF-I-DONT-NEED-999   holding code:CODEINFO2
          copies:1   date received:25/8/2012   date loaded:27/6/2012

So, out of that, I need to grab the values for (in order of appearance): Order ID, fiscal cycle, Vendor ID, the number to the left of "requisition number", copies, title, ISBN/ISSN, holding code, copies, date received, and date loaded.

My idea at this point is to loop through the file, looking for the start of the data I need to grab. I have managed to get this part done using:

open (IN, "<$distfile") or die "Can't open $distfile\n"; print "File opened\n"; while ($line = <IN>) { chomp($line); if ( $line =~ /^(\s+.+)Order ID/ ) { print "Found Order ID\n"; } }

That part works fine. I then moved on to actually trying to extract the data from the crud. The idea was to just loop through each bracket of data, stripping the data out of each line separately.

So, for the first one, I designed a regex which matches everything except the two pieces of data I want from the "Order ID" line. The plan was to negate that match, and dump the results into a variable, then move on to the next line. Sounded relatively easy, but I've not been able to work out where I've gone wrong with it... I think if I can get a little help working out how to do this one line, then the rest of it should fall into place pretty readily...

The test I have been trying to use for this is:

my ($order,$fiscal) = ($line !~ m/(\s+Order ID:|\s+fiscal cycle:)/g);
print "Order # $order, Fiscal year: $fiscal\n";

As it stands above, I get a string printed with null values. If I change the "!~" to "=~" then I get the output:

Order #        Order ID:, Fiscal year:                   fiscal cycle:

... which is why I was trying to negate the regex match. So... could you please help me understand where it is I'm going wrong with this? Am I going about this the right way, or should I be thinking along different lines for processing all this text?

Thanks in advance for any assistance you can give.

Replies are listed 'Best First'.
Re: How best to strip text from a file?
by Kenosis (Priest) on Nov 02, 2012 at 07:25 UTC

    Consider the following:

    use strict;
    use warnings;
    use Data::Dumper;

    my %hash;

    while (<DATA>) {
        $hash{orderID}        //= do { /Order ID:(\S+)/;        $1 };
        $hash{fiscalCycle}    //= do { /cycle:(\d+)/;           $1 };
        $hash{vendorID}       //= do { /Vendor ID:(\S+)/;       $1 };
        $hash{requisitionNum} //= do { /\s+(\d+).+requisition/; $1 };
        $hash{copies}         //= do { /copies:(\d+)/;          $1 };
        $hash{title}          //= do { /Title:(.+)/;            $1 };
        $hash{'ISBN/ISSN'}    //= do { m{ISBN/ISSN:(\S+)};      $1 };

        if (/Distribution--/) {
            my $oldDelim = $/;
            local $/ = 'Distribution--';

            while (<DATA>) {
                my %tempHash;
                ( $tempHash{holdingCode} )  = /code:(\S+)/;
                ( $tempHash{copies} )       = /copies:(\d+)/;
                ( $tempHash{dateReceived} ) = /received:(\S+)/;
                ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;
                push @{ $hash{distribution} }, \%tempHash;
            }

            $/ = $oldDelim;
        }
    }

    print Dumper \%hash;

    __DATA__
    List of Distributions
    Produced Tuesday, 9 October, 2012 at 1:38 PM

      Order ID:PO-9999   fiscal cycle:21112   Vendor ID:VEND99   order type:SUBSCRIPT
      15) requisition number:          copies:9
          call number:XX(9999999.999)   ISBN/ISSN:9999-999X
          Title:Item title here.
          ISSN:9999-999X
          Publication info:More text here about stuff
          Distribution--
              packing list:STUFF-I-DONT-NEED-999   holding code:CODEINFO1
              copies:1   date received:27/6/2012   date loaded:27/6/2012
          Distribution--
              packing list:STUFF-I-DONT-NEED-999   holding code:CODEINFO3
              copies:2   date received:27/9/2012   date loaded:27/6/2012
          Distribution--
              packing list:STUFF-I-DONT-NEED-999   holding code:CODEINFO2
              copies:1   date received:25/8/2012   date loaded:27/6/2012

    Dumper output of %hash:

    $VAR1 = {
              'vendorID' => 'VEND99',
              'copies' => '9',
              'fiscalCycle' => '21112',
              'distribution' => [
                                  {
                                    'dateLoaded' => '27/6/2012',
                                    'dateReceived' => '27/6/2012',
                                    'copies' => '1',
                                    'holdingCode' => 'CODEINFO1'
                                  },
                                  {
                                    'dateLoaded' => '27/6/2012',
                                    'dateReceived' => '27/9/2012',
                                    'copies' => '2',
                                    'holdingCode' => 'CODEINFO3'
                                  },
                                  {
                                    'dateLoaded' => '27/6/2012',
                                    'dateReceived' => '25/8/2012',
                                    'copies' => '1',
                                    'holdingCode' => 'CODEINFO2'
                                  }
                                ],
              'ISBN/ISSN' => '9999-999X',
              'title' => 'Item title here.',
              'orderID' => 'PO-9999',
              'requisitionNum' => '15'
            };

    This reads a line of data at a time, using the defined-or assignment operator (//=) with a regex to set each hash value the first time a match occurs. Since there are multiple distributions, the input record separator ($/) is temporarily set to 'Distribution--' when the first distribution is detected, so each distribution chunk can be processed all at once. $hash{distribution} pairs to an array of hashes--one for each distribution record.

    Perhaps you can set the input record separator so you read in one order at a time, process it with the above, and then write the contents of %hash to an Excel spreadsheet.
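    For that last step, here's a rough sketch of what writing %hash out with Spreadsheet::WriteExcel might look like (the file name, column layout, and one-row-per-distribution choice are just illustrative, not part of the code above):

    use strict;
    use warnings;
    use Spreadsheet::WriteExcel;

    my %hash;    # assumed to be filled in by the parsing code above

    my $workbook  = Spreadsheet::WriteExcel->new('orders.xls');
    my $worksheet = $workbook->add_worksheet();

    my $row = 0;
    $worksheet->write_row( $row++, 0,
        [ 'Order ID', 'Fiscal cycle', 'Vendor', 'Holding code',
          'Copies',   'Date received', 'Date loaded' ] );

    # One spreadsheet row per distribution, repeating the order-level fields
    for my $dist ( @{ $hash{distribution} || [] } ) {
        $worksheet->write_row( $row++, 0, [
            @hash{ qw(orderID fiscalCycle vendorID) },
            @{$dist}{ qw(holdingCode copies dateReceived dateLoaded) },
        ] );
    }

    $workbook->close();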

    Hope this helps!

      kenosis - thank you VERY much for that. As someone already stated, very eloquent and nice and neat, to boot.

      I've not had a chance to come back to this until now, but will hopefully have a chance in the next few days to have a "play" with it and get my head around what you're doing (nothing wrong with your code... it's my head that needs sorting out. I don't play with Perl anywhere NEAR as much as I need to for doing some of this stuff efficiently!)

      Thanks for taking the time to do such an informative and helpful response... I'm quite sure I'll be able to make very good use of this.

        You're very welcome, bobdabuilda! I hope it'll fit your needs.

        Please let me know if you have any questions about it or if you encounter any problems using it...

      That is elegant, and quite pretty as well!

        I'm honored, pemungkah. Thank you.

      I'm working on something similar, except the key/value pairs may span lines, e.g.:
      FOO: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
      eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
      minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip
      ex ea commodo consequat. Duis aute irure dolor in reprehenderit in
      voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur
      sint occaecat cupidatat
      BAR: 2012
      Is there a way to make perl "explain" what the regex is doing so I can adapt this to work with my data? Also, is there a way to do this without using the smart matching feature? We use an old perl; a change isn't possible right now.
        use re 'debug';
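        As for the other part: as far as I can see, the only 5.10-or-later construct in the code above is the defined-or assignment operator (//=), not smart matching, and it can be replaced with an explicit defined test. An untested sketch of that substitution, keeping the same field names (the __DATA__ line here is only a placeholder):

        use strict;
        use warnings;

        my %hash;

        # Pre-5.10 equivalent of the //= lines: only assign a field the
        # first time its pattern matches.
        while (<DATA>) {
            $hash{orderID}     = $1 if !defined $hash{orderID}     and /Order ID:(\S+)/;
            $hash{fiscalCycle} = $1 if !defined $hash{fiscalCycle} and /cycle:(\d+)/;
            $hash{vendorID}    = $1 if !defined $hash{vendorID}    and /Vendor ID:(\S+)/;
            # ... the remaining fields follow the same pattern
        }

        __DATA__
        Order ID:PO-9999   fiscal cycle:21112   Vendor ID:VEND99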
      I have a similar but different problem. Say I have a file with a list of records; all have at least one field, "FOO:", while "BAR:" and "BAZ:" are optional fields. Each value may span multiple lines, and the line breaks aren't consistent between fields, e.g.:
      FOO: Lorem ipsum dolor sit amet, consectetur
      adipisicing elit, sed do eiusmod tempor incididunt
      ut labore et dolore
      BAR: 2012
      BAZ: 1234-567-890
      FOO: test
      BAZ: 0987-654-321
      FOO: test2
      BAR: 2014
      I'm having a hard time getting my head around regexes, and any help would be appreciated.

        Where does one record end and the next record start?

        If FOO: marks the start of a new record, I wouldn't try to collect everything with one regular expression but go through the input line by line, and either set up a new field name into which to collect, or flush the current set of data once a new starting marker has been found:

        use strict;
        use Data::Dumper;

        my %record;

        sub flush {
            print Dumper \%record;
            %record = ();
        };

        my $current;

        while (<DATA>) {
            if( /^(FOO):(.*)/ ) {
                flush() if keys %record;
                $current = $1;
                $record{ $current } .= $2;
            } elsif( /^([A-Z]+):(.*)/ ) {
                $current = $1;
                $record{ $current } .= $2;
            } else {
                $record{ $current } .= $_;
            };
        };

        flush() if keys %record;

        __DATA__
        FOO: Lorem ipsum dolor sit amet, consectetur
        adipisicing elit, sed do eiusmod tempor incididunt
        ut labore et dolore
        BAR: 2012
        BAZ: 1234-567-890
        FOO: test
        BAZ: 0987-654-321
        FOO: test2
        BAR: 2014
Re: How best to strip text from a file?
by hperange (Beadle) on Nov 02, 2012 at 03:36 UTC
    I think the regex you are looking for, in this particular case is:
    /^\s+Order ID:([a-zA-Z0-9-]+)\s+fiscal cycle:(\d+)/

    See perlrequick and perlre.
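    For instance, a minimal (untested) sketch of pulling the two captures out, assuming $line holds one line of the report as in your existing loop:

    if ( $line =~ /^\s+Order ID:([a-zA-Z0-9-]+)\s+fiscal cycle:(\d+)/ ) {
        my ( $order, $fiscal ) = ( $1, $2 );
        print "Order # $order, Fiscal year: $fiscal\n";
    }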

    To approach the original problem, I think you should develop a routine which will read a record into a buffer, and have a separate routine which will handle the parsing of one record. You can then use different routines within your parsing "framework" to handle the parsing of different structures.

    Some pseudocode:
    my $in_rec = 0;
    my ($head_re, $tail_re) = (qr/Start of record/, qr/End of record/);
    my @record;

    while (<>) {
        chomp;
        if ($in_rec) {
            if (/$tail_re/) {
                $in_rec = 0;
                parse_record(@record);
            }
            push @record, $_;
        }
        else {
            if (/$head_re/) {
                $in_rec = 1;
                @record = ();
            }
            push @record, $_ if $in_rec;
        }
    }

    I hope this makes sense; also bear in mind this is only pseudocode, demonstrating the logic I would go for, not actual parsing.

      Thanks for that! I did actually grab a copy of your pseudocode in passing when I noticed your quick reply, so I could start mulling it over, to see how best to fit it in with what I'm doing.

      I think that, combined with the code below, will let me sort something out. Thanks again :)

Re: How best to strip text from a file?
by Anonymous Monk on Nov 02, 2012 at 03:18 UTC
Re: How best to strip text from a file?
by sundialsvc4 (Abbot) on Nov 02, 2012 at 13:58 UTC

    The tool that was the original inspiration for Perl, awk, points to what might be a good, generalized way to approach this problem.

    Consider the inputs that you have, one line at a time. Each line of data is recognizable as being of a particular type, by conforming to some pattern. For each line-type, the “interesting bits” can also readily be found by a line-type specific regular expression. You can see that this sort of problem could be described in terms of a list of regular expressions which are tested one by one (thus identifying the line type), with a short block of code describing what to do with each one. That is exactly the approach that awk takes, and one that a Perl program can take as well.
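    A bare-bones sketch of that dispatch-table idea (the patterns and actions here are placeholders for illustration, not the real report format):

    use strict;
    use warnings;

    # Each entry pairs a line-type pattern with the code to run on a match
    my @rules = (
        [ qr/Order ID:(\S+)/,  sub { print "order: $1\n";  } ],
        [ qr/Vendor ID:(\S+)/, sub { print "vendor: $1\n"; } ],
        [ qr/Distribution--/,  sub { print "new distribution\n"; } ],
    );

    while ( my $line = <> ) {
        for my $rule (@rules) {
            my ( $re, $action ) = @$rule;
            if ( $line =~ $re ) {
                $action->();
                last;    # stop at the first pattern that identifies the line
            }
        }
    }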

Re: How best to strip text from a file?
by Disharum (Initiate) on Nov 05, 2012 at 15:20 UTC
    In addition to the useful comments submitted by other monks, you may want to look into using MARC Utilities, a tool built in Perl/Tk which handles the kind of data you are dealing with. Check the function that converts Sirsi records into tabular format. Best of luck.
Re: How best to strip text from a file?
by Disharum (Initiate) on Nov 06, 2012 at 12:13 UTC
    Oh, and another option is to have the person running the report export the data in XML format which can easily be manipulated in Excel. That way you do not have to intervene at all :-}

      Ahhh if only life were so simple ;)

      Nice spotting on the Sirsi side of things, by the way, and thanks for the suggestions :)