PerlMonks  

Re: How best to strip text from a file?

by Kenosis (Priest)
on Nov 02, 2012 at 07:25 UTC (#1001926)


in reply to How best to strip text from a file?

Consider the following:

    use strict;
    use warnings;
    use Data::Dumper;

    my %hash;

    while (<DATA>) {

        # Defined-or assignment keeps the first value found for each field
        $hash{orderID}        //= do { /Order ID:(\S+)/;        $1 };
        $hash{fiscalCycle}    //= do { /cycle:(\d+)/;           $1 };
        $hash{vendorID}       //= do { /Vendor ID:(\S+)/;       $1 };
        $hash{requisitionNum} //= do { /\s+(\d+).+requisition/; $1 };
        $hash{copies}         //= do { /copies:(\d+)/;          $1 };
        $hash{title}          //= do { /Title:(.+)/;            $1 };
        $hash{'ISBN/ISSN'}    //= do { m{ISBN/ISSN:(\S+)};      $1 };

        if (/Distribution--/) {

            # From here on, read one distribution chunk per iteration
            my $oldDelim = $/;
            local $/ = 'Distribution--';

            while (<DATA>) {
                my %tempHash;
                ( $tempHash{holdingCode} )  = /code:(\S+)/;
                ( $tempHash{copies} )       = /copies:(\d+)/;
                ( $tempHash{dateReceived} ) = /received:(\S+)/;
                ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;
                push @{ $hash{distribution} }, \%tempHash;
            }

            $/ = $oldDelim;
        }
    }

    print Dumper \%hash;

    __DATA__
    List of Distributions    Produced Tuesday, 9 October, 2012 at 1:38 PM
    Order ID:PO-9999 fiscal cycle:21112 Vendor ID:VEND99 order type:SUBSCRIPT
    15) requisition number:
    copies:9 call number:XX(9999999.999)
    ISBN/ISSN:9999-999X
    Title:Item title here.
    ISSN:9999-999X
    Publication info:More text here about stuff
    Distribution--
    packing list:STUFF-I-DONT-NEED-999
    holding code:CODEINFO1 copies:1 date received:27/6/2012 date loaded:27/6/2012
    Distribution--
    packing list:STUFF-I-DONT-NEED-999
    holding code:CODEINFO3 copies:2 date received:27/9/2012 date loaded:27/6/2012
    Distribution--
    packing list:STUFF-I-DONT-NEED-999
    holding code:CODEINFO2 copies:1 date received:25/8/2012 date loaded:27/6/2012

Dumper output of %hash:

    $VAR1 = {
        'vendorID'     => 'VEND99',
        'copies'       => '9',
        'fiscalCycle'  => '21112',
        'distribution' => [
            {
                'dateLoaded'   => '27/6/2012',
                'dateReceived' => '27/6/2012',
                'copies'       => '1',
                'holdingCode'  => 'CODEINFO1'
            },
            {
                'dateLoaded'   => '27/6/2012',
                'dateReceived' => '27/9/2012',
                'copies'       => '2',
                'holdingCode'  => 'CODEINFO3'
            },
            {
                'dateLoaded'   => '27/6/2012',
                'dateReceived' => '25/8/2012',
                'copies'       => '1',
                'holdingCode'  => 'CODEINFO2'
            }
        ],
        'ISBN/ISSN'      => '9999-999X',
        'title'          => 'Item title here.',
        'orderID'        => 'PO-9999',
        'requisitionNum' => '15'
    };

This reads the data one line at a time, using defined-or assignment (//=) and a regex to set each hash value the first time a match occurs. Since there are multiple distributions, the input record separator ($/) is temporarily set to 'Distribution--' once the first distribution is detected, so each distribution chunk can be read and processed as a single record. $hash{distribution} holds a reference to an array of hashes, one for each distribution record.
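The two idioms described above can be seen in isolation in the minimal sketch below. It is not the code from the post: the input is a made-up string read through an in-memory filehandle, the separator 'END--' and the names %first and @chunks are purely illustrative, and Perl 5.10 or later is assumed for //=.

```perl
use strict;
use warnings;

# Hypothetical sample input; an in-memory filehandle stands in for a file.
my $text = "copies:9\ncopies:1\nEND--\nalpha\nEND--\nbeta\n";
open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my %first;
my @chunks;

while ( my $line = <$fh> ) {

    # //= assigns only while the left side is still undefined, so the
    # first match (9) sticks and the later "copies:1" is ignored.
    if ( $line =~ /copies:(\d+)/ ) {
        $first{copies} //= $1;
    }

    if ( $line =~ /^END--/ ) {

        # local restores the previous value of $/ automatically when this
        # block exits, so no manual save and restore is needed here.
        local $/ = 'END--';
        while ( my $chunk = <$fh> ) {
            chomp $chunk;                # chomp removes the current $/ string
            $chunk =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            push @chunks, $chunk;
        }
    }
}
close $fh;

# prints: copies=9 chunks=alpha beta sep_restored=yes
print "copies=$first{copies} chunks=@chunks sep_restored=",
    ( $/ eq "\n" ? 'yes' : 'no' ), "\n";
```

Because local already undoes the change to $/ at the end of the enclosing block, a sketch like this can skip the explicit $oldDelim bookkeeping.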

Perhaps you can set the input record separator so that you read in one order at a time, process it with the above, and then write the contents of %hash to an Excel spreadsheet.
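One way that suggestion might look is sketched below, assuming every order begins with the literal text 'Order ID:' (an assumption about the real report, since the sample above holds only one order). Setting $/ to that marker makes each read return one order's worth of text. The sub name parse_order and the two-order report are made up for illustration, only a few fields are extracted, and the Excel step is left out.

```perl
use strict;
use warnings;

# Parse one order's text. The leading "Order ID:" marker has already been
# consumed as the record separator, so the ID is the first token.
sub parse_order {
    my ($chunk) = @_;
    my %order = ( distribution => [] );

    # Everything before the first "Distribution--" is the order header;
    # each piece after it is one distribution record.
    my ( $head, @dists ) = split /Distribution--/, $chunk;

    ( $order{orderID} )     = $head =~ /^\s*(\S+)/;
    ( $order{fiscalCycle} ) = $head =~ /cycle:(\d+)/;
    ( $order{title} )       = $head =~ /Title:(.+)/;

    for my $dist (@dists) {
        my %d;
        ( $d{holdingCode} ) = $dist =~ /code:(\S+)/;
        ( $d{copies} )      = $dist =~ /copies:(\d+)/;
        push @{ $order{distribution} }, \%d;
    }
    return \%order;
}

# Hypothetical two-order report, shaped like the sample in the post.
my $report = <<'END_REPORT';
List of Distributions
Order ID:PO-1 fiscal cycle:21112
Title:First title
Distribution--
holding code:AAA copies:1
Order ID:PO-2 fiscal cycle:21113
Title:Second title
Distribution--
holding code:BBB copies:2
Distribution--
holding code:CCC copies:3
END_REPORT

open my $in, '<', \$report or die "Cannot open in-memory handle: $!";

my @orders;
{
    local $/ = 'Order ID:';     # one read now returns one whole order
    my $preamble = <$in>;       # report header before the first order; unused
    while ( my $chunk = <$in> ) {
        chomp $chunk;           # strip the trailing "Order ID:" marker
        push @orders, parse_order($chunk);
    }
}
close $in;

# prints:
#   PO-1: 1 distribution(s)
#   PO-2: 2 distribution(s)
printf "%s: %d distribution(s)\n", $_->{orderID}, scalar @{ $_->{distribution} }
    for @orders;
```

Reading whole orders this way means no line is consumed by one loop and then missed by another, at the cost of holding one full order in memory at a time.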

Hope this helps!


Re^2: How best to strip text from a file?
by pemungkah (Priest) on Nov 02, 2012 at 22:32 UTC
    That is elegant, and quite pretty as well!

      I'm honored, pemungkah. Thank you.

Re^2: How best to strip text from a file?
by Anonymous Monk on Nov 05, 2012 at 10:31 UTC
    I'm working on something similar, except the key/value pairs may span lines. e.g.
    FOO: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat
    BAR: 2012
    Is there a way to make perl "explain" what the regex is doing so I can adapt this to work with my data? Also, is there a way to do this without using the smart matching feature? We use an old perl, and a change isn't possible right now.
      use re 'debug';
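For what it's worth, the //= in the parent post is the defined-or assignment operator rather than smart matching; both arrived in Perl 5.10. A sketch of an equivalent that should run on older perls follows. The helper set_once and the sample lines are made up for illustration and come from nowhere in this thread.

```perl
use strict;
use warnings;

my %hash;

# Hypothetical helper: store $value under $key only if nothing is stored
# there yet. This is what `$hash{$key} //= $value` does on Perl 5.10+.
sub set_once {
    my ( $href, $key, $value ) = @_;
    $href->{$key} = $value unless defined $href->{$key};
    return;
}

# Made-up input lines standing in for the real file.
my @lines = (
    "Order ID:PO-9999 fiscal cycle:21112\n",
    "copies:9\n",
    "copies:1\n",
);

for my $line (@lines) {

    # A list-context match returns the captures, or an empty list on
    # failure, so each variable is undef when its pattern does not match.
    my ($id)     = $line =~ /Order ID:(\S+)/;
    my ($copies) = $line =~ /copies:(\d+)/;

    set_once( \%hash, 'orderID', $id );
    set_once( \%hash, 'copies',  $copies );
}

print "$hash{orderID} $hash{copies}\n";    # prints: PO-9999 9
```

The list-context match also avoids reading $1 after a match that may have failed, which makes the logic a little easier to follow when adapting the patterns.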
Re^2: How best to strip text from a file?
by Anonymous Monk on Nov 05, 2012 at 14:38 UTC
    I have a similar but different problem. Say I have a file with a list of records. All of them have at least one field, "FOO:"; "BAR:" and "BAZ:" are optional fields. Each value may span multiple lines, and the newlines aren't consistent between variables, e.g.
    FOO: Lorem ipsum dolor sit amet, consectetur adipisicing elit,
    sed do eiusmod tempor incididunt ut labore et dolore
    BAR: 2012
    BAZ: 1234-567-890
    FOO: test
    BAZ: 0987-654-321
    FOO: test2
    BAR: 2014
    I'm having a hard time getting my head around regexes, and any help would be appreciated.

      Where does one record end and the next record start?

      If FOO: marks the start of a new record, I wouldn't try to collect everything with one regular expression but go through the input line by line, and either set up a new field name into which to collect, or flush the current set of data once a new starting marker has been found:

      use strict;
      use Data::Dumper;

      my %record;

      # Print the record collected so far, then start afresh
      sub flush {
          print Dumper \%record;
          %record = ();
      };

      my $current;    # name of the field currently being collected
      while (<DATA>) {
          if( /^(FOO):(.*)/ ) {
              # FOO: starts a new record, so emit the previous one first
              flush() if keys %record;
              $current = $1;
              $record{ $current }.= $2;
          } elsif( /^([A-Z]+):(.*)/ ) {
              # Any other FIELD: line starts a new field in the same record
              $current = $1;
              $record{ $current }.= $2;
          } else {
              # Continuation line: append it to the current field
              $record{ $current }.= $_;
          };
      };
      flush() if keys %record;

      __DATA__
      FOO: Lorem ipsum dolor sit amet, consectetur adipisicing elit,
      sed do eiusmod tempor incididunt ut labore et dolore
      BAR: 2012
      BAZ: 1234-567-890
      FOO: test
      BAZ: 0987-654-321
      FOO: test2
      BAR: 2014
Re^2: How best to strip text from a file?
by bobdabuilda (Sexton) on Nov 07, 2012 at 02:40 UTC

    Kenosis - thank you VERY much for that. As someone already stated, very elegant and nice and neat, to boot.

    I've not had a chance to come back to this until now, but will hopefully have a chance in the next few days to have a "play" with it and get my head around what you're doing (nothing wrong with your code... it's my head that needs sorting out. I don't play with Perl anywhere NEAR as much as I need to for doing some of this stuff efficiently!)

    Thanks for taking the time to do such an informative and helpful response... I'm quite sure I'll be able to make very good use of this.

      You're very welcome, bobdabuilda! I hope it'll fit your needs.

      Please let me know if you have any questions about it or if you encounter any problems using it...

        Well, I did get a chance to look at it yesterday before I headed home, and realised I didn't give as much example data as I should have - there are usually numerous Orders containing the multiple distributions... so I'm going to have a play with the logic today, hopefully, to work out how to perform that loop...

        The quick look I had at it got me there, to a point - but "lost" the first line of each subsequent order due to the way I had the loops set up... should hopefully be able to get that right today... but your code has certainly put me well and truly on the way to what I was after, and I'm very thankful for that :)
