Re: How best to strip text from a file?

Consider the following:

use strict;
use warnings;
use Data::Dumper;

my %hash;

while (<DATA>) {
    $hash{orderID}        //= do { /Order ID:(\S+)/;        $1 };
    $hash{fiscalCycle}    //= do { /cycle:(\d+)/;           $1 };
    $hash{vendorID}       //= do { /Vendor ID:(\S+)/;       $1 };
    $hash{requisitionNum} //= do { /\s+(\d+).+requisition/; $1 };
    $hash{copies}         //= do { /copies:(\d+)/;          $1 };
    $hash{title}          //= do { /Title:(.+)/;            $1 };
    $hash{'ISBN/ISSN'}    //= do { m{ISBN/ISSN:(\S+)};      $1 };

    if (/Distribution--/) {
        # From the first distribution on, read one whole distribution
        # record per iteration. local restores $/ when this block exits.
        local $/ = 'Distribution--';

        while (<DATA>) {
            my %tempHash;

            ( $tempHash{holdingCode} )  = /code:(\S+)/;
            ( $tempHash{copies} )       = /copies:(\d+)/;
            ( $tempHash{dateReceived} ) = /received:(\S+)/;
            ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;

            push @{ $hash{distribution} }, \%tempHash;
        }
    }
}

print Dumper \%hash;

__DATA__
                             List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9    
                call number:XX(9999999.999)                          
                  ISBN/ISSN:9999-999X           
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999      
            holding code:CODEINFO1                   copies:1    
           date received:27/6/2012                             date loaded:27/6/2012
              
        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2    
           date received:27/9/2012                             date loaded:27/6/2012
              
        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1    
           date received:25/8/2012                             date loaded:27/6/2012

Dumper output of %hash:

$VAR1 = {
          'vendorID' => 'VEND99',
          'copies' => '9',
          'fiscalCycle' => '21112',
          'distribution' => [
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/6/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO1'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/9/2012',
                                'copies' => '2',
                                'holdingCode' => 'CODEINFO3'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '25/8/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO2'
                              }
                            ],
          'ISBN/ISSN' => '9999-999X',
          'title' => 'Item title here.',
          'orderID' => 'PO-9999',
          'requisitionNum' => '15'
        };

This reads the data one line at a time, using the defined-or assignment operator (//=) and a regex to set each hash value the first time a line matches. Since there are multiple distributions, the input record separator ($/) is temporarily set to 'Distribution--' once the first distribution is detected, so each distribution chunk can be processed in one read. $hash{distribution} holds a reference to an array of hashes, one for each distribution record.
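
The defined-or assignment operator (//=) needs Perl 5.10 or later. As a minimal sketch, assuming an older Perl and using made-up sample lines modelled on the report, the same first-match-wins behaviour can be written with an explicit defined() test:

```perl
use strict;
use warnings;

# Hypothetical sample lines, modelled on the report layout above.
my @lines = (
    "       Order ID:PO-9999                  fiscal cycle:21112\n",
    "      Vendor ID:VEND99                     order type:SUBSCRIPT\n",
    "    15)   requisition number:                      copies:9    \n",
    "            holding code:CODEINFO1                   copies:1    \n",
);

my %hash;

for (@lines) {
    # Assign only while the key is still undefined AND this line matches,
    # so the first matching line wins and later matches are ignored.
    $hash{orderID} = $1 if !defined( $hash{orderID} ) && /Order ID:(\S+)/;
    $hash{copies}  = $1 if !defined( $hash{copies} )  && /copies:(\d+)/;
}

print "$_ => $hash{$_}\n" for sort keys %hash;
```

Here the order-level copies value (9) is kept even though a later line also contains "copies:", which is the same effect //= has in the code above.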

Perhaps you can set the input record separator so that you read in one order at a time, process it with the above, and then write the contents of %hash to an Excel spreadsheet.
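
As a rough sketch of that idea, assuming every order begins with the text 'Order ID:' and that no other field contains it, the report could be read one order per chunk. The two-order sample data, the @orders array and the choice of 'Order ID:' as the separator are all illustrative assumptions, and the spreadsheet step is left out:

```perl
use strict;
use warnings;

# Hypothetical two-order sample; the real report layout may differ.
my $report = <<'END';
          List of Distributions

       Order ID:PO-0001                  fiscal cycle:21112
         Title:First title
        Distribution--
            holding code:CODEA                   copies:1
        Distribution--
            holding code:CODEB                   copies:2
       Order ID:PO-0002                  fiscal cycle:21112
         Title:Second title
        Distribution--
            holding code:CODEC                   copies:3
END

my @orders;

{
    open my $fh, '<', \$report or die "Cannot open in-memory report: $!";
    local $/ = 'Order ID:';          # each read now ends at the next order marker

    my $page_header = <$fh>;         # text before the first order; not needed

    while ( my $chunk = <$fh> ) {
        chomp $chunk;                # chomp strips the trailing 'Order ID:' marker
        my %order;

        # The marker was consumed as the separator, so the ID starts the chunk.
        ( $order{orderID} ) = $chunk =~ /^(\S+)/;
        ( $order{title} )   = $chunk =~ /Title:(.+)/;

        # The first piece is the order header; the rest are distributions.
        my ( undef, @dists ) = split /Distribution--/, $chunk;
        for my $dist (@dists) {
            my %d;
            ( $d{holdingCode} ) = $dist =~ /code:(\S+)/;
            ( $d{copies} )      = $dist =~ /copies:(\d+)/;
            push @{ $order{distribution} }, \%d;
        }

        push @orders, \%order;
    }
}

printf "%s: %d distribution(s)\n", $_->{orderID},
    scalar @{ $_->{distribution} || [] }
    for @orders;
```

Because each chunk holds a complete order, no line of a following order is consumed while the distributions of the previous one are being read.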

Hope this helps!

Re^2: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 07, 2012 at 02:40 UTC
Kenosis - thank you VERY much for that. As someone already stated, very eloquent and nice and neat, to boot. I've not had a chance to come back to this until now, but will hopefully have a chance in the next few days to have a "play" with it and get my head around what you're doing (nothing wrong with your code... it's my head that needs sorting out. I don't play with Perl anywhere NEAR as much as I need to for doing some of this stuff efficiently!) Thanks for taking the time to do such an informative and helpful response... I'm quite sure I'll be able to make very good use of this.
Re^3: How best to strip text from a file? by Kenosis (Priest) on Nov 07, 2012 at 03:18 UTC
You're very welcome, bobdabuilda! I hope it'll fit your needs. Please let me know if you have any questions about it or if you encounter any problems using it...
Re^4: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 07, 2012 at 22:14 UTC
Well, I did get a chance to look at it yesterday before I headed home, and realised I didn't give as much example data as I should have - there are usually numerous Orders containing the multiple distributions... so I'm going to have a play with the logic today, hopefully, to work out how to perform that loop... The quick look I had at it got me there, to a point - but "lost" the first line of each subsequent order due to the way I had the loops set up... should hopefully be able to get that right today... but your code has certainly put me well and truly on the way to what I was after, and I'm very thankful for that :)
Re^5: How best to strip text from a file? by Kenosis (Priest) on Nov 07, 2012 at 23:31 UTC
Re^6: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 08, 2012 at 01:29 UTC
Some notes below your chosen depth have not been shown here
Re^2: How best to strip text from a file? by pemungkah (Priest) on Nov 02, 2012 at 22:32 UTC
That is elegant, and quite pretty as well!
Re^3: How best to strip text from a file? by Kenosis (Priest) on Nov 02, 2012 at 22:36 UTC
I'm honored, pemungkah. Thank you.
Re^2: How best to strip text from a file? by Anonymous Monk on Nov 05, 2012 at 10:31 UTC
I'm working on something similar, except the key/value pairs may span lines. e.g. `FOO: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat BAR: 2012` Is there a way to make perl "explain" what the regex is doing so I can adapt this to work with my data? Also is there a way to do this without using the smart matching feature? We use old perl, a change isn't possible right now.
Re^3: How best to strip text from a file? by Anonymous Monk on Nov 05, 2012 at 10:54 UTC
use re 'debug';
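
A minimal sketch of how that pragma might be applied to one of the regexes from the code above. The sample line is copied from the report, and the bare block is only there to limit the pragma's lexical scope so that other regexes stay quiet; the compile and match trace is written to STDERR:

```perl
use strict;
use warnings;

# One line copied from the sample report.
my $line = '       Order ID:PO-9999                  fiscal cycle:21112';

my $order_id;
{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    ($order_id) = $line =~ /Order ID:(\S+)/;
}

print "Captured: $order_id\n";
```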
Re^2: How best to strip text from a file? by Anonymous Monk on Nov 05, 2012 at 14:38 UTC
I have a similar but different problem. Say I have a file with a list of records; all have at least one field, "FOO:", while "BAR:" and "BAZ:" are optional fields. Each value may be multi-line, and the newlines aren't consistent between variables, e.g. `FOO: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore BAR: 2012 BAZ: 1234-567-890 FOO: test BAZ: 0987-654-321 FOO: test2 BAR: 2014` I'm having a hard time getting my head around regexes, and any help would be appreciated.
Re^3: How best to strip text from a file? by Corion (Patriarch) on Nov 05, 2012 at 14:49 UTC
Where does one record end and the next record start? If `FOO:` marks the start of a new record, I wouldn't try to collect everything with one regular expression but go through the input line by line, and either set up a new field name into which to collect, or flush the current set of data once a new starting marker has been found:

use strict;
use Data::Dumper;

my %record;

sub flush {
    print Dumper \%record;
    %record = ();
};

my $current;
while (<DATA>) {
    if( /^(FOO):(.*)/ ) {
        flush() if keys %record;
        $current = $1;
        $record{ $current }.= $2;
    } elsif( /^([A-Z]+):(.*)/ ) {
        $current = $1;
        $record{ $current }.= $2;
    } else {
        $record{ $current }.= $_;
    };
};
flush() if keys %record;

__DATA__
FOO: Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore
BAR: 2012
BAZ: 1234-567-890
FOO: test
BAZ: 0987-654-321
FOO: test2
BAR: 2014

