Re: How best to strip text from a file?

Consider the following:

use strict;
use warnings;
use Data::Dumper;

my %hash;

while (<DATA>) {
    $hash{orderID}        //= do { /Order ID:(\S+)/;        $1 };
    $hash{fiscalCycle}    //= do { /cycle:(\d+)/;           $1 };
    $hash{vendorID}       //= do { /Vendor ID:(\S+)/;       $1 };
    $hash{requisitionNum} //= do { /\s+(\d+).+requisition/; $1 };
    $hash{copies}         //= do { /copies:(\d+)/;          $1 };
    $hash{title}          //= do { /Title:(.+)/;            $1 };
    $hash{'ISBN/ISSN'}    //= do { m{ISBN/ISSN:(\S+)};      $1 };

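    # Once the first distribution marker is seen, read the rest of the
    # data in chunks that each end at the next 'Distribution--'.
    if (/Distribution--/) {
        # local restores the previous value of $/ when this block exits,
        # so no manual save and restore is needed.
        local $/ = 'Distribution--';

        while (<DATA>) {
            my %tempHash;

            ( $tempHash{holdingCode} )  = /code:(\S+)/;
            ( $tempHash{copies} )       = /copies:(\d+)/;
            ( $tempHash{dateReceived} ) = /received:(\S+)/;
            ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;

            push @{ $hash{distribution} }, \%tempHash;
        }
    }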
}

print Dumper \%hash;

__DATA__
                             List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM


       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9    
                call number:XX(9999999.999)                          
                  ISBN/ISSN:9999-999X           
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999      
            holding code:CODEINFO1                   copies:1    
           date received:27/6/2012                             date loaded:27/6/2012
              
        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2    
           date received:27/9/2012                             date loaded:27/6/2012
              
        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1    
           date received:25/8/2012                             date loaded:27/6/2012

Dumper output of %hash:

$VAR1 = {
          'vendorID' => 'VEND99',
          'copies' => '9',
          'fiscalCycle' => '21112',
          'distribution' => [
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/6/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO1'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/9/2012',
                                'copies' => '2',
                                'holdingCode' => 'CODEINFO3'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '25/8/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO2'
                              }
                            ],
          'ISBN/ISSN' => '9999-999X',
          'title' => 'Item title here.',
          'orderID' => 'PO-9999',
          'requisitionNum' => '15'
        };

This reads the data one line at a time, using the defined-or assignment operator (//=, available since Perl 5.10) and a regex to set each hash value the first time its pattern matches. Since there are multiple distributions, the input record separator ($/) is temporarily set to 'Distribution--' once the first distribution is detected, so that each distribution chunk can be processed as a whole. $hash{distribution} holds a reference to an array of hashes, one for each distribution record.
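
The separator switch can be seen in isolation in the minimal sketch below. It reads from an in-memory filehandle rather than DATA, and the sample text and the field names (code, copies) are made up for illustration; they are not the original report.

```perl
use strict;
use warnings;

# Hypothetical sample text (an assumption, not the original report).
my $text = "header\nDistribution--\ncode:A copies:1\nDistribution--\ncode:B copies:2\n";

# Open a read-only filehandle on the string itself.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records;
while ( my $line = <$fh> ) {
    # Read line by line until the first distribution marker.
    next unless $line =~ /Distribution--/;

    # From here on, one read returns everything up to and including the
    # next 'Distribution--'. local restores $/ when this block exits.
    local $/ = 'Distribution--';

    while ( my $chunk = <$fh> ) {
        my %rec;
        ( $rec{code} )   = $chunk =~ /code:(\S+)/;
        ( $rec{copies} ) = $chunk =~ /copies:(\d+)/;
        push @records, \%rec;
    }
}
close $fh;

printf "%s has %d copies\n", $_->{code}, $_->{copies} for @records;
```

Because $/ is set with local, it is back to its previous value once the enclosing block is left, so any later line-by-line reads behave normally.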

Perhaps you can set the input record separator so that you read in one order at a time, process each order with the above, and then write the contents of %hash to an Excel spreadsheet.
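
A rough sketch of that per-order idea follows. It assumes every order begins with 'Order ID:' (I have only seen a single order, so that is a guess), and parse_order, the two-order sample report, and the row layout are all hypothetical. It splits an in-memory string instead of changing $/, and it only builds the rows; handing them to a spreadsheet module such as Excel::Writer::XLSX from CPAN would be one way to finish the job.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the text of a single order into a hash ref.
sub parse_order {
    my ($text) = @_;
    my %order;

    ( $order{orderID} )  = $text =~ /Order ID:(\S+)/;
    ( $order{vendorID} ) = $text =~ /Vendor ID:(\S+)/;
    ( $order{title} )    = $text =~ /Title:(.+)/;

    # Discard the order header; each remaining piece is one distribution.
    my ( undef, @chunks ) = split /Distribution--/, $text;
    for my $chunk (@chunks) {
        my %dist;
        ( $dist{holdingCode} ) = $chunk =~ /code:(\S+)/;
        ( $dist{copies} )      = $chunk =~ /copies:(\d+)/;
        push @{ $order{distribution} }, \%dist;
    }
    return \%order;
}

# Hypothetical two-order report (an assumption about the real layout).
my $report = <<'END';
Order ID:PO-1   Vendor ID:V1
Title:First title
Distribution--
holding code:AAA  copies:1
Distribution--
holding code:BBB  copies:2
Order ID:PO-2   Vendor ID:V2
Title:Second title
Distribution--
holding code:CCC  copies:3
END

# Assumption: a new order starts wherever 'Order ID:' appears.
my @orders = map { parse_order($_) }
             grep { /Order ID:/ }
             split /(?=Order ID:)/, $report;

# Flatten to one row per distribution, ready for a spreadsheet writer.
my @rows;
for my $order (@orders) {
    for my $dist ( @{ $order->{distribution} || [] } ) {
        push @rows, [ @{$order}{qw(orderID vendorID title)},
                      @{$dist}{qw(holdingCode copies)} ];
    }
}

print join( "\t", @$_ ), "\n" for @rows;
```

One row per distribution repeats the order fields on every row, which tends to be easier to sort and filter in a spreadsheet than a nested layout.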

Hope this helps!
