Hi, bobdabuilda

You've given this much thought, and I think your pseudocode is on target.

The orders are only separated by a blank line, but they all start with the "Order ID:" text, so looking at using that as the separator.

Using "Order ID:" as the record separator makes sense.
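In case the mechanics are unfamiliar, here is a minimal sketch of what setting $/ to that text does to a read. The sample text and the variable names ($text, @chunks, $header) are made up for the demo and are not taken from your report:

```perl
use strict;
use warnings;

# Made-up sample: a page header followed by two tiny orders
my $text = <<'END';
Page header
Order ID:PO-1 fiscal cycle:2012
Order ID:PO-2 fiscal cycle:2013
END

my @chunks;
{
    # $/ is the input record separator; local keeps the change
    # confined to this block
    local $/ = 'Order ID:';

    # Read the string like a file, one separator-terminated chunk at a time
    open my $fh, '<', \$text or die "Unable to open string: $!";
    @chunks = <$fh>;
    close $fh;

    # chomp removes the current $/ from the end of each chunk
    chomp @chunks;
}

# The first chunk is whatever came before the first "Order ID:"
my $header = shift @chunks;

# Put the separator text back on the front of each order
$_ = "Order ID:$_" for @chunks;

print "Header: $header";
print "Order:  $_" for @chunks;
```

Each chunk ends at the separator rather than starting with it, which is why the first chunk is the header and why the separator has to be added back to the front of the rest.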

The page header should be automatically filtered out by the regex the way it stands anyway... I think.

You're correct.
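If you'd like to check that for yourself, one rough way is to run the field patterns against the header lines and count the hits. This is only a sketch: the patterns are copied from the script further down, and @patterns, @headerLines and @hitCounts are names invented for the demo:

```perl
use strict;
use warnings;

# The field patterns used in the script, collected in one list
my @patterns = (
    qr/Order ID:(\S+)/,         qr/cycle:(\d+)/,
    qr/Vendor ID:(\S+)/,        qr/\s+(\d+).+requisition/,
    qr/copies:(\d+)/,           qr/Title:(.+)/,
    qr{ISBN/ISSN:(\S+)},        qr/code:(\S+)/,
    qr/received:(\S+)/,         qr/loaded:(\S+)/,
);

# The two non-blank page header lines from the sample report
my @headerLines = (
    "                             List of Distributions",
    "                  Produced Tuesday, 9 October, 2012 at 1:38 PM",
);

# For each header line, count how many patterns match it
my @hitCounts = map {
    my $line = $_;
    scalar grep { $line =~ $_ } @patterns;
} @headerLines;

print "$hitCounts[$_] pattern(s) matched: $headerLines[$_]\n"
    for 0 .. $#headerLines;
```

A count of zero for both lines means a header that lands inside an order's text contributes nothing to the hash. A differently worded header in the real report would need the same check.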

I've taken the liberty of implementing one interpretation of this. It does use two loops, but the outer loop is a for loop that iterates over an array of Order records:

use strict;
use warnings;
use Data::Dumper;

# Place a filename into $recordsFile to read Orders from that file
#  else the Orders below __DATA__ will be used for demo purposes
my $recordsFile = '';

my @records;
my $recSeparator = 'Order ID:';

# Orders will initially be array elements 1 .. n in @records; element 0
# is initially the first page header
{
    # Set the record separator
    local $/ = $recSeparator;

    # If there's a file name, try to read from that file
    if ($recordsFile) {
        open my $fh, '<', $recordsFile or die $!;
        @records = <$fh>;
        close $fh;
    }
    else {
        @records = <DATA>;
    }
}

# Remove the first page header
shift @records;

# Add Order ID: back into each record for later matching
$_ = "$recSeparator$_" for @records;

# Iterate through each record (Order)
for my $record (@records) {
    my %hash;

    # Treat the record string like a file, opening it for reading
    open my $sh, '<', \$record
      or die "Unable to open record string: $!";

    # Read the string like a file, one line at a time now
    while (<$sh>) {
        $hash{orderID}        //= do { /Order ID:(\S+)/;        $1 };
        $hash{fiscalCycle}    //= do { /cycle:(\d+)/;           $1 };
        $hash{vendorID}       //= do { /Vendor ID:(\S+)/;       $1 };
        $hash{requisitionNum} //= do { /\s+(\d+).+requisition/; $1 };
        $hash{copies}         //= do { /copies:(\d+)/;          $1 };
        $hash{title}          //= do { /Title:(.+)/;            $1 };
        $hash{'ISBN/ISSN'}    //= do { m{ISBN/ISSN:(\S+)};      $1 };

        # Distributions started?
        if (/Distribution--/) {

            # Set a new record separator; local restores the old one
            # automatically when this block exits
            local $/ = 'Distribution--';

            # Read the string like a file, a distribution 'chunk' at a
            # time
            while (<$sh>) {
                my %tempHash;

                ( $tempHash{holdingCode} )  = /code:(\S+)/;
                ( $tempHash{copies} )       = /copies:(\d+)/;
                ( $tempHash{dateReceived} ) = /received:(\S+)/;
                ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;

                push @{ $hash{distribution} }, \%tempHash;
            }
        }
    }

    # Work with the filled-in %hash by sending a reference to it to a
    # subroutine. This is a complete record
    writeToSpreadSheet( \%hash );
    
    print Dumper \%hash;

    # Done 'reading' the string
    close $sh;
}


# Printing in a subroutine's not a good idea, but done here only to
# show how to access the hash
sub writeToSpreadSheet {
    my ($hashReference) = @_;

    # The $$ notation dereferences the hash reference
    print $$hashReference{vendorID}, "\n";

    # The @{} notation dereferences the array reference; the arrow
    # operator dereferences to get the hash value
    for my $distribution ( @{ $$hashReference{distribution} } ) {
        print $distribution->{holdingCode}, "\n";
    }

    print "\n";
}

__DATA__
                             List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9    
                call number:XX(9999999.999)                          
                  ISBN/ISSN:9999-999X           
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999      
            holding code:CODEINFO1                   copies:1    
           date received:27/6/2012                             date loaded:27/6/2012
              
        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2    
           date received:27/9/2012                             date loaded:27/6/2012
              
        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1    
           date received:25/8/2012                             date loaded:27/6/2012

                              List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-1111                  fiscal cycle:21112
      Vendor ID:VEND11                     order type:SUBSCRIPT
    15)   requisition number:                      copies:417    
                call number:XX(11111111.111)                          
                  ISBN/ISSN:1111-111X           
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-111      
            holding code:CODEINFO9                   copies:5    
           date received:11/6/2012                             date loaded:12/6/2012
              
        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO8                    copies:4    
           date received:11/9/2012                             date loaded:12/6/2012
              
        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO7                     copies:3    
           date received:11/8/2012                             date loaded:12/6/2012
           
        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO6                     copies:2    
           date received:11/8/2012                             date loaded:12/6/2012

Output

VEND99
CODEINFO1
CODEINFO3
CODEINFO2

$VAR1 = {
          'vendorID' => 'VEND99',
          'copies' => '9',
          'fiscalCycle' => '21112',
          'distribution' => [
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/6/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO1'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '27/9/2012',
                                'copies' => '2',
                                'holdingCode' => 'CODEINFO3'
                              },
                              {
                                'dateLoaded' => '27/6/2012',
                                'dateReceived' => '25/8/2012',
                                'copies' => '1',
                                'holdingCode' => 'CODEINFO2'
                              }
                            ],
          'ISBN/ISSN' => '9999-999X',
          'title' => 'Item title here.',
          'orderID' => 'PO-9999',
          'requisitionNum' => '15'
        };
VEND11
CODEINFO9
CODEINFO8
CODEINFO7
CODEINFO6

$VAR1 = {
          'vendorID' => 'VEND11',
          'copies' => '417',
          'fiscalCycle' => '21112',
          'distribution' => [
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/6/2012',
                                'copies' => '5',
                                'holdingCode' => 'CODEINFO9'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/9/2012',
                                'copies' => '4',
                                'holdingCode' => 'CODEINFO8'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/8/2012',
                                'copies' => '3',
                                'holdingCode' => 'CODEINFO7'
                              },
                              {
                                'dateLoaded' => '12/6/2012',
                                'dateReceived' => '11/8/2012',
                                'copies' => '2',
                                'holdingCode' => 'CODEINFO6'
                              }
                            ],
          'ISBN/ISSN' => '1111-111X',
          'title' => 'Item title here.',
          'requisitionNum' => '15',
          'orderID' => 'PO-1111'
        };

I included a subroutine, and a call to it, that shows how to access the hash one record at a time.
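As one idea of where you could take that, here is a rough sketch of a second subroutine that walks the same structure with the arrow notation and totals the copies across an order's distributions. The %order hash is hand-built to look like the first record in the Dumper output, and distributedCopies is a hypothetical helper name, not something from your code:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hand-built record shaped like the first order in the Dumper output
my %order = (
    orderID      => 'PO-9999',
    vendorID     => 'VEND99',
    copies       => 9,
    distribution => [
        { holdingCode => 'CODEINFO1', copies => 1 },
        { holdingCode => 'CODEINFO3', copies => 2 },
        { holdingCode => 'CODEINFO2', copies => 1 },
    ],
);

# Hypothetical helper: total the copies across all distributions
sub distributedCopies {
    my ($orderRef) = @_;

    # "// []" guards against an order with no distributions at all;
    # the leading 0 makes sum return 0 for an empty list
    my @distributions = @{ $orderRef->{distribution} // [] };
    return sum( 0, map { $_->{copies} // 0 } @distributions );
}

my $ref = \%order;
print "$ref->{orderID}: ", distributedCopies($ref),
    " of $ref->{copies} copies distributed\n";
```

Whether comparing the two numbers is meaningful for your report is something only you can judge; the point is just the access pattern.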

The code is commented to assist with understanding it.
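The //= lines are probably the least obvious part, so here is a small sketch of the same first-match-wins idea written with a list-context match instead. The %patterns table and the sample lines are assumptions for the demo, and this is not meant as a drop-in replacement for the script:

```perl
use strict;
use warnings;

# Assumed sample lines, shaped like the top of one order
my @lines = (
    "       Order ID:PO-9999                  fiscal cycle:21112\n",
    "      Vendor ID:VEND99                     order type:SUBSCRIPT\n",
    "         ISSN:9999-999X\n",
);

# Hypothetical field => pattern table
my %patterns = (
    orderID     => qr/Order ID:(\S+)/,
    fiscalCycle => qr/cycle:(\d+)/,
    vendorID    => qr/Vendor ID:(\S+)/,
);

my %hash;
for my $line (@lines) {
    for my $field ( keys %patterns ) {

        # Skip fields that already have a value (the job //= does)
        next if defined $hash{$field};

        # In list context a match returns its captures, or the empty
        # list on failure, so $value is undef when nothing matched
        my ($value) = $line =~ $patterns{$field};
        $hash{$field} = $value if defined $value;
    }
}

print "$_ => $hash{$_}\n" for sort keys %hash;
```

The list-context form avoids reading $1 after a match that may have failed, which is the one thing to be careful about with the do { ...; $1 } version.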

Let me know if you have any questions about this...

Enjoy!

In reply to Re^7: How best to strip text from a file? by Kenosis
in thread How best to strip text from a file? by bobdabuilda
