Re^8: How best to strip text from a file?

Replies are listed 'Best First'.
Re^9: How best to strip text from a file? by Kenosis (Priest) on Nov 09, 2012 at 06:02 UTC
You're very welcome--and not a pain nor more problems! It was actually quite fun to work on and am glad it's working for you. I think you're correct about the reason why the defined-or-equals (`//=`) throws an error on your *nix box. I believe that the operator was introduced in v5.10. You can recode each of the `//=` lines as follows: This: `$hash{orderID} //= do { /Order ID:(\S+)/; $1 };` [download] Can be written as: `$hash{orderID} = $1 if !defined $hash{orderID} and /Order ID:(\S+)/;` [download] Again, let me know if you have any more questions about this script...	[reply] [d/l] [select]
Re^10: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 13, 2012 at 00:06 UTC
That certainly did the trick!! Nice one. So now, I've been playing around with the logic behind skipping duplicated orders within the file... and of course, having issues with it... Using the same script and data, and keeping in mind that orders, if duplicated, will always (I'm pretty certain) be duplicated in order - ie. an order will be replicated 3 or 4 or more times, and then the report moves on to a different order. Sorry for the long bit of code here, but figured it easiest to show you what I'm trying : use strict; use warnings; use Data::Dumper; use Spreadsheet::WriteExcel; # Place a filename into $recordsFile to read Orders from that file # else the Orders below __DATA__ will be used for demo purposes #my $recordsFile = 'finished_report_sample.txt'; my $recordsFile = ''; my ( @records, @orders ); my $recSeparator = 'Order ID:'; # Orders will initially be array elements 1 .. n in @orders; element 0 + is initially the first page header { # Set the record separator local $/ = $recSeparator; # If there's a file name, try to read from that file if ($recordsFile) { open my $fh, '<', $recordsFile or die $!; @records = <$fh>; close $fh; } # End If else { @records = <DATA>; } # End Else } # End preparatory loop # Remove the first page header shift @records; # Add Order ID: back into each record for later matching $_ = "$recSeparator$_" for @records; ########## Added for writing to Excel # Open a new xls file then create a sheet my $workbook = Spreadsheet::WriteExcel->new('distlist.xls'); my $worksheet= $workbook->add_worksheet(); # Write headings $worksheet->write(0,0,'Fiscal Year'); $worksheet->write(0,1,'Vendor'); $worksheet->write(0,2,'PO Number'); $worksheet->write(0,3,'Orderline'); $worksheet->write(0,4,'Title'); $worksheet->write(0,5,'ISBN/ISSN'); $worksheet->write(0,6,'# copies for Title'); $worksheet->write(0,7,'Distribution'); $worksheet->write(0,8,'Date Received'); $worksheet->write(0,9,'Date Loaded'); $worksheet->write(0,10,'Number of Copies'); # Initialise spreadheet counters my $row=1; my $column=0; # Set this up ready for checking for duplicate orders my $previousOrder=""; # Iterate through each record (Order) for my $record (@records) { my %hash; # Treat the record string like a file, opening it for reading open my $sh, '<', \$record or die "Unable to open record string: $ +!"; # Read the string like a file, one line at a time now while (<$sh>) { $hash{orderID} = $1 if !defined $hash{orderID} and /Ord +er ID:(\S+)/; $hash{fiscalCycle} = $1 if !defined $hash{fiscalCycle} and +/cycle:(\d+)/; $hash{vendorID} = $1 if !defined $hash{vendorID} and /Ve +ndor ID:(\S+)/; $hash{requisitionNum} = $1 if !defined $hash{requisitionNum} a +nd /\s+(\d+).+requisition/; $hash{copies} = $1 if !defined $hash{copies} and /copi +es:(\d+)/; $hash{'ISBN/ISSN'} = $1 if !defined $hash{'ISBN/ISSN'} and +m{ISBN/ISSN:(\S+)}; $hash{title} = $1 if !defined $hash{title} and /Title +:(.+)/; # CHeck to see if it's a repeat order, skip if it is. my ($hashReference) = \%hash; if (($previousOrder eq $$hashReference{orderID}) && ($previousOrde +r ne "")) { print "Order: $previousOrder HashOrder: $$hashReference{o +rderID} \n"; print "Order already processed. Skipping...\n"; last; } # End If else { $worksheet->write($row,0,$$hashReference{fiscalCycle}); $worksheet->write($row,1,$$hashReference{vendorID}); $worksheet->write($row,2,$$hashReference{orderID}); $worksheet->write($row,3,$$hashReference{requisitionNum}); $worksheet->write($row,4,$$hashReference{title}); $worksheet->write($row,5,$$hashReference{'ISBN/ISSN'}); $worksheet->write($row,6,$$hashReference{copies}); $previousOrder = $$hashReference{orderID}; # $row+=1; # Distributions started? if (/Distribution--/) { # Save the current record separator my $oldRecSeparator = $/; # Set a new record separator local $/ = 'Distribution--'; # Read the string like a file, a distribution 'chunk' +at a time while (<$sh>) { #I realise this hashing is now superfluous, with data bein +g written # direct to Excel, but am keeping changes to a minimum until + I get # the overall functionality correct. my %tempHash; ( $tempHash{holdingCode} ) = /code:(\S+)/; ( $tempHash{copies} ) = /copies:(\d+)/; ( $tempHash{dateReceived} ) = /received:(\S+)/; ( $tempHash{dateLoaded} ) = /loaded:(\S+)/; $worksheet->write($row,7,$tempHash{holdingCode}); $worksheet->write($row,8,$tempHash{dateReceived}); $worksheet->write($row,9,$tempHash{dateLoaded}); $worksheet->write($row,10,$tempHash{copies}); $row+=1; push @{ $hash{distribution} }, \%tempHash; } # End While # Restore the old record separator $/ = $oldRecSeparator; } # End If } # End Else } # End While # Work with the filled-in %hash by sending a reference to it to a +subroutine # This is a complete record # writeToSpreadSheet( \%hash ); # print Dumper \%hash; # Done 'reading' the string close $sh; } # End For - last of the loops $workbook->close(); # Printing in a subroutine's not a good idea, but done here only to sh +ow how to access the hash #sub writeToSpreadSheet { # my ($hashReference) = @_; # # The $$ notation dereferences the hash reference # print $$hashReference{vendorID}, "\n"; # # The @{} notation deferences the array reference; the arrow opera +tor deferences to get hash value # for my $distribution ( @{ $$hashReference{distribution} } ) { # print $distribution->{holdingCode}, "\n"; # } # # print "\n"; #} __DATA__ List of Distributions Produced Tuesday, 9 October, 2012 at 1:38 PM Order ID:PO-9999 fiscal cycle:21112 Vendor ID:VEND99 order type:SUBSCRIPT 15) requisition number: copies:9 call number:XX(9999999.999) ISBN/ISSN:9999-999X Title:Item title here. ISSN:9999-999X Publication info:More text here about stuff Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO1 copies:1 date received:27/6/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO3 copies:2 date received:27/9/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO2 copies:1 date received:25/8/2012 date lo +aded:27/6/2012 Order ID:PO-9999 fiscal cycle:21112 Vendor ID:VEND99 order type:SUBSCRIPT 15) requisition number: copies:9 call number:XX(9999999.999) ISBN/ISSN:9999-999X Title:Item title here. ISSN:9999-999X Publication info:More text here about stuff Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO1 copies:1 date received:27/6/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO3 copies:2 date received:27/9/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO2 copies:1 date received:25/8/2012 date lo +aded:27/6/2012 List of Distributions Produced Tuesday, 9 October, 2012 at 1:38 PM Order ID:PO-1111 fiscal cycle:21112 Vendor ID:VEND11 order type:SUBSCRIPT 15) requisition number: copies:417 call number:XX(11111111.111) ISBN/ISSN:1111-111X Title:Item title here. ISSN:9999-999X Publication info:More text here about stuff Distribution-- packing list:STUFF-I-DONT-NEED-111 holding code:CODEINFO9 copies:5 date received:11/6/2012 date lo +aded:12/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-111 holding code:CODEINFO8 copies:4 date received:11/9/2012 date lo +aded:12/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-111 holding code:CODEINFO7 copies:3 date received:11/8/2012 date lo +aded:12/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-111 holding code:CODEINFO6 copies:2 date received:11/8/2012 date lo +aded:12/6/2012 [download] Now, when I run this, somehow $previousOrder is getting a value in it at the "if" logic check, even though the logic check is the first time the variable is seen after it's initialised as ""... as you can see from the output of the first print command. The other thing I don't understand from this outcome, is that the resultant XLS file has only 2 fields (as well as the headers) written to it - the fiscal year and PO number. I would have thought that, if it made it to the section that writes the PO stuff at all, it would write it all - I don't understand how it can only write part of it? There's also an issue with the XLS file itself for some reason - giving an error when opening it... but I'm off to research that one myself now while I wait for suggestions on what I've stuffed up with my logic for the duplication check :)	[reply] [d/l]
Re^11: How best to strip text from a file? by Kenosis (Priest) on Nov 13, 2012 at 02:35 UTC
Hi, bobdabuilda! Am glad this is shaping up for you, and I think you've done well building on to this... Here are a few items for you to consider... Use another hash to keep track of previously-seen Orders. Without re-posting the entire code, here's what you can add to make this functional: Add a `%seen` hash: `my ( @records, @orders, %seen );` [download] Add a `RECORD:` label at the start of the line, in front of the `for` that iterates through each record (Order): `RECORD: for my $record (@records) { ...` [download] Finally, use an `if()` statement to process the Order ID (instead of a single line like with the other fields): `if (/Order ID:(\S+)/) { next RECORD if $seen{$1}; $seen{$1}++; $hash{orderID} = $1; }` [download] I suspect you can see what this does, but if an Order ID: match is found, it first checks to see if the ID's already been 'seen.' If so, it gets the next record for processing, else it tags the Order as seen and sets the hash. Write to your Excel file from within your subroutine. 1) You've got a complete record by the time the subroutine is called in the script (see its current position). You can send your subroutine the row number: `writeToSpreadSheet( $row, \%hash );` [download] Then, in the subroutine: `sub writeToSpreadSheet { my ($row, $hashReference) = @_; ...` [download] 2) By writing to the Excel spread sheet from within the subroutine, you're somewhat modularizing your program by keeping its functionality separate. Let me know how your spread sheet writing is going or if you encounter any other issues...	[reply] [d/l] [select]
Re^12: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 13, 2012 at 03:54 UTC
Re^13: How best to strip text from a file? by Kenosis (Priest) on Nov 13, 2012 at 04:30 UTC
Some notes below your chosen depth have not been shown here
Re^11: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 13, 2012 at 00:26 UTC
Just wanting to add that I found the issue with the XLS file is related to cells being written to twice in the code (according to the WriteExcel author id:710537 - check the link later on in the thread by him pointing to the Google Groups thread). So I went hunting, added an extra $row+=1 in early on in the piece in order to see what's going on... and it appears that my duplicate logic is working, to a point - with the additional line increasing my row number, I now get still just the Year and PO number - but I am getting one each for each of the PO's in the DATA. I put an extra few duplicates of the same data in to check - and still only got one copy of each of the Year and PO's... so I guess that's kinda a good thing. Now just need to work out why it's skipping the rest, so now I'm off to go about pulling it apart again to try and work this out... Scarily, I think I'm starting to enjoy this... lol	[reply]
Re^10: How best to strip text from a file? by bobdabuilda (Beadle) on Nov 13, 2012 at 01:51 UTC
Alright! Getting somewhere. Worked out part of the issue was based around the location of the WriteExcel block set up to write the Order details - placed it inside the "if (/Distribution--/) {" loop, and it sorted that issue out nicely. Makes sense now that I look back on it, as AFTER that distribution matches, it's finished processing all of the Order header fields - whereas prior to that, the first few times it hit the WriteExcel stuff, it was only partially processed. So, now it's trimming the Order list down nicely. TOO nicely. On looking back over the data, I noticed that each order had potential of having more than one title - so I was skipping data that should be kept. So, added in another check on the title, and that's not behaving as expected... of course... With this latest version, and the example data provided, the output I am expecting is 2 order entries for PC-9999, and 1 for PC-1111. 2 entries for the first, because there are 3 entries, but 2 of those 3 are duplicates (3rd one I added a "2" onto the title to make it different). However, it's only printing out one order for each... and I can't work out why... Any suggestions on how to track the issue down please? use strict; use warnings; use Data::Dumper; use Spreadsheet::WriteExcel; # Place a filename into $recordsFile to read Orders from that file # else the Orders below __DATA__ will be used for demo purposes #my $recordsFile = 'finished_report_sample.txt'; my $recordsFile = ''; my ( @records, @orders ); my $recSeparator = 'Order ID:'; # Orders will initially be array elements 1 .. n in @orders; element 0 + is initially the first page header { # Set the record separator local $/ = $recSeparator; # If there's a file name, try to read from that file if ($recordsFile) { open my $fh, '<', $recordsFile or die $!; @records = <$fh>; close $fh; } # End If else { @records = <DATA>; } # End Else } # End preparatory loop # Remove the first page header shift @records; # Add Order ID: back into each record for later matching $_ = "$recSeparator$_" for @records; ########## Added for writing to Excel # Open a new xls file then create a sheet my $workbook = Spreadsheet::WriteExcel->new('distlist.xls'); my $worksheet= $workbook->add_worksheet(); # Write headings $worksheet->write(0,0,'Fiscal Year'); $worksheet->write(0,1,'Vendor'); $worksheet->write(0,2,'PO Number'); $worksheet->write(0,3,'Orderline'); $worksheet->write(0,4,'Title'); $worksheet->write(0,5,'ISBN/ISSN'); $worksheet->write(0,6,'# copies for Title'); $worksheet->write(0,7,'Distribution'); $worksheet->write(0,8,'Date Received'); $worksheet->write(0,9,'Date Loaded'); $worksheet->write(0,10,'Number of Copies'); # Initialise spreadheet counters my $row=1; my $column=0; # Set this up ready for checking for duplicate orders my $previousOrder=""; my $previousTitle=""; # Iterate through each record (Order) for my $record (@records) { my %hash; # For testing WriteExcel # $row+=1; # Treat the record string like a file, opening it for reading open my $sh, '<', \$record or die "Unable to open record string: $ +!"; # Read the string like a file, one line at a time now while (<$sh>) { $hash{orderID} = $1 if !defined $hash{orderID} and /Ord +er ID:(\S+)/; $hash{fiscalCycle} = $1 if !defined $hash{fiscalCycle} and +/cycle:(\d+)/; $hash{vendorID} = $1 if !defined $hash{vendorID} and /Ve +ndor ID:(\S+)/; $hash{requisitionNum} = $1 if !defined $hash{requisitionNum} a +nd /\s+(\d+).+requisition/; $hash{copies} = $1 if !defined $hash{copies} and /copi +es:(\d+)/; $hash{'ISBN/ISSN'} = $1 if !defined $hash{'ISBN/ISSN'} and +m{ISBN/ISSN:(\S+)}; $hash{title} = $1 if !defined $hash{title} and /Title +:(.+)/; my ($hashReference) = \%hash; # Had to put this in to suppress warnings about $$hashReference{ti +tle} not being populated yet during the loop no warnings 'uninitialized'; # Check to see if it's a repeat order and title, skip if it is. if (($previousOrder eq $$hashReference{orderID}) && ($previousOrde +r ne "")){ if (($previousTitle eq $$hashReference{title}) && ($previousTi +tle ne "")) { print "Order: $previousOrder HashOrder: $$hashReference{ord +erID} Title: $previousTitle HashTitle: $$hashReference{title} \n"; print "Order already processed. Skipping...\n"; last; } # End if } # End If else { # Distributions started? if (/Distribution--/) { $worksheet->write($row,0,$$hashReference{fiscalCycle}); $worksheet->write($row,1,$$hashReference{vendorID}); $worksheet->write($row,2,$$hashReference{orderID}); $worksheet->write($row,3,$$hashReference{requisitionNum}); $worksheet->write($row,4,$$hashReference{title}); $worksheet->write($row,5,$$hashReference{'ISBN/ISSN'}); $worksheet->write($row,6,$$hashReference{copies}); # Set the current title and Order number for duplicate + checking $previousOrder = $$hashReference{orderID}; $previousTitle = $$hashReference{title}; # Save the current record separator my $oldRecSeparator = $/; # Set a new record separator local $/ = 'Distribution--'; # Read the string like a file, a distribution 'chunk' +at a time while (<$sh>) { #I realise this hashing is now superfluous, with data bein +g written # direct to Excel, but am keeping changes to a minimum until + I get # the overall functionality correct. my %tempHash; ( $tempHash{holdingCode} ) = /code:(\S+)/; ( $tempHash{copies} ) = /copies:(\d+)/; ( $tempHash{dateReceived} ) = /received:(\S+)/; ( $tempHash{dateLoaded} ) = /loaded:(\S+)/; $worksheet->write($row,7,$tempHash{holdingCode}); $worksheet->write($row,8,$tempHash{dateReceived}); $worksheet->write($row,9,$tempHash{dateLoaded}); $worksheet->write($row,10,$tempHash{copies}); $row+=1; push @{ $hash{distribution} }, \%tempHash; } # End While # Restore the old record separator $/ = $oldRecSeparator; } # End If } # End Else } # End While # Work with the filled-in %hash by sending a reference to it to a +subroutine # This is a complete record # writeToSpreadSheet( \%hash ); # print Dumper \%hash; # Done 'reading' the string close $sh; } # End For - last of the loops $workbook->close(); # Printing in a subroutine's not a good idea, but done here only to sh +ow how to access the hash #sub writeToSpreadSheet { # my ($hashReference) = @_; # # The $$ notation dereferences the hash reference # print $$hashReference{vendorID}, "\n"; # # The @{} notation deferences the array reference; the arrow opera +tor deferences to get hash value # for my $distribution ( @{ $$hashReference{distribution} } ) { # print $distribution->{holdingCode}, "\n"; # } # # print "\n"; #} __DATA__ List of Distributions Produced Tuesday, 9 October, 2012 at 1:38 PM Order ID:PO-9999 fiscal cycle:21112 Vendor ID:VEND99 order type:SUBSCRIPT 15) requisition number: copies:9 call number:XX(9999999.999) ISBN/ISSN:9999-999X Title:Item title here. ISSN:9999-999X Publication info:More text here about stuff Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO1 copies:1 date received:27/6/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO3 copies:2 date received:27/9/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO2 copies:1 date received:25/8/2012 date lo +aded:27/6/2012 Order ID:PO-9999 fiscal cycle:21112 Vendor ID:VEND99 order type:SUBSCRIPT 15) requisition number: copies:9 call number:XX(9999999.999) ISBN/ISSN:9999-999X Title:Item title here. ISSN:9999-999X Publication info:More text here about stuff Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO1 copies:1 date received:27/6/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO3 copies:2 date received:27/9/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO2 copies:1 date received:25/8/2012 date lo +aded:27/6/2012 Order ID:PO-9999 fiscal cycle:21112 Vendor ID:VEND99 order type:SUBSCRIPT 15) requisition number: copies:9 call number:XX(9999999.999) ISBN/ISSN:9999-999X Title:Item title here 2. ISSN:9999-999X Publication info:More text here about stuff Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO1 copies:1 date received:27/6/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO3 copies:2 date received:27/9/2012 date lo +aded:27/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-999 holding code:CODEINFO2 copies:1 date received:25/8/2012 date lo +aded:27/6/2012 List of Distributions Produced Tuesday, 9 October, 2012 at 1:38 PM Order ID:PO-1111 fiscal cycle:21112 Vendor ID:VEND11 order type:SUBSCRIPT 15) requisition number: copies:417 call number:XX(11111111.111) ISBN/ISSN:1111-111X Title:Item title here. ISSN:9999-999X Publication info:More text here about stuff Distribution-- packing list:STUFF-I-DONT-NEED-111 holding code:CODEINFO9 copies:5 date received:11/6/2012 date lo +aded:12/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-111 holding code:CODEINFO8 copies:4 date received:11/9/2012 date lo +aded:12/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-111 holding code:CODEINFO7 copies:3 date received:11/8/2012 date lo +aded:12/6/2012 Distribution-- packing list:STUFF-I-DONT-NEED-111 holding code:CODEINFO6 copies:2 date received:11/8/2012 date lo +aded:12/6/2012 [download]	[reply] [d/l]
Re^11: How best to strip text from a file? by Kenosis (Priest) on Nov 13, 2012 at 02:46 UTC
...each order had potential of having more than one title... What do multiple titles look like in an Order?	[reply]

You're very welcome--and not a pain nor more problems! It was actually quite fun to work on and am glad it's working for you.

I think you're correct about the reason why the defined-or-equals (//=) throws an error on your *nix box. I believe that the operator was introduced in v5.10. You can recode each of the //= lines as follows:

This:

$hash{orderID}        //= do { /Order ID:(\S+)/;        $1 };
[download]

Can be written as:

$hash{orderID} = $1 if !defined $hash{orderID} and /Order ID:(\S+)/;
[download]

Again, let me know if you have any more questions about this script...

[reply]
[d/l]
[select]

That certainly did the trick!! Nice one. So now, I've been playing around with the logic behind skipping duplicated orders within the file... and of course, having issues with it...

Using the same script and data, and keeping in mind that orders, if duplicated, will always (I'm pretty certain) be duplicated in order - ie. an order will be replicated 3 or 4 or more times, and then the report moves on to a different order.

Sorry for the long bit of code here, but figured it easiest to show you what I'm trying :

use strict;
use warnings;
use Data::Dumper;
use Spreadsheet::WriteExcel;

# Place a filename into $recordsFile to read Orders from that file
#  else the Orders below __DATA__ will be used for demo purposes
#my $recordsFile = 'finished_report_sample.txt';
my $recordsFile = '';


my ( @records, @orders );
my $recSeparator = 'Order ID:';

# Orders will initially be array elements 1 .. n in @orders; element 0
+ is initially the first page header
{
    # Set the record separator
    local $/ = $recSeparator;

    # If there's a file name, try to read from that file
    if ($recordsFile) {
        open my $fh, '<', $recordsFile or die $!;
        @records = <$fh>;
        close $fh;
    } # End If
    else {
        @records = <DATA>;
    } # End Else
} # End preparatory loop

# Remove the first page header
shift @records;

# Add Order ID: back into each record for later matching
$_ = "$recSeparator$_" for @records;

########## Added for writing to Excel
# Open a new xls file then create a sheet
my $workbook = Spreadsheet::WriteExcel->new('distlist.xls');
my $worksheet= $workbook->add_worksheet();

# Write headings
$worksheet->write(0,0,'Fiscal Year');
$worksheet->write(0,1,'Vendor');
$worksheet->write(0,2,'PO Number');
$worksheet->write(0,3,'Orderline');
$worksheet->write(0,4,'Title');
$worksheet->write(0,5,'ISBN/ISSN');
$worksheet->write(0,6,'# copies for Title');
$worksheet->write(0,7,'Distribution');
$worksheet->write(0,8,'Date Received');
$worksheet->write(0,9,'Date Loaded');
$worksheet->write(0,10,'Number of Copies');

# Initialise spreadheet counters
my $row=1;
my $column=0;

# Set this up ready for checking for duplicate orders
my $previousOrder="";

# Iterate through each record (Order)
for my $record (@records) {
    my %hash;
    # Treat the record string like a file, opening it for reading
    open my $sh, '<', \$record or die "Unable to open record string: $
+!";

    # Read the string like a file, one line at a time now
    while (<$sh>) {
        $hash{orderID}        = $1 if !defined $hash{orderID} and /Ord
+er ID:(\S+)/;
        $hash{fiscalCycle}    = $1 if !defined $hash{fiscalCycle} and 
+/cycle:(\d+)/;
        $hash{vendorID}       = $1 if !defined $hash{vendorID} and /Ve
+ndor ID:(\S+)/;
        $hash{requisitionNum} = $1 if !defined $hash{requisitionNum} a
+nd /\s+(\d+).+requisition/;
        $hash{copies}         = $1 if !defined $hash{copies} and /copi
+es:(\d+)/;
        $hash{'ISBN/ISSN'}    = $1 if !defined $hash{'ISBN/ISSN'} and 
+m{ISBN/ISSN:(\S+)};
        $hash{title}          = $1 if !defined $hash{title} and /Title
+:(.+)/;

      # CHeck to see if it's a repeat order, skip if it is.
      my ($hashReference) = \%hash;
    if (($previousOrder eq $$hashReference{orderID}) && ($previousOrde
+r ne "")) {
            print "Order: $previousOrder  HashOrder: $$hashReference{o
+rderID} \n";
        print "Order already processed. Skipping...\n";
        last;
    }    # End If
      else {
          $worksheet->write($row,0,$$hashReference{fiscalCycle});
          $worksheet->write($row,1,$$hashReference{vendorID});
          $worksheet->write($row,2,$$hashReference{orderID});
          $worksheet->write($row,3,$$hashReference{requisitionNum});
          $worksheet->write($row,4,$$hashReference{title});
          $worksheet->write($row,5,$$hashReference{'ISBN/ISSN'});
          $worksheet->write($row,6,$$hashReference{copies});
      $previousOrder = $$hashReference{orderID};
#          $row+=1;
            # Distributions started?
            if (/Distribution--/) {

                # Save the current record separator
                my $oldRecSeparator = $/;

                # Set a new record separator
                local $/ = 'Distribution--';

                # Read the string like a file, a distribution 'chunk' 
+at a time
                while (<$sh>) {

            #I realise this hashing is now superfluous, with data bein
+g written
          # direct to Excel, but am keeping changes to a minimum until
+ I get
          # the overall functionality correct.
                    my %tempHash;

                    ( $tempHash{holdingCode} )  = /code:(\S+)/;
                    ( $tempHash{copies} )       = /copies:(\d+)/;
                    ( $tempHash{dateReceived} ) = /received:(\S+)/;
                    ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;

                $worksheet->write($row,7,$tempHash{holdingCode});
                $worksheet->write($row,8,$tempHash{dateReceived});
                $worksheet->write($row,9,$tempHash{dateLoaded});
                $worksheet->write($row,10,$tempHash{copies});
                $row+=1;
                    push @{ $hash{distribution} }, \%tempHash;
                }    # End While

                # Restore the old record separator
                $/ = $oldRecSeparator;
            }    # End If
    }    # End Else
    } # End While

    # Work with the filled-in %hash by sending a reference to it to a 
+subroutine
    # This is a complete record
#  writeToSpreadSheet( \%hash );
#    print Dumper \%hash;

    # Done 'reading' the string
    close $sh;

} # End For - last of the loops

  $workbook->close();

# Printing in a subroutine's not a good idea, but done here only to sh
+ow how to access the hash
#sub writeToSpreadSheet {
#    my ($hashReference) = @_;
#
    # The $$ notation dereferences the hash reference
#    print $$hashReference{vendorID}, "\n";
#
    # The @{} notation deferences the array reference; the arrow opera
+tor deferences to get hash value
#    for my $distribution ( @{ $$hashReference{distribution} } ) {
#        print $distribution->{holdingCode}, "\n";
#    }
#
#    print "\n";
#}

__DATA__
                             List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9
                call number:XX(9999999.999)
                  ISBN/ISSN:9999-999X
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO1                   copies:1
           date received:27/6/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2
           date received:27/9/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1
           date received:25/8/2012                             date lo
+aded:27/6/2012


       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9
                call number:XX(9999999.999)
                  ISBN/ISSN:9999-999X
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO1                   copies:1
           date received:27/6/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2
           date received:27/9/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1
           date received:25/8/2012                             date lo
+aded:27/6/2012

                              List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-1111                  fiscal cycle:21112
      Vendor ID:VEND11                     order type:SUBSCRIPT
    15)   requisition number:                      copies:417
                call number:XX(11111111.111)
                  ISBN/ISSN:1111-111X
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO9                   copies:5
           date received:11/6/2012                             date lo
+aded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO8                    copies:4
           date received:11/9/2012                             date lo
+aded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO7                     copies:3
           date received:11/8/2012                             date lo
+aded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO6                     copies:2
           date received:11/8/2012                             date lo
+aded:12/6/2012
[download]

Now, when I run this, somehow $previousOrder is getting a value in it at the "if" logic check, even though the logic check is the first time the variable is seen after it's initialised as ""... as you can see from the output of the first print command.

The other thing I don't understand from this outcome, is that the resultant XLS file has only 2 fields (as well as the headers) written to it - the fiscal year and PO number. I would have thought that, if it made it to the section that writes the PO stuff at all, it would write it all - I don't understand how it can only write part of it?

There's also an issue with the XLS file itself for some reason - giving an error when opening it... but I'm off to research that one myself now while I wait for suggestions on what I've stuffed up with my logic for the duplication check :)

[reply]
[d/l]

Hi, bobdabuilda!

Am glad this is shaping up for you, and I think you've done well building on to this...

Here are a few items for you to consider...

Use another hash to keep track of previously-seen Orders. Without re-posting the entire code, here's what you can add to make this functional:

Add a %seen hash:

my ( @records, @orders, %seen );
[download]

Add a RECORD: label at the start of the line, in front of the for that iterates through each record (Order):

RECORD: for my $record (@records) { ...
[download]

Finally, use an if() statement to process the Order ID (instead of a single line like with the other fields):

if (/Order ID:(\S+)/) {
    next RECORD if $seen{$1};
    $seen{$1}++;
    $hash{orderID} = $1;
}
[download]

I suspect you can see what this does, but if an Order ID: match is found, it first checks to see if the ID's already been 'seen.' If so, it gets the next record for processing, else it tags the Order as seen and sets the hash.

Write to your Excel file from within your subroutine. 1) You've got a complete record by the time the subroutine is called in the script (see its current position). You can send your subroutine the row number:

writeToSpreadSheet( $row,  \%hash );
[download]

Then, in the subroutine:

sub writeToSpreadSheet {
    my ($row, $hashReference) = @_;
...
[download]

2) By writing to the Excel spread sheet from within the subroutine, you're somewhat modularizing your program by keeping its functionality separate.

Let me know how your spread sheet writing is going or if you encounter any other issues...

[reply]
[d/l]
[select]

Just wanting to add that I found the issue with the XLS file is related to cells being written to twice in the code (according to the WriteExcel author id:710537 - check the link later on in the thread by him pointing to the Google Groups thread).

So I went hunting, added an extra $row+=1 in early on in the piece in order to see what's going on... and it appears that my duplicate logic is working, to a point - with the additional line increasing my row number, I now get still just the Year and PO number - but I am getting one each for each of the PO's in the DATA. I put an extra few duplicates of the same data in to check - and still only got one copy of each of the Year and PO's... so I guess that's kinda a good thing. Now just need to work out why it's skipping the rest, so now I'm off to go about pulling it apart again to try and work this out...

Scarily, I think I'm starting to enjoy this... lol

[reply]

Alright! Getting somewhere. Worked out part of the issue was based around the location of the WriteExcel block set up to write the Order details - placed it inside the "if (/Distribution--/) {" loop, and it sorted that issue out nicely. Makes sense now that I look back on it, as AFTER that distribution matches, it's finished processing all of the Order header fields - whereas prior to that, the first few times it hit the WriteExcel stuff, it was only partially processed.

So, now it's trimming the Order list down nicely. TOO nicely. On looking back over the data, I noticed that each order had potential of having more than one title - so I was skipping data that should be kept.

So, added in another check on the title, and that's not behaving as expected... of course...

With this latest version, and the example data provided, the output I am expecting is 2 order entries for PC-9999, and 1 for PC-1111. 2 entries for the first, because there are 3 entries, but 2 of those 3 are duplicates (3rd one I added a "2" onto the title to make it different). However, it's only printing out one order for each... and I can't work out why...

Any suggestions on how to track the issue down please?

use strict;
use warnings;
use Data::Dumper;
use Spreadsheet::WriteExcel;

# Place a filename into $recordsFile to read Orders from that file
#  else the Orders below __DATA__ will be used for demo purposes
#my $recordsFile = 'finished_report_sample.txt';
my $recordsFile = '';


my ( @records, @orders );
my $recSeparator = 'Order ID:';

# Orders will initially be array elements 1 .. n in @orders; element 0
+ is initially the first page header
{
    # Set the record separator
    local $/ = $recSeparator;

    # If there's a file name, try to read from that file
    if ($recordsFile) {
        open my $fh, '<', $recordsFile or die $!;
        @records = <$fh>;
        close $fh;
    } # End If
    else {
        @records = <DATA>;
    } # End Else
} # End preparatory loop

# Remove the first page header
shift @records;

# Add Order ID: back into each record for later matching
$_ = "$recSeparator$_" for @records;

########## Added for writing to Excel
# Open a new xls file then create a sheet
my $workbook = Spreadsheet::WriteExcel->new('distlist.xls');
my $worksheet= $workbook->add_worksheet();

# Write headings
$worksheet->write(0,0,'Fiscal Year');
$worksheet->write(0,1,'Vendor');
$worksheet->write(0,2,'PO Number');
$worksheet->write(0,3,'Orderline');
$worksheet->write(0,4,'Title');
$worksheet->write(0,5,'ISBN/ISSN');
$worksheet->write(0,6,'# copies for Title');
$worksheet->write(0,7,'Distribution');
$worksheet->write(0,8,'Date Received');
$worksheet->write(0,9,'Date Loaded');
$worksheet->write(0,10,'Number of Copies');

# Initialise spreadheet counters
my $row=1;
my $column=0;

# Set this up ready for checking for duplicate orders
my $previousOrder="";
my $previousTitle="";

# Iterate through each record (Order)
for my $record (@records) {
    my %hash;

  # For testing WriteExcel
#  $row+=1;

    # Treat the record string like a file, opening it for reading
    open my $sh, '<', \$record or die "Unable to open record string: $
+!";

    # Read the string like a file, one line at a time now
    while (<$sh>) {
        $hash{orderID}        = $1 if !defined $hash{orderID} and /Ord
+er ID:(\S+)/;
        $hash{fiscalCycle}    = $1 if !defined $hash{fiscalCycle} and 
+/cycle:(\d+)/;
        $hash{vendorID}       = $1 if !defined $hash{vendorID} and /Ve
+ndor ID:(\S+)/;
        $hash{requisitionNum} = $1 if !defined $hash{requisitionNum} a
+nd /\s+(\d+).+requisition/;
        $hash{copies}         = $1 if !defined $hash{copies} and /copi
+es:(\d+)/;
        $hash{'ISBN/ISSN'}    = $1 if !defined $hash{'ISBN/ISSN'} and 
+m{ISBN/ISSN:(\S+)};
        $hash{title}          = $1 if !defined $hash{title} and /Title
+:(.+)/;

      my ($hashReference) = \%hash;

    # Had to put this in to suppress warnings about $$hashReference{ti
+tle} not being populated yet during the loop
    no warnings 'uninitialized';

      # Check to see if it's a repeat order and title, skip if it is.
    if (($previousOrder eq $$hashReference{orderID}) && ($previousOrde
+r ne "")){
        if (($previousTitle eq $$hashReference{title}) && ($previousTi
+tle ne "")) {
          print "Order: $previousOrder  HashOrder: $$hashReference{ord
+erID} Title: $previousTitle HashTitle: $$hashReference{title} \n";
            print "Order already processed. Skipping...\n";
            last;
      } # End if
    }    # End If
      else {
            # Distributions started?
            if (/Distribution--/) {
          $worksheet->write($row,0,$$hashReference{fiscalCycle});
          $worksheet->write($row,1,$$hashReference{vendorID});
          $worksheet->write($row,2,$$hashReference{orderID});
          $worksheet->write($row,3,$$hashReference{requisitionNum});
          $worksheet->write($row,4,$$hashReference{title});
          $worksheet->write($row,5,$$hashReference{'ISBN/ISSN'});
          $worksheet->write($row,6,$$hashReference{copies});

                # Set the current title and Order number for duplicate
+ checking
          $previousOrder = $$hashReference{orderID};
          $previousTitle = $$hashReference{title};

                # Save the current record separator
                my $oldRecSeparator = $/;

                # Set a new record separator
                local $/ = 'Distribution--';

                # Read the string like a file, a distribution 'chunk' 
+at a time
                while (<$sh>) {

            #I realise this hashing is now superfluous, with data bein
+g written
          # direct to Excel, but am keeping changes to a minimum until
+ I get
          # the overall functionality correct.
                    my %tempHash;

                    ( $tempHash{holdingCode} )  = /code:(\S+)/;
                    ( $tempHash{copies} )       = /copies:(\d+)/;
                    ( $tempHash{dateReceived} ) = /received:(\S+)/;
                    ( $tempHash{dateLoaded} )   = /loaded:(\S+)/;

                $worksheet->write($row,7,$tempHash{holdingCode});
                $worksheet->write($row,8,$tempHash{dateReceived});
                $worksheet->write($row,9,$tempHash{dateLoaded});
                $worksheet->write($row,10,$tempHash{copies});
                $row+=1;
                    push @{ $hash{distribution} }, \%tempHash;
                }    # End While

                # Restore the old record separator
                $/ = $oldRecSeparator;
            }    # End If
    }    # End Else
    } # End While

    # Work with the filled-in %hash by sending a reference to it to a 
+subroutine
    # This is a complete record
#  writeToSpreadSheet( \%hash );
#    print Dumper \%hash;

    # Done 'reading' the string
    close $sh;

} # End For - last of the loops

  $workbook->close();

# Printing in a subroutine's not a good idea, but done here only to sh
+ow how to access the hash
#sub writeToSpreadSheet {
#    my ($hashReference) = @_;
#
    # The $$ notation dereferences the hash reference
#    print $$hashReference{vendorID}, "\n";
#
    # The @{} notation deferences the array reference; the arrow opera
+tor deferences to get hash value
#    for my $distribution ( @{ $$hashReference{distribution} } ) {
#        print $distribution->{holdingCode}, "\n";
#    }
#
#    print "\n";
#}

__DATA__
                             List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9
                call number:XX(9999999.999)
                  ISBN/ISSN:9999-999X
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO1                   copies:1
           date received:27/6/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2
           date received:27/9/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1
           date received:25/8/2012                             date lo
+aded:27/6/2012


       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9
                call number:XX(9999999.999)
                  ISBN/ISSN:9999-999X
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO1                   copies:1
           date received:27/6/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2
           date received:27/9/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1
           date received:25/8/2012                             date lo
+aded:27/6/2012


       Order ID:PO-9999                  fiscal cycle:21112
      Vendor ID:VEND99                     order type:SUBSCRIPT
    15)   requisition number:                      copies:9
                call number:XX(9999999.999)
                  ISBN/ISSN:9999-999X
         Title:Item title here 2.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO1                   copies:1
           date received:27/6/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO3                    copies:2
           date received:27/9/2012                             date lo
+aded:27/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-999
            holding code:CODEINFO2                     copies:1
           date received:25/8/2012                             date lo
+aded:27/6/2012

                              List of Distributions

                  Produced Tuesday, 9 October, 2012 at 1:38 PM



       Order ID:PO-1111                  fiscal cycle:21112
      Vendor ID:VEND11                     order type:SUBSCRIPT
    15)   requisition number:                      copies:417
                call number:XX(11111111.111)
                  ISBN/ISSN:1111-111X
         Title:Item title here.
         ISSN:9999-999X
         Publication info:More text here about stuff

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO9                   copies:5
           date received:11/6/2012                             date lo
+aded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO8                    copies:4
           date received:11/9/2012                             date lo
+aded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO7                     copies:3
           date received:11/8/2012                             date lo
+aded:12/6/2012

        Distribution--
            packing list:STUFF-I-DONT-NEED-111
            holding code:CODEINFO6                     copies:2
           date received:11/8/2012                             date lo
+aded:12/6/2012
[download]

[reply]
[d/l]

...each order had potential of having more than one title...

What do multiple titles look like in an Order?

[reply]