http://www.perlmonks.org?node_id=796730

TStanley has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow Monks!

I have been tasked with creating a CSV file that will be imported into a new document archiving system, and after a lot of hacking, I have turned a 100MB text file into a file just under 1MB in size after removing the non-needed parts and getting it formatted. Below is a sample of what I am currently working with:

1401,PERISH STORE INV,Quarterly,(V) INV810 USER NAME (EDIT REPORT IF 0 +5309 PRSH INV),EXTRACT WAS RUN),(V) INV820 USER NAME (PERISHABLE INVE +NTORY REPORT),(VTD8) D:\DEPT\ACCT\MAIL\INV820.DAT,(V) INV820C USER NA +ME (PERISHABLE INVENTORY-BAKERY),(QUARTERLY RUN ONLY) 1402,PERSH INV BOOKS,Quarterly,(V) INV805 58 COPIES - USER NAME,(V) IN +V805 2 COPIES - USER NAME,(V) INV805A XTRA COPIES-SAVE IN COMPUTER RO +OM,ANNUAL STORE INVENTORY ONLY:,(V) INV805 58 COPIES - USER NAME,(V) +INV805 2 COPIES - USER NAME,(V) INV805A XTRA COPIES-SAVE IN COMPUTER +ROOM 1403,BAKERY INV BOOKS,Quarterly,(V) INV805 35 COPIES - USER NAME,(V) I +NV805A 5 COPIES - SAVE IN COMPUTER ROOM,ANNUAL STORE INVENTORY ONLY:, +(J) INV805 35 COPIES - USER NAME,(J) INV805A 5 COPIES - SAVE IN COMPU +TER ROOM 1501,INV PRICE GUIDE,As Needed,(V) MSC052 BOTH COPIES ARE TO BE PUT IN + BINDERS AND,AND LEFT ON THE CABINETS BEHIND THE DP,SECRETARY'S DESK

What I need to accomplish from the above is to break down the fourth field and beyond so that each one is on its own line, along with the first three lines, so the first line would look something like:

1401,PERISH STORE INV,Quarterly,(V) INV810 USER NAME (EDIT REPORT IF 0 +5309 PRSH INV) 1401,PERISH STORE INV,Quarterly,(EXTRACT WAS RUN) 1401,PERISH STORE INV,Quarterly,(V) INV820 USER NAME (PERISHABLE INVEN +TORY REPORT) 1401,PERISH STORE INV,Quarterly,(VTD8) D:\DEPT\ACCT\MAIL\INV820.DAT 1401,PERISH STORE INV,Quarterly,(V) INV820C USER NAME (PERISHABLE INVE +NTORY-BAKERY)(QUARTERLY RUN ONLY)

As you can see, what makes this a challenge is that each line is a different size depending upon how many reports there are to be distributed, as well as the number of people that the reports go to.

My plan is to take each line of the csv file and split it into an array with the 4th element being a reference to another array which whould contain all of the remaining lines. How would I go about creating the output csv file to look like my desired output? My code so far is below:

#!C:\Perl\bin\perl use strict; use warnings; open INPUT "input.csv"||die "Can not open: $!\n"; open OUTPUT "output.csv"||die "Can not open: $!\n"; while(<INPUT>){ my @line = split /,/; my @line2 = @line[0..2]; my @dist = @line[3..$#line]; my $ref = \@dist; push @line2,$ref; ## Print to OUTPUT } close INPUT; close OUTPUT;

TStanley
--------
People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell

Replies are listed 'Best First'.
Re: Generating an output file
by kennethk (Abbot) on Sep 22, 2009 at 15:23 UTC
    What are you doing with these files? Is there a pre-existing script/program that you will feed them to, or are you just using this for temporary storage. If the latter, you may consider using the Storable module instead of creating some custom format. As well, you may want to use some of the existing Perl CSV modules (like Text::CSV) rather than rolling your own.

    A list of lists type structure would seem easiest for what you want do (like what you've done), although it would seem easier to create a hash keyed on a primary identifier (1401 in the example maybe). This would also let you give elements English names, which makes coming back to the file format later a lot easier. So maybe something like:

    #!C:\Perl\bin\perl use strict; use warnings; open my $input, "<", "input.csv" or die "Can not open: $!\n"; open my $output, ">", "output.csv" or die "Can not open: $!\n"; my %data = (); while(<$input>){ my @line = split /,/; my $key = shift @line; $data{$key} = {}; # Empty hash ref $data{$key}{store} = shift @line; $data{$key}{period} = shift @line; $data{$key}{dist} = \@line; # First three entries have been removed } foreach my $key (keys %data) { foreach my $dist ( @{ $data{$key}{dist} } ) { print $output join(",", $key, $data{$key}{store}, $data{$key}{peri +od}, $dist), "\n"; } } close $input; close $output;

    Please also take note that I have changed your open statements to 3-argument opens, swapped to lexical file handles and changed from the high precedence || to the low precedence or. The virtues of these structures have been extolled extensively in the archives.

      This worked exactly as advertised. Thank you for your help.

      TStanley
      --------
      People sleep peaceably in their beds at night only because rough men stand ready to do violence on their behalf. -- George Orwell
Re: Generating an output file
by bv (Friar) on Sep 22, 2009 at 15:23 UTC

    You could store the rest of the line (each sub-entry) in a separate array, then loop over that. Here's one I put together quickly:

    while(<INPUT>){ my (@prefix,@rest); chomp; (@prefix[0..2],@rest) = split /,/; for (@rest) { local $, = ','; local $\ = "\n"; print @prefix,$_; } }
    print pack("A25",pack("V*",map{1919242272+$_}(34481450,-49737472,6228,0,-285028276,6979,-1380265972)))
Re: Generating an output file
by Bloodnok (Vicar) on Sep 22, 2009 at 16:44 UTC
    Since being informed about Text::XSV, in such circumstances, I tend to use nothing else but it:
    use warnings; use strict; use Text::xSV; my $csv = Text::xSV->new(fh => *DATA, row_size_warning => undef); while (my $row = $csv->get_row()) { my @prefix = splice @$row, 0, 3; foreach (@$row) { print join ',', @prefix, $_ . "\n"; } } __DATA__ 1401,PERISH STORE INV,Quarterly,(V) INV810 USER NAME (EDIT REPORT IF 0 +5309 PRSH INV),EXTRACT WAS RUN),(V) INV820 USER NAME (PERISHABLE INVE +NTORY REPORT),(VTD8) D:\DEPT\ACCT\MAIL\INV820.DAT,(V) INV820C USER NA +ME (PERISHABLE INVENTORY-BAKERY),(QUARTERLY RUN ONLY) 1402,PERSH INV BOOKS,Quarterly,(V) INV805 58 COPIES - USER NAME,(V) IN +V805 2 COPIES - USER NAME,(V) INV805A XTRA COPIES-SAVE IN COMPUTER RO +OM,ANNUAL STORE INVENTORY ONLY:,(V) INV805 58 COPIES - USER NAME,(V) +INV805 2 COPIES - USER NAME,(V) INV805A XTRA COPIES-SAVE IN COMPUTER +ROOM 1403,BAKERY INV BOOKS,Quarterly,(V) INV805 35 COPIES - USER NAME,(V) I +NV805A 5 COPIES - SAVE IN COMPUTER ROOM,ANNUAL STORE INVENTORY ONLY:, +(J) INV805 35 COPIES - USER NAME,(J) INV805A 5 COPIES - SAVE IN COMPU +TER ROOM 1501,INV PRICE GUIDE,As Needed,(V) MSC052 BOTH COPIES ARE TO BE PUT IN + BINDERS AND,AND LEFT ON THE CABINETS BEHIND THE DP,SECRETARY'S DESK
    Which is, IMHO, not too unreadable (:-D) and when run, gives:
    $ perl tst.pl 1401,PERISH STORE INV,Quarterly,(V) INV810 USER NAME (EDIT REPORT IF 0 +5309 PRSH INV) 1401,PERISH STORE INV,Quarterly,EXTRACT WAS RUN) 1401,PERISH STORE INV,Quarterly,(V) INV820 USER NAME (PERISHABLE INVEN +TORY REPORT) 1401,PERISH STORE INV,Quarterly,(VTD8) D:\DEPT\ACCT\MAIL\INV820.DAT 1401,PERISH STORE INV,Quarterly,(V) INV820C USER NAME (PERISHABLE INVE +NTORY-BAKERY) 1401,PERISH STORE INV,Quarterly,(QUARTERLY RUN ONLY) 1402,PERSH INV BOOKS,Quarterly,(V) INV805 58 COPIES - USER NAME 1402,PERSH INV BOOKS,Quarterly,(V) INV805 2 COPIES - USER NAME 1402,PERSH INV BOOKS,Quarterly,(V) INV805A XTRA COPIES-SAVE IN COMPUTE +R ROOM 1402,PERSH INV BOOKS,Quarterly,ANNUAL STORE INVENTORY ONLY: 1402,PERSH INV BOOKS,Quarterly,(V) INV805 58 COPIES - USER NAME 1402,PERSH INV BOOKS,Quarterly,(V) INV805 2 COPIES - USER NAME 1402,PERSH INV BOOKS,Quarterly,(V) INV805A XTRA COPIES-SAVE IN COMPUTE +R ROOM 1403,BAKERY INV BOOKS,Quarterly,(V) INV805 35 COPIES - USER NAME 1403,BAKERY INV BOOKS,Quarterly,(V) INV805A 5 COPIES - SAVE IN COMPUTE +R ROOM 1403,BAKERY INV BOOKS,Quarterly,ANNUAL STORE INVENTORY ONLY: 1403,BAKERY INV BOOKS,Quarterly,(J) INV805 35 COPIES - USER NAME 1403,BAKERY INV BOOKS,Quarterly,(J) INV805A 5 COPIES - SAVE IN COMPUTE +R ROOM 1501,INV PRICE GUIDE,As Needed,(V) MSC052 BOTH COPIES ARE TO BE PUT IN + BINDERS AND 1501,INV PRICE GUIDE,As Needed,AND LEFT ON THE CABINETS BEHIND THE DP 1501,INV PRICE GUIDE,As Needed,SECRETARY'S DESK
    Update:

    In retrospect i.e. whilst watching my daughter learning to swim, it occurred to me that I probably ought to recommend the use of the print() &/or print_row() methods (of Text::xSV) to generate the output.

    A user level that continues to overstate my experience :-))
Re: Generating an output file
by vitoco (Hermit) on Sep 22, 2009 at 18:45 UTC

    This is valid when there is no quoted fields with embeded commas:

    while (<INPUT>) { chomp; my ($pref,$data) = (/^((?:[^,]*,){3})(.*)/); print OUTPUT "$pref$_\n" for split /,/, $data; }