chanakya has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm writing a script to parse a csv file (control file) and build a data structure.

I have created the sample script to create the data structure as given below.
I'd like to simplify the parsing procedure im my sample code(currently the function is 100 lines) and make it more robust.

Below is the structure of the control file
Tags,Marker,Description,Currency,Industry,SuperSector,Sector,Sub Curre +ncies FWA,WORLDS,World Index,USD,0001|1000,,0530|1770, FBRIC,AWBAX,All-World BRIC Index,USD,,,, FWBRA,WIBRE,Brazil Index,USD|9LC
The program should build the structure as below:
{ 'SectorCurrency' => '', 'Currency' => 'USD', 'SourceID' => 'WORLDS', 'Name' => 'World Index', 'ExchangeTag' => 'FWA' }; { 'SectorCurrency' => '', 'Currency' => 'USD', 'SourceID' => 'WORLDS', 'IndustryCode' => '0001', 'Name' => 'FTSE World Index', 'ExchangeTag' => 'FWA0001' }; { 'SectorCurrency' => '', 'Currency' => 'USD', 'SourceID' => 'WORLDS', 'IndustryCode' => '1000', 'Name' => 'FTSE World Index', 'ExchangeTag' => 'FWA1000' }; { 'SectorCurrency' => '', 'SectorCode' => '0530', 'Currency' => 'USD', 'SourceID' => 'WORLDS', 'Name' => 'FTSE World Index', 'ExchangeTag' => 'FWA0530' }; { 'SectorCurrency' => '', 'SectorCode' => '1770', 'Currency' => 'USD', 'SourceID' => 'WORLDS', 'Name' => 'FTSE World Index', 'ExchangeTag' => 'FWA1770' }; { 'SectorCurrency' => '', 'Currency' => 'USD', 'SourceID' => 'AWBRIC', 'Name' => 'FTSE All-World BRIC Index', 'ExchangeTag' => 'FBRIC' }; { 'SectorCurrency' => undef, 'Currency' => 'USD', 'SourceID' => 'WIBRA', 'Name' => 'FTSE Brazil Index', 'ExchangeTag' => 'FWBRA' }; { 'SectorCurrency' => undef, 'Currency' => '9LC', 'SourceID' => 'WIBRA', 'Name' => 'FTSE Brazil Index', 'ExchangeTag' => 'FWBRA_L' };
Parsing Rules:
1) Script should create ExchangeTag based on the "Industry" , "Currency", "Sector". ex 1: The first Tag "FWA" in the config file should build ExchangeTags 'FWA', 'FWA0001',
'FWA1000', 'FWA0530', 'FWIA770' (as this "Tag" has Industry and Sectors defined.

ex 2: The third Tag "FWBRA" in config file should build ExchangeTags 'FWBRA' and 'FWBRA_L'
(because there are two currencies USD/9LC). The "L" appended to ExchangeTag is the second character from the Currency "9LC".

Below is the my test code to parse according to the rules
#!/bin/perl use strict; use warnings; my $cfg_file = "/tmp/Allcontrol.csv"; read_control($cfg_file); sub read_control { my ($cfg_file) = @_; ## This is a internal module to parse and giveout data as hashref my $ctrl_data = ImportData(); my @IndxData; foreach (keys %$ctrl_data){ my $ExchangeTag = $ctrl_data->{$_}->{Tag}; my $Marker = $ctrl_data->{$_}->{Marker}; my $Name = $ctrl_data->{$_}->{Name}; my $Currency = $ctrl_data->{$_}->{Currency}; my $Industry = $ctrl_data->{$_}->{Industry}; my $SuperSector = $ctrl_data->{$_}->{SuperSector}; my $Sector = $ctrl_data->{$_}->{Sector}; my $Ind_Sec_Curr= $ctrl_data->{$_}->{Sub Currencies}; # Parse single/multiple currencies if(defined $Currency){ my @Currency = split /\|/, $Currency; foreach my $corecurr (@Currency){ my $row = {}; my $curr = substr($corecurr, 0,1); $curr = substr($corecurr, 1,1) if($curr eq "9"); if($curr eq "U"){ $row->{ExchangeTag} = $ExchangeTag; }else { $row->{ExchangeTag} = $ExchangeTag . "_" . $cu +rr; } $row->{Name} = $Name; $row->{Currency} = $corecurr; $row->{SectorCurrency} = $Ind_Sec_Curr; $row->{SourceID} = $Marker; push @IndxData, $row; } } #Parse single/multiple industries if (defined $Industry and $Industry ne ""){ my @industry = split /\|/, $Industry; foreach my $ind (@industry){ my $row = {}; $row->{Name} = $Name; $row->{Currency} = $Currency; $row->{SectorCurrency} = $Ind_Sec_Curr; $row->{IndustryCode} = $ind; $row->{SourceID} = $Marker; $row->{ExchangeTag} = $ExchangeTag . $ind; push @IndxData, $row; } } #Parse multiple/single Sectors if (defined $Sector and $Sector ne ""){ my @sector = split /\|/, $Sector; foreach my $sect (@sector){ my $row = {}; $row->{Name} = $Name; $row->{Currency} = $Currency; $row->{SectorCurrency} = $Ind_Sec_Curr; $row->{SectorCode} = $sect; $row->{SourceID} = $Marker; $row->{ExchangeTag} = $ExchangeTag . $sect; push @IndxData, $row; } } } die Dumper @IndxData; }

Replies are listed 'Best First'.
Re: Suggestions for simplifying script to parse csv data
by radiantmatrix (Parson) on Jul 27, 2007 at 14:37 UTC

    As with most common tasks in Perl, someone has already written an excellent (and fast!) CSV parser and made it available via the CPAN as Text::CSV_XS. Using that module, it's easy to build a generic solution that parses each row of your CSV file into a hash using names provided on the first row:

    #!/usr/bin/env perl use strict; use warnings; use Text::CSV_XS; use IO::File; my $io = IO::File->new('/tmp/Allcontrol.csv','<') # open file or die("Cannot open data source file: $!"); # or die trying my $csv = Text::CSV_XS->new(); my $head_row = $csv->getline($io); # get first line (headers) my @dataset; while (my $row = $csv->getline($io)) { my %row_hash = map { $headrow->[$_] => $row->[$_] } (0..$#{$headrow +}); push @dataset,\%row_hash; }

    At this point, @dataset is an array of hashes, where each hash represents a row in the database. From that general solution, you should be able to extrapolate your solution. For example, you can manipulate the data before pushing the row into the set, etc.

    Ramblings and references
    The Code that can be seen is not the true Code
    I haven't found a problem yet that can't be solved by a well-placed trebuchet
Re: Suggestions for simplifying script to parse csv data
by graff (Chancellor) on Jul 28, 2007 at 14:53 UTC
    It looks like you might have some mistakes in your explanation (e.g. your first "parsing rule" refers to "FWIA770", but your sample output data contains "FWA1770"), and there are some unexplained details in the data. (E.g. where do output strings like "FTSE" and "AWBRIC" come from? I didn't see these in the input or code as originally posted.)

    And you don't mention whether the ordering of array elements in the output AoH is important -- if the results need to be in a particular order, this affects how complicated or simple the script can get.

    I gather that parsing the CSV input isn't the main thing you're asking about. You want to know how to simplify and shorten the code you posted. The basic rule to apply there is: figure out how to eliminate repeated lines of code. As originally posted, your code contained the same set of assignment statements repeated in two or three different places, and this is most likely unnecessary. The code would be simpler, clearer, and easier to maintain/update if you reorganize or refactor it so that the repetitions aren't needed there are no repetitions.

    And apart from that, don't use a reference to a hash when just a simple hash will do (i.e. replace  $row = {}; with  %row = (); and then when pushing row hashes onto your output array, do it like this:  push @IndxData, { %row };

    Finally, you might decide that some of your temporary-storage variables aren't needed at all (e.g. do  for ( split /\|/, $Currency ) instead of assigning the split to an array just to loop over the array).