Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

creating array of hashes from input file

by chimiXchanga (Novice)
on Mar 07, 2017 at 13:34 UTC ( #1183825=perlquestion: print w/replies, xml ) Need Help??

chimiXchanga has asked for the wisdom of the Perl Monks concerning the following question:

So there's this sample data I have and I'm trying to parse it into a csv. What's the best approach of doing it? I was trying to build an array of hashes but I can't think of a proper way of capturing just the data I need. Thank you!
Page 1 of 3 100 Civic Center Way Calabasas, California 91302 7/12/2012 9:21:02AM MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT Jun 2012 Permit Issued Address Description 06/01/2012 26166 ROYMOR DR Upgrade panel from 100 amp to 200 amp 06/04/2012 24956 NORMANS WAY (6) light fixtures @ patio; (3) branch circuits; (4) electric heaters 06/05/2012 4273 VICASA DR Construct 339 SF Covered Loggia

Replies are listed 'Best First'.
Re: creating array of hashes from input file
by Eily (Monsignor) on Mar 07, 2017 at 13:42 UTC

    I can't think of a proper way of capturing just the data I need
    First remove the data you don't need, et voilą!

    You'll have to be more specific than that, we can't guess what is useful information for you. I can't even understand how you're supposed to read your data. See How do I post a question effectively? Be aware that you shouldn't expect to be given a fully functional solution, but rather advice, and maybe corrections. So you'll have to show some work.

Re: creating array of hashes from input file
by Corion (Patriarch) on Mar 07, 2017 at 13:49 UTC

    My approach when parsing lists depends on the nature of the list.

    If I can find out from looking at a single line what kind it is, then I use regular expressions to fill out a hash and flush the record whenever a new set starts.

    If I can't find out from looking at a single line what kind it is, I use counters or flags to know what line I am on.

    In your case, it looks to me as if you basically have a report with some header data and then three payload lines, the issuance date, the address and the description. One ugly thing seems to be that the address and the permission can span multiple lines, but from the unrepresentative example you have posted, each item seems to be delimited by a blank line from the previous item.

    Going from these assumptions, my approach would be something like the following (untested):

    #!perl -w use strict; use Data::Dumper; # Output a row of information sub flush { my( $record ) = @_; if( $record->{permit} ) { print Dumper $record; }; delete $record->{permit}; }; # This will collect all information for one entry: my %info; my $last_page; my $expected_pages; my $record_kind; my %next_record = ( address => 'description', description => undef, ); while(<DATA>) { if( m!^Page (\d+) of (\d+)! ) { $last_page = $1; $expected_pages ||= $2; next; }; if( m!^(Jan|Feb|...|Jun|...) (19\d\d|20\d\d)! ) { $info{ report_date } = "$2-$1"; next; }; # ... more code to skip the header left for the reader next if( m!MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT! ); next if( m!^Permit Issued! ); if( m!^(\d\d)/(\d\d)/((?:19|20)\d\d)$! ) { flush(\%info); $info{ permit } = "$3-$2-$1"; $record_kind = 'address'; <>; # skip empty line next }; # Fast-forward until the next set of lines if( $record_kind ) { while( <> !~ /^\s*$/ ) { s!\s*$!!; $info{ $record_kind } .= " " . $_; }; $record_kind = $next_record{ $record_kind }; } else { die "Unknown line [$_] on line $."; }; }; warn "Uhoh - expected $expected_pages but only read up to $last_page" if( $expected_pages != $last_page ); __DATA__ Page 1 of 3 100 Civic Center Way Calabasas, California 91302 7/12/2012 9:21:02AM MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT Jun 2012 Permit Issued Address Description 06/01/2012 26166 ROYMOR DR Upgrade panel from 100 amp to 200 amp 06/04/2012 24956 NORMANS WAY (6) light fixtures @ patio; (3) branch circuits; (4) electric heaters 06/05/2012 4273 VICASA DR Construct 339 SF Covered Loggia
      that looks beautiful! may I kindly ask, would this layout would be an easier an approach to parse? And if so, how would you approach joining the data on the next line to the proper column?
      100 Civic Center Way + Page 1 of 3 Calabasas, California 91302 + 7/12/2012 9:21:02AM MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT Jun 2012 Permit Issued Address Description 06/01/2012 26166 ROYMOR Upgrade panel fr +om 100 amp to 200 amp DR 06/04/2012 24956 NORMANS (6) light fixtur +es @ patio; (3) branch circuits; (4) electric heaters WAY 06/05/2012 4273 VICASA Construct 339 SF + Covered Loggia DR 06/07/2012 26011 ALIZIA CANYON R/R (1) <100K BT +U Furnace in garage; () <100K BTU condenser outsid +e (NO DUCTS TO BE CHANGED OUT) DR E 06/07/2012 4240 LOST HILLS R/R (7) windows +(like for like) AT LEAST ONE PANE MUST BE TEMPERED RD 1503 06/08/2012 3574 ELM Construct Retain +ing Wall in front of (E) retaining wall: 4 1/2' average x 3 +2 LF = approx. 144 SF DR 06/13/2012 4026 TOWHEE Construct a 460 +SF Pool/49 SF Spa DR
        would this layout would be an easier an approach to parse

        Maybe, depends on what pages 2 and 3 are like. Are you parsing many reports all formatted the same ?. Or is the 3 pages just a small sample of the real report.

        #!perl use strict; use Data::Dumper; my $infile = 'report1.txt'; # date address description my $fmt = "A16 A38 A*"; my @data = (); my $recno = -1; my $flag = 0; open IN,'<',$infile or die "$infile $!"; while (<IN>){ chomp; next unless /\S/; $flag = 0 if /Page \d of \d/; my ($date,$addr,$desc) = unpack $fmt,$_; if ( $date =~ /\d\d.\d\d.20\d\d/ ){ $flag = 1; ++$recno; $data[$recno] = [ $date,$addr,$desc ]; } elsif ($flag) { $data[$recno][1] .= ' '.$addr if $addr; $data[$recno][2] .= ' '.$desc if $desc; } } close IN; print scalar(@data)." records read\n"; print Dumper \@data;
        poj

        Here, instead of a regex I would likely use unpack with the appropriate template(s).

        Append to %info until you encounter an empty line, then flush.

Re: creating array of hashes from input file
by huck (Prior) on Mar 07, 2017 at 23:37 UTC

    Corion made good points, but it needs a flush(\%info); at the end.

    Here is a more non form dependent single state driven method. I like state machines too! It also fixes some common excel problems noted.

    #!perl -w use strict; use Data::Dumper; # delimiter style , pick one my $delim; # this mode produces comma delimted files, special magic needs to be a +dded to output $delim=','; # this produces tab delimted files, fields just cant contain a tab # the output needs to have a .txt filetype and be opened with excel # or another filetype (say .tdf) if that filetype is assigned to be op +ened by excel # i like tdf better, but you said csv # $delim="\t"; sub delimfix{ # excel has problems with text files # something excell may think as a date but should be text # 2012-12-12 # 2/4 # something excell may think is a number but is text # 1e123 # leading zeros as text # 00001 # leading equals as text # =x # (leading equals-dquote, training dquote) fixes these problems (the +equals function) # There still exists a long string problem with ="..." # |001-----0|010-----0|020-----0|030-----0|040-----0|050-----0|060- +----0|070-----0|080-----0|090-----0|101-----0|110-----0|120-----0|130 +-----0|140-----0|150-----0|160-----0|170-----0|180-----0|190-----0|20 +1-----0|210-----0|220-----0|230-----0|240-----0|250-----0|260-----0|2 +70-----0|280-----0|290-----0 my $field=shift; if ($field=~m!"!) { $field=~s!"!""!g; } if ($delim eq ',' && $field=~m!,!) { # csv also has problems with commas in string # but if it has a comma it doesnt have the other problems return '"'.$field.'"'; } return '="'.$field.'"'; }; sub infoprint { # Output a row of information my($title,$record) = @_; # print $title." data\n"; print Dumper $record; my @delimline; if (lc($title) eq 'top') { print delimfix($record->{title})."\n"; print "\n"; for my $f (split("\t",$record->{getheaders})) { push @delimline, +delimfix($f)} } else { if ($record->{permitdate}=~m!^\d+/\d+/\d+!) { push @delimline,$record->{permitdate}; } else { push @delimline,delimfix($record->{permitdate}); } push @delimline,delimfix($record->{address}); push @delimline,delimfix($record->{description}); } print join($delim,@delimline)."\n"; }; # used in getheaders to add tabs my $linejoin=' '; # This will collect all information for one entry: my %info; # what is in %info my $infotype='Top'; my %topinfo; # state machine data my $record_kind='none'; my $last_record_blank=1; my %next_record = ( # state table was => now none => 'page', page => 'topaddr', topaddr => 'rundate', rundate => 'title', title => 'getheaders', getheaders => 'getheaders', # hold state till manual reset + founddesc => 'permitdate', permitdate => 'address', address => 'description', description => 'permitdate', ); while(<DATA>) { chomp; s!\t! !g; # can have tabs in fields , kill them if ( m!^\s*$!) { # blanks dont matter $last_record_blank=1; next; } elsif ($last_record_blank) { # new data so go to next state $record_kind = $next_record{ $record_kind }; if ($record_kind eq 'permitdate') { # move to permitdate means + dump full data infoprint($infotype,\%info); %info=(); $infotype='Permit'; } $last_record_blank=0; } if (defined($info{ $record_kind })) { $info{ $record_kind } .= $linejoin . $_; } else { $info{ $record_kind } = $_; } if ($record_kind eq 'getheaders') { if ( m!^Description! ) { $record_kind='founddesc'; # manual move to next state $linejoin=' '; %topinfo=%info; } else {$linejoin="\t"; } } }; infoprint('Permit',\%info,); print "------\n"; print delimfix($topinfo{page})."\n"; print delimfix($topinfo{topaddr})."\n"; print delimfix($topinfo{rundate})."\n"; __DATA__ Page 1 of 3 100 Civic Center Way Calabasas, California 91302 7/12/2012 9:21:02AM "hi" MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT Jun 2012 Permit Issued Address Description 06/01/2012 26166 ROYMOR DR Upgrade panel from 100 amp to 200 amp 06/04/2012 24956 NORMANS WAY (6) light fixtures @ patio; (3) branch circuits; (4) electric heaters 06/05/2012 4273 VICASA DR Construct 339 SF Covered Loggia

      I played with the above so much i factored out the array of hashs to make the output file instead after putting it in, So the following is refactored again to construct the AoH, THEN make the output file from it.

      #!perl -w use strict; use Data::Dumper; # all fo them my $aoh=[]; # used in getheaders to add tabs my $linejoin=' '; # This will collect all information for one entry: my $infohash={type=>'top'}; # state machine data my $record_kind='none'; my $last_record_blank=1; my %next_record = ( # state table was => now none => 'page', page => 'topaddr', topaddr => 'rundate', rundate => 'title', title => 'getheaders', getheaders => 'getheaders', # hold state till manual reset + founddesc => 'permitdate', permitdate => 'address', address => 'description', description => 'permitdate', ); while(<DATA>) { chomp; s!\t! !g; # can have tabs in fields, kill them if ( m!^\s*$!) { # blanks dont matter $last_record_blank=1; next; } elsif ($last_record_blank) { # new data so go to next state $record_kind = $next_record{ $record_kind }; if ($record_kind eq 'permitdate') { # move to permitdate means + dump full data push @$aoh,$infohash; $infohash={type=>'permit'}; } $last_record_blank=0; } if (defined($infohash->{ $record_kind })) { $infohash->{ $record_kind } .= $linejoin . $_; } else { $infohash-> { $record_kind } = $_; } if ($record_kind eq 'getheaders') { if ( m!^Description! ) { $record_kind='founddesc'; # manual move to next state $linejoin=' '; } else {$linejoin="\t"; } } }; push @$aoh,$infohash; print Dumper($aoh); ######################################################### # make csv ######################################################### # delimiter style , pick one my $delim; # this mode produces comma delimted files, special magic needs to be a +dded to output $delim=','; # this produces tab delimted files, fields just cant contain a tab # the output needs to have a .txt filetype and be opened with excel # or another filetype (say .tdf) if that filetype is assigned to be op +ened by excel # i like tdf better, but you said csv # $delim="\t"; for my $hash (@$aoh) { infoprint ($hash->{type},$hash); } infoprint ('bot',@{$aoh}[0]); sub delimfix{ # excel has problems with text files # something excell may think as a date but should be text # 2012-12-12 # 2/4 # something excell may think is a number but is text # 1e123 # leading zeros as text # 00001 # leading equals as text # =x # (leading equals-dquote, training dquote) fixes these problems (the +equals function) # There still exists a long string problem with ="..." # |001-----0|010-----0|020-----0|030-----0|040-----0|050-----0|060- +----0|070-----0|080-----0|090-----0|101-----0|110-----0|120-----0|130 +-----0|140-----0|150-----0|160-----0|170-----0|180-----0|190-----0|20 +1-----0|210-----0|220-----0|230-----0|240-----0|250-----0|260-----0|2 +70-----0|280-----0|290-----0 my $field=shift; if ($field=~m!"!) { $field=~s!"!""!g; } if ($delim eq ',' && $field=~m!,!) { # csv also has problems with commas in string # but if it has a comma it doesnt have the other problems return '"'.$field.'"'; } return '="'.$field.'"'; }; sub infoprint { # Output a row of information my($title,$record) = @_; # print $title." data\n"; print Dumper $record; my @delimline; if (lc($title) eq 'top') { print delimfix($record->{title})."\n"; print "\n"; for my $f (split("\t",$record->{getheaders})) { push @delimline, +delimfix($f)} } elsif (lc($title) eq 'bot') { print "------\n"; print delimfix($record->{page})."\n"; print delimfix($record->{topaddr})."\n"; print delimfix($record->{rundate})."\n"; } else { if ($record->{permitdate}=~m!^\d+/\d+/\d+!) { push @delimline,$record->{permitdate}; } else { push @delimline,delimfix($record->{permitdate}); } push @delimline,delimfix($record->{address}); push @delimline,delimfix($record->{description}); } print join($delim,@delimline)."\n"; }; __DATA__ Page 1 of 3 100 Civic Center Way Calabasas, California 91302 7/12/2012 9:21:02AM "hi" MONTHLY EXTERNAL MODIFICATIONS PERMITS REPORT Jun 2012 Permit Issued Address Description 06/01/2012 26166 ROYMOR DR Upgrade panel from 100 amp to 200 amp 06/04/2012 24956 NORMANS WAY (6) light fixtures @ patio; (3) branch circuits; (4) electric heaters 06/05/2012 4273 VICASA DR Construct 339 SF Covered Loggia

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1183825]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (2)
As of 2023-02-03 23:47 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    I prefer not to run the latest version of Perl because:







    Results (30 votes). Check out past polls.

    Notices?