Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

lightweight CSV parser

by wrinkles (Pilgrim)
on Dec 22, 2011 at 07:42 UTC ( #944731=perlquestion: print w/ replies, xml ) Need Help??
wrinkles has asked for the wisdom of the Perl Monks concerning the following question:

Good Evening Monks,

I have written a simple Movable Type plugin that creates a MT tag modifier which takes the text output from an MT tag, finds and parses a CSV string, then returns the data as a simple (no nested data) hash or array or string which is then captured as an MT template variable. All within an ordinary template tag.

It is on GitHub

The data fields are extracted from the surrounding content and from neighboring data with index and substr functions. I use the split function to parse the CSV into arrays and hashes, which is certainly lightweight, but of course brittle.

I'd appreciate any tips to make the code more robust. Any dependencies should be pure perl so I can just drop them in a extlib directory.

Most of the Text:CSV type modules read from files, so I was unable to make them work with a string variable.

The plugin code follows. Any advice appreciated.

package MoreData::Plugin; use MT 4.2; use warnings; use strict; use MT::Util qw(decode_html); sub moredata { my ($str, $val, $ctx) = @_; # val is the data name and optional form +at (or default format) $str = decode_html($str); my $dataname; # retrieved from $val my $content; # extracted from $str my $datastring; # extracted from $str my $result; # processed from $datastring # retrieve configuration defaults my $config = _moredata_config($ctx); # open tag is required my $opentag = $config->{'moredata_opentag'} ? $config->{'moredata_op +entag'} : ''; die "MoreData requires an open tag: $!" unless ($opentag); my $closetag = $config->{'moredata_closetag'} ? $config->{'moredata_ +closetag'} : ''; # not required # retrieve format and data separator from tag parameters my $format_cfg = $config->{'moredata_format'}; my $format = ''; if (ref($val) eq 'ARRAY') { $dataname = defined($val->[0]) ? $val->[0] : '' ; #dataname can be + empty $format = $val->[1] ? ($val->[1]) : $format_cfg ; } else { $dataname = $val ? $val : ''; $format = $format_cfg } die "No format specified in args or configuration: $!" unless $forma +t; # extract datastrings my $substrings_aref = _retrieve_strings($str, $opentag, $closetag, $ +dataname); unless (scalar($substrings_aref)) { # no stringrefs returned die "cannot extract substrings from string: $!"; } $content = _trim($$substrings_aref[0]); return $content if $dataname eq "__content__"; $datastring = _trim($$substrings_aref[1]); return $datastring if ($dataname eq '__data__'); # return all data +as string return $datastring unless (length($datastring)); # no sense in proce +ssing an empty string my $datasep_cfg = $config->{'moredata_datasep'}; length($datasep_cfg) or die "The data separation string must be conf +igured: $!"; my $hashsep_cfg = $config->{'moredata_hashsep'} || ''; length($hashsep_cfg) or die "The hash separation string must be conf +igured: $!"; # send data and parameters to desired format for result if ($format eq 'array') { $result = _moredata_array($datastring, $datasep_cfg); } elsif ($format eq 'hash') { $result = _moredata_hash($datastring, $datasep_cfg, $hashsep_cfg); } elsif ($format eq 'string') { $result = $datastring; } else { die "no format : $!"; } return $result; } sub _moredata_array { my ($datastring, $datasep_cfg) = @_; my @list = split /\s*$datasep_cfg\s*/, $datastring; scalar @list ? \@list : undef; } sub _moredata_hash { my ($datastring, $datasep_cfg, $hashsep_cfg) = @_; my $datasep = _escape($datasep_cfg); my $hashsep = _escape($hashsep_cfg); my %hash; my @list = split /\s*$datasep\s*/, $datastring; foreach my $item (@list) { my $count = my @array = split(/\s*$hashsep\s*/, $item); die "Count of $count is not a pair of values in hash assignment of + values @array : $!" if ($count != (2 || 0)); $hash{$array[0]} = $array[1]; } scalar %hash ? \%hash : undef; } sub _moredata_config { my $ctx = shift; my $plugin = MT->component("MoreData"); my $blog = $ctx->stash('blog'); if ( !$blog ) { my $blog_id = $ctx->var('blog_id'); $blog = MT->model('blog')->load($blog_id); } my $blog_id = $blog->id; my $scope = "blog:" . $blog_id; my $config = $plugin->get_config_hash($scope); return $config; } # return content and data strings from string and tags sub _retrieve_strings { my ($str, $opentag, $closetag, $dataname) = @_; my $content = $str; my $datastring = ''; # content is all non-data, datastring all data my @substrings = ($content, $datastring); die "MoreData requires an open tag in plugin configuration: $!" unle +ss (length($opentag)); # check that we have a string with length. my $stringlength = length($str); return \@substrings unless ($stringlength); # return unless opentag is found in string my $openposition = index($str,$opentag); # start of _all_ the data return \@substrings if ($openposition == -1); # close tag cannot be subset of open tag die "MoreData close tag cannot be substring of the open tag: $!" unless (index($opentag,$closetag) == -1); # extract the content and data strings my $closeposition; # end of _all_ the data if ((length($closetag)) && (rindex($str,$closetag) != -1)) { # close +tag found $closeposition = index($str,$closetag); } else { $closeposition = $stringlength; } my $datalength = $closeposition - $openposition; # length of _all_ d +ata die "MoreData open tag must precede the close tag: $!" if ($dataleng +th < 0); #tags in wrong order; $content = substr($str, 0, $openposition) . substr($str, ($closeposi +tion + length($closetag)), $stringlength); #length argument can be be +yond the end # data includes open tag but not close tag $datastring = substr($str, $openposition, $datalength); # _all_ data @substrings = ($content, $datastring); return \@substrings if ($dataname eq '__data__'); # return content a +nd data as strings return \@substrings if ($dataname eq '__content__'); # return conten +t and data as strings return \@substrings unless length($datastring); # no reason to proce +ss empty datastring # search for named daatastring my $nameopentag = $opentag . $dataname . '='; # selected data requir +es the equal sign appended # locate the desired selected data within the datastring my $nameopenposition = index($datastring, $nameopentag); if ($nameopenposition == -1) { # if named data can't be found, retur +n substrings with empty datastring @substrings = ($content, ''); return \@substrings; } my $length_tagname = length($nameopentag); my $namecloseposition; if (index($datastring, $opentag, $nameopenposition + 1) != -1 ) { # + if we find another opentag following current one $namecloseposition = index($datastring, $opentag, $nameopenpositio +n + 1); # find next opentag } elsif (index($datastring, $closetag, $nameopenposition + 1) != -1 +) { # otherwise use closing tag $namecloseposition = index($datastring, $closetag, $nameopenpositi +on + 1); } else { # else just use end of file $namecloseposition = $datalength; } $datalength = $namecloseposition - $nameopenposition - $length_tagna +me; my $datastring = substr($datastring, $nameopenposition + $length_tag +name, $datalength); @substrings = ($content, $datastring); return \@substrings; } sub _escape { my $separator = shift; my $regex = '\\' . join('\\', split(/''/, $separator) ); return $regex; } # Perl trim function to remove whitespace from the start and end of th +e string sub _trim { my $string = shift; $string =~ s/^\s+//; $string =~ s/\s+$//; return $string; } 1; # Copyright 2011 Rick Bychowski # This program is free software; you can redistribute it and/or modify + it # under the terms of either: the GNU General Public License as publish +ed # by the Free Software Foundation; or the Artistic License.

Comment on lightweight CSV parser
Download Code
Re: lightweight CSV parser
by Anonymous Monk on Dec 22, 2011 at 07:50 UTC

    Most of the Text:CSV type modules read from files, so I was unable to make them work with a string variable.

    check the documentation again, because they all work with strings as well, anyway

    open my($fh), '<' , \$stringOfCsv; ...
      That appears to be a filehandle reference, not a regular string. I actually tried initially to use YAML strings, and with YAML::Syck was able to get a prototype working by opening a string reference as a filehandle, then loaded with YAML::Syck::LoadFile($fh) . That was the only YAML parser that worked like that, and when I tried the same thing within Movable Type, it wouldn't read the filehandle. "No such file or directory -type errors. So I gave up.

      It looked simple, but I'm missing some key concept.

      Maybe parsing CSV is easier. I'll give it a try.

        That appears to be a filehandle reference, not a regular string ...

        Well, if you insist http://search.cpan.org/perldoc/Text::CSV#SYNOPSIS only works with files, that is how you treat a regular string ($stringOfCsv), as a filehandle

Re: lightweight CSV parser
by Tux (Monsignor) on Dec 22, 2011 at 09:38 UTC

    As Anonymous Monk already said, parsing from strings is supported from the early days, not only by the PerlIO layer (file handles on scalars are natively supported) but also directly from strings.:

    my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }); # Using ScalarIO open my $fh, "<", \$csv_stream; while (my $row = $csv->getline ($fh)) { # just like a file, but now on a string ... } close $fh; # Using strings directly. Much less reliable! foreach my $line (@csv_strings) { my @row = $csv->parse ($line); }

    Don't try to rewrite a CSV parser. The de-facto parsers Text::CSV_XS and Text::CSV have been tested by millions and already have dozens of options and features as requested by the community.


    Enjoy, Have FUN! H.Merijn
      It works! I've spent the last several days being haunted by The Ghosts of "Don't Write Your Own Crappy Parsers" past, present, and future. There's still time! You, there lad, yes you, go fetch me a Text::CSV module. Merry Christmas!
      Thanks Tux, I converted my split functions to use Text::CSV. Here's the new and much-improved code:
      sub _moredata_array { my ($datastring, $datasep_cfg) = @_; use Text::CSV; my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, sep_char = +> "$datasep_cfg" }); my $io; open ($io, "<:encoding(utf8)", \$datastring) or die "Cannot use CSV: + $!".Text::CSV->error_diag (); my $row = $csv->getline ($io); $row || undef; } sub _moredata_hash { my ($datastring, $hashsep) = @_; use Text::CSV; my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, sep_char = +> "$hashsep" }); my $io; open ($io, "<:encoding(utf8)", \$datastring) or die "Cannot use CSV: + $!".Text::CSV->error_diag (); my %hash; while (my $colref = $csv->getline($io)) { my $count = scalar @{$colref}; die "$count is an odd number of keys and values at @{$colref}.\n" +if ($count%2); $hash{$colref->[0]} = $colref->[1]; } scalar %hash ? \%hash : undef; }

        As you asked for generic comment in your OP, a few remarks

        # modules are usually not "use"d inside subs, but just once at the cod +e start use Text::CSV; sub _moredata_array { my ($datastring, $datasep_cfg) = @_; # Do not/never quote variables if you do not explicitly want to fo +rce PV context # e.g. if $datasep_cfg is undefined is means something completely +different then # the empty string which it is forced to when quoted my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, sep_char +=> $datasep_cfg }); # you can declare $io as lexical (my) in the open call itself # At this point,Text::CSV->error_diag () is undefined, as $csv has + not even been used # and opening of the scalar as file handle does not interact with +$csv (yet) open my $io, "<:encoding(utf8)", \$datastring or die "Cannot use C +SV: $!"; # no need to declare a variable # the return value can possible be a reference to an empty list. t +hink if you want # that. || undef is useless here return $csv->getline ($io); } sub _moredata_hash { my ($datastring, $hashsep) = @_; my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, sep_char +=> $hashsep }); open my $io, "<:encoding(utf8)", \$datastring or die "Cannot use C +SV: $!"; my %hash; while (my $row = $csv->getline($io)) { my $count = scalar @{$row}; $count % 2 and die "$count is an odd number of keys and values + at (@{$row}).\n"; $hash{$row->[0]} = $row->[1]; } scalar %hash ? \%hash : undef; }

        Enjoy, Have FUN! H.Merijn
Re: lightweight CSV parser
by jwkrahn (Monsignor) on Dec 22, 2011 at 13:10 UTC
    Any advice appreciated.

    Ok, here goes.



    17 my $opentag = $config->{'moredata_opentag'} ? $config->{'mor +edata_opentag'} : ''; 19 my $closetag = $config->{'moredata_closetag'} ? $config->{'m +oredata_closetag'} : ''; # not required 24 $dataname = defined($val->[0]) ? $val->[0] : '' ; #datanam +e can be empty 25 $format = $val->[1] ? ($val->[1]) : $format_cfg ; 27 $dataname = $val ? $val : '';

    Those are usually written as:

    17 my $opentag = $config->{ moredata_opentag } || ''; 19 my $closetag = $config->{ moredata_closetag } || ''; # not r +equired 24 $dataname = @$val ? $val->[0] : '' ; #dataname can be empt +y 25 $format = $val->[1] || $format_cfg ; 27 $dataname = $val || '';

    And:

    17 my $opentag = $config->{'moredata_opentag'} ? $config->{'mor +edata_opentag'} : ''; 18 die "MoreData requires an open tag: $!" unless ($opentag); 32 my $substrings_aref = _retrieve_strings($str, $opentag, $clo +setag, $dataname); 33 unless (scalar($substrings_aref)) { # no stringrefs returned 34 die "cannot extract substrings from string: $!"; 35 } 41 my $datasep_cfg = $config->{'moredata_datasep'}; 42 length($datasep_cfg) or die "The data separation string must + be configured: $!"; 43 my $hashsep_cfg = $config->{'moredata_hashsep'} || ''; 44 length($hashsep_cfg) or die "The hash separation string must + be configured: $!";

    Would probably be better written as:

    17 my $opentag = $config->{ moredata_opentag } or die "MoreData + requires an open tag\n"; 32 my $substrings_aref = _retrieve_strings($str, $opentag, $clo +setag, $dataname) or die "cannot extract substrings from string\n"; 41 my $datasep_cfg = $config->{ moredata_datasep } or die "The +data separation string must be configured\n"; 43 my $hashsep_cfg = $config->{ moredata_hashsep } or die "The +hash separation string must be configured\n";


    18 die "MoreData requires an open tag: $!" unless ($opentag); 30 die "No format specified in args or configuration: $!" unles +s $format; 33 unless (scalar($substrings_aref)) { # no stringrefs returned 34 die "cannot extract substrings from string: $!"; 35 } 42 length($datasep_cfg) or die "The data separation string must + be configured: $!"; 44 length($hashsep_cfg) or die "The hash separation string must + be configured: $!"; 52 } else { 53 die "no format : $!"; 54 }

    The $! variable only contains useful information directly after the use of a function that accesses the system.    A mathematical, string or boolean expression has no effect on it.

    $substrings_aref is a scalar variable and it is used in a boolean context so the use of scalar there is completely superfluous.



    72 die "Count of $count is not a pair of values in hash assig +nment of values @array : $!" if ($count != (2 || 0));

    Because 2 is a TRUE value the expression is just if $count != 2.    Or did you really want the expression: if $count != 2 || $count != 0?



    98 my @substrings = ($content, $datastring); 102 return \@substrings unless ($stringlength); 105 return \@substrings if ($openposition == -1); 121 @substrings = ($content, $datastring); 122 return \@substrings if ($dataname eq '__data__'); # return c +ontent and data as strings 123 return \@substrings if ($dataname eq '__content__'); # retur +n content and data as strings 124 return \@substrings unless length($datastring); # no reason +to process empty datastring 130 @substrings = ($content, ''); 131 return \@substrings; 144 @substrings = ($content, $datastring); 145 return \@substrings;

    You don't need an array to return an array reference:

    102 return [$content, $datastring] unless ($stringlength); 105 return [$content, $datastring] if ($openposition == -1); 122 return [$content, $datastring] if ($dataname eq '__data__'); + # return content and data as strings 123 return [$content, $datastring] if ($dataname eq '__content__ +'); # return content and data as strings 124 return [$content, $datastring] unless length($datastring); # + no reason to process empty datastring 131 return [$content, '']; 145 return [$content, $datastring];


    148 sub _escape { 149 my $separator = shift; 150 my $regex = '\\' . join('\\', split(/''/, $separator) ); 151 return $regex; 152 }

    I am not sure what this subroutine is supposed to be doing but maybe you should have a look at the quotemeta function or the "\Q...\E" quotemeta escape sequence.    For example:

    64 sub _moredata_hash { 65 my ($datastring, $datasep, $hashsep) = @_; 68 my %hash; 69 my @list = split /\s*\Q$datasep\E\s*/, $datastring; 70 foreach my $item (@list) { 71 my $count = my @array = split(/\s*\Q$hashsep\E\s*/, $item) +; 72 die "Count of $count is not a pair of values in hash assig +nment of values @array : $!" if ($count != (2 || 0)); 73 $hash{$array[0]} = $array[1]; 74 } 75 scalar %hash ? \%hash : undef; 76 }


      Thanks, that was a huge help and very generous of you. Here's what I take away:

      • Don't use the ternary operator when the value tested is the value to be set. Use the || operator.
      • $! contains system call information.
      • It's often better to create an anonymous array directly.
      • Use the quotemeta operator for interpolating strings into regular expressions.
      Thanks a lot!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://944731]
Approved by Corion
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2014-10-25 06:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (142 votes), past polls