Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Given small CSV data files or big(ger) CSV data files with a filter so that all of the data fits into memory, the Text::CSV_XS' csv function will most likely accomodate the common usage:

use Text::CSV_XS qw( csv ); my $aoa = csv (in => "file.csv");

This function also supports the common attributes for new:

my $aoa = csv (in => "file.csv", sep_char => ";");

or even with shortcuts and aliasses:

my $aoa = csv (in => "file.csv", sep => ";");

If there is lots to process inside each row, not all rows would fit into memory, or the callback structure and options for csv will obscure the code, reverting to the low level interface is the only way to go:

use autodie; use Text::CSV_XS; my $csv = Text::CSV_XS->new ( binary => 1, auto_diag => 1, sep_char => ";", }); open my $fh, "<", "file.csv"; while (my $row = $csv->getline ($fh)) { # do something with the row } close $fh;

Quite often a CSV data source has got one header line that holds the column names, which is easy to ask for in the csv funtion:

# Default: return a list of lists (rows) my $aoa = csv (in => "file.csv"); # Using the header line: return a list of hashes (records) my $aoh = csv (in => "file.csv", headers => "auto");

Or in low-level

open my $fh, "<", "file.csv"; my @hdr = @{$csv->getline ($fh)}; $csv->column_names (@hdr); while (my $row = $csv->getline_hr ($fh)) { ...

This week I was confronted with a set of CSV files where the separator character was changing based on the content of the file. Oh, the horror! If the CSV file was expected to contain amounts, the program that did the export chose to use a ; separator and in other cases it used the default ,. IMHO the person that decided to do this should be fired without even blinking the eye.

This implied that on opening the CSV data stream, I - as a consumer - had to know in advance what this specific file would be like. Which made me come up with a new thought:

"If a CSV stream is supposed to have a header line that definess the column names, it is (very) unlikely that the column names will contain unpleasant characters like embedded newlines, semi-colons, or comma's. Remember, these are column names, not data rows. Not that it is prohibited to have header fields that have comma's or other non-word characters, but let us assume that it is uncommon enough to warrant support for easy of use."

So I wanted to convert this:

open my $fh, "<", "file.csv"; my @hdr = @{$csv->getline ($fh)}; $csv->column_names (@hdr); while (my $row = $csv->getline_hr ($fh)) {

where the $csv instance has to know what the separator is, to

open my $fh, "<", "file.csv"; my @hdr = $csv->header ($fh); $csv->column_names (@hdr); while (my $row = $csv->getline_hr ($fh)) {

which will do the same, but also detect and set the separator.

where the new header method will read the first line of the already opened stream, detect the separator based on a default list of allowed separators, use the detected sparator to set sep_char for given $csv instance and use it to parse the line and return the result as a list.

As this came to me as common practice, before you parse the rest of your CSV, I came up with a local method (not (yet) in Text::CSV_XS) that does this for me:

sub Text::CSV_XS::header { my ($csv, $fh, $seps) = @_; my $hdr = lc <$fh> or return; foreach my $sep (@{$seps || [ ";", "," ]}) { index $hdr, $sep < 0 and next; $csv->sep_char ($sep); last; } open my $h, "<", \$hdr; my $row = $csv->getline ($h); close $h; @{$row // []}; } # Text::CSV_XS::header

it even has some documentation :)

=head2 $csv->header ($fh) Return the CSV header and set C<sep_char>. my @hdr = $csv->header ($fh); my @hdr = $csv->header ($fh, [ ";", ",", "|", "\t" ]); Assuming that the file opened for parsing has a header, and the header does not contain problematic characters like embedded newlines, read the first line from the open handle, auto-detect whether the header separates the column names with a character from the allowed separator list. That list defaults to C<[ ";", "," ]> and can be overruled with an optional second argument. If any of the allowed separators matches (checks are done in order), set C<sep_char> to that sequence for the current CSV_XS instance and use it to parse the first line and return it as an array where all fields are mapped to lower case: my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 }); open my $fh, "<:encoding(iso-8859-1)", "file.csv"; my @hdr = $csv->header ($fh) or die "file.csv has no header line\n"; # $csv now has the correct sep_char while (my $row = $csv->getline ($fh)) { ... }

After two days of intensive use, I thought this might be useful to add to Text::CSV_XS so we all can profit, but I want to get it right from the start, so I ask for feedback (already got some from our local PM group)

Let the bikeshedding commence ...

  • Is this functionality useful enough to add at all
  • is $csv->header a useful method name (remember we also have low level methods to deal with hashes, like $csv->column_names)
  • Is the proposed API sufficient
  • Do you see any shortcomings

Things I envision in this function is to also auto-detect encoding when the line includes a BOM and set it to the stream using binmode or have some option to allow this new method to not only return the headers, but use them to set the column names:

#--- my $data = "foo,bar\r\n1,baz\r\n"; open my $fh, "<", \$data; my @hdr = $csv->header ($fh); # ("foo", "bar") while (my $row = $csv->getline ($fh)) { # $row = [ "1", "baz" ] #--- my $data = "foo;bar\r\n1;baz\r\n"; open my $fh, "<", \$data; my @hdr = $csv->header ($fh); # ("foo", "bar") $csv->column_names (@hdr); while (my $row = $csv->getline_hr ($fh)) { # $row = { foo => "1", bar => "baz" } #--- my $data = "foo|bar\r\n1|baz\r\n"; open my $fh, "<", \$data; $csv->column_names ($csv->header ($fh, [ ";", ",", "|" ])); while (my $row = $csv->getline_hr ($fh)) { # $row = { foo => "1", bar => "baz" }

Enjoy, Have FUN! H.Merijn

In reply to CSV headers. Feedback wanted by Tux

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (6)
As of 2024-04-24 11:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found