Given small CSV data files or big(ger) CSV data files with a filter
so that all of the data fits into memory, the Text::CSV_XS'
csv function will most
likely accomodate the common usage:
use Text::CSV_XS qw( csv );
my $aoa = csv (in => "file.csv");
This function also supports the common attributes for
new:
my $aoa = csv (in => "file.csv", sep_char => ";");
or even with shortcuts and aliasses:
my $aoa = csv (in => "file.csv", sep => ";");
If there is lots to process inside each row, not all rows would
fit into memory, or the callback structure and options for csv
will obscure the code, reverting to the low level interface is the
only way to go:
use autodie;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new (
binary => 1,
auto_diag => 1,
sep_char => ";",
});
open my $fh, "<", "file.csv";
while (my $row = $csv->getline ($fh)) {
# do something with the row
}
close $fh;
Quite often a CSV data source has got one header line that holds
the column names, which is easy to ask for in the csv funtion:
# Default: return a list of lists (rows)
my $aoa = csv (in => "file.csv");
# Using the header line: return a list of hashes (records)
my $aoh = csv (in => "file.csv", headers => "auto");
Or in low-level
open my $fh, "<", "file.csv";
my @hdr = @{$csv->getline ($fh)};
$csv->column_names (@hdr);
while (my $row = $csv->getline_hr ($fh)) {
...
This week I was confronted with a set of CSV files where the separator
character was changing based on the content of the file. Oh, the horror!
If the CSV file was expected to contain amounts, the program that did the
export chose to use a ; separator and in other cases it used the
default ,. IMHO the person that decided to do this should be fired
without even blinking the eye.
This implied that on opening the CSV data stream, I - as a consumer -
had to know in advance what this specific file would be like. Which made
me come up with a new thought:
"If a CSV stream is supposed to have a header line that definess
the column names, it is (very) unlikely that the column names will contain
unpleasant characters like embedded newlines, semi-colons, or comma's.
Remember, these are column names, not data rows. Not that it is prohibited
to have header fields that have comma's or other non-word characters, but
let us assume that it is uncommon enough to warrant support for easy of
use."
So I wanted to convert this:
open my $fh, "<", "file.csv";
my @hdr = @{$csv->getline ($fh)};
$csv->column_names (@hdr);
while (my $row = $csv->getline_hr ($fh)) {
where the $csv instance has to know what the separator is, to
open my $fh, "<", "file.csv";
my @hdr = $csv->header ($fh);
$csv->column_names (@hdr);
while (my $row = $csv->getline_hr ($fh)) {
which will do the same, but also detect and set the separator.
where the new header method will read the first line of the
already opened stream, detect the separator based on a default list of
allowed separators, use the detected sparator to set sep_char for
given $csv instance and use it to parse the line and return the
result as a list.
As this came to me as common practice, before you parse the rest of
your CSV, I came up with a local method (not (yet) in Text::CSV_XS) that
does this for me:
sub Text::CSV_XS::header {
my ($csv, $fh, $seps) = @_;
my $hdr = lc <$fh> or return;
foreach my $sep (@{$seps || [ ";", "," ]}) {
index $hdr, $sep < 0 and next;
$csv->sep_char ($sep);
last;
}
open my $h, "<", \$hdr;
my $row = $csv->getline ($h);
close $h;
@{$row // []};
} # Text::CSV_XS::header
it even has some documentation :)
=head2 $csv->header ($fh)
Return the CSV header and set C<sep_char>.
my @hdr = $csv->header ($fh);
my @hdr = $csv->header ($fh, [ ";", ",", "|", "\t" ]);
Assuming that the file opened for parsing has a header, and the header
does not contain problematic characters like embedded newlines, read
the first line from the open handle, auto-detect whether the header
separates the column names with a character from the allowed separator
list. That list defaults to C<[ ";", "," ]> and can be overruled with
an optional second argument. If any of the allowed separators matches
(checks are done in order), set C<sep_char> to that sequence for the
current CSV_XS instance and use it to parse the first line and return
it as an array where all fields are mapped to lower case:
my $csv = Text::CSV_XS->new ({ binary => 1, auto_diag => 1 });
open my $fh, "<:encoding(iso-8859-1)", "file.csv";
my @hdr = $csv->header ($fh) or die "file.csv has no header line\n";
# $csv now has the correct sep_char
while (my $row = $csv->getline ($fh)) {
...
}
After two days of intensive use, I thought this might be useful to
add to Text::CSV_XS so we all can profit, but I want to get it
right from the start, so I ask for feedback (already got some from
our local PM group)
Let the bikeshedding commence ...
- Is this functionality useful enough to add at all
- is $csv->header a useful method name (remember we also
have low level methods to deal with hashes, like
$csv->column_names)
- Is the proposed API sufficient
- Do you see any shortcomings
Things I envision in this function is to also auto-detect
encoding when the line includes a BOM and set it to the stream using
binmode or have some option to allow this new method to not
only return the headers, but use them to set the column names:
#---
my $data = "foo,bar\r\n1,baz\r\n";
open my $fh, "<", \$data;
my @hdr = $csv->header ($fh); # ("foo", "bar")
while (my $row = $csv->getline ($fh)) {
# $row = [ "1", "baz" ]
#---
my $data = "foo;bar\r\n1;baz\r\n";
open my $fh, "<", \$data;
my @hdr = $csv->header ($fh); # ("foo", "bar")
$csv->column_names (@hdr);
while (my $row = $csv->getline_hr ($fh)) {
# $row = { foo => "1", bar => "baz" }
#---
my $data = "foo|bar\r\n1|baz\r\n";
open my $fh, "<", \$data;
$csv->column_names ($csv->header ($fh, [ ";", ",", "|" ]));
while (my $row = $csv->getline_hr ($fh)) {
# $row = { foo => "1", bar => "baz" }
Enjoy, Have FUN! H.Merijn
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.