Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Comparison of the parsing features of CSV (and xSV) modules

by jZed (Prior)
on Jun 14, 2004 at 16:01 UTC ( #366587=perlmeditation: print w/ replies, xml ) Need Help??

There are at least six CPAN modules that handle "text delimited formats'', more properly called "text separated formats'' such as CSV (Comma Separated Values). The chart below attempts to compare their parsing properties - how they handle and define the data format options. The chart does *not* show other differences between the modules (e.g. that Text::xSV has specialized formating and printing routines, that Text::CSV_XS has routines related to data typing, that the DBDs support DBI/SQL access to the data formats, etc.).

modules

This comparison covers

 Text::CSV
 Text::xSV
 Text::CSV_XS
 DBD::CSV
 DBD::AnyData
 AnyData

disclaimer

I am the maintainer of the last three modules. If I've inadvertently misrepresented any of the modules, it's out of ignorance, please correct me. My congrats to Tilly, Alan Criterman, and Jochen Wiedmann, authors of the other execellent modules on the list



First some definitions:

CSV

Comma Separated Values is not a single standard, it refers to a number of slightly different ways to represent data. There is no "Correct CSV'', only CSV that is correct according to the rules of a particular CSV style. "Classic'' CSV, or the kind that many people think of when they talk about CSV is a set of records separated by newlines with the fields of the records separated by commas and the contents of the fields (in some cases) delimited with double quote marks and with a doubled-double-quote as an escape character within fields. But there is AFAIK, no ISO or ANSI or other international standard definining this "classic'' CSV as the one true CSV. All of the CPAN modules which handle CSV formats allow redefinition of the separator character so the format is really *SV, as it includes "tab delimited'' and "pipe delimited'' formats which simply use tabs or pipes in the place where CSV uses commas.

These words form a comma-SEPARATED, period-TERMINATED record with four quote-DELIMITED fields.

 "Just","Another","CSV","Hacker".

field separator

what goes between fields, a comma in classic CSV but e.g. a tab or pipe in "tab delimited'' or "pipe delimited'' formats

delimiter char

what goes around fields, a pair of double quotes in classic CSV, but some modules allow it to be redifined

escape char

the character used to escape the delimiter when it occurs embedded in a filed, a double-quote in classic CSV (e.g. "this, ''"is''" one field'') but some modules allow it to be redefined (e.g. to a backslash)

record separator

what goes between records, a newline in classic CSV, but some modules allow it to be redifined; this can be critical if you are mixing CSV files created on different operating systems without using something like dos2unix to convert them since the newline is different on different OSs; alternate record separators also allow data in "vertical" formats e.g. where a newline is a field separator and a double newline is a record separator

ability to accept embedded newlines

the ability to use the newline character inside a field, obviously critical if your data has newlines

ability to reject embedded newlines

sometimes this is the desired behaviour, e.g. if you are prepping data for another program which won't accept embedded newlines

accept embedded binary data

the ability to use binary data (e.g. NULL chars or ^L) embedded in fields

reject embedded binary data

again, sometimes this is the desired behaviour - if you are prepping for a program that won't accept binary data, you want the parser to fail on parsing

ability to allow sparse delimiting

classic CSV uses sparse delimiting - it uses delimiters only around fields that need them, e.g. those fields that have embedded commas, newlines, or quotes; with sparse delimiting this is a valid 3-field record: foo,"bar,bop'',7

support for forced delimited writes

but some CSV styles always use delimiters for all fields, so some modules support forcing delimiters onto all fields or onto all non-numeric fields

null differentiated from empty

Text::xSV differentiates between null (undefined values) and an empty string. The other modules treat them the same.

pure perl

some of the modules are pure-perl and therefore can be installed without compilation, others have C/XS componenents and require a compilation on a specific platform; the C/XS modules are generally faster than the pure perl modules


The Comparison Chart

A plus mark indicates the presence of a feabugir (feature or bug or irrelevant, depending on the context), not necessarily that it is "better'' than a minus mark.

Text::CSV Text::xSV Text::CSV_XS DBD::CSVAnyData
accept newlines - + + + *
reject newlines + - + - +
accept embedded binary - + + + +
reject embedded binary + - + - -
forced delimiting + - + - -
sparse delimiting - + + + +
user-defined field sep - + + + +
user-defined delimiter- - + + +
user-defined escape- - + + +
user-defined record sep- - + + +
pure perl+ + - - +
null handling- + - - -

Notes
Some of the modules accept flags which can change their default behaviour, e.g. Text::CSV_XS defaults to rejecting newlines but can easily be set to accept them by passing the "binary'' flag. In these cases, they are shown with plus marks for all possible settings.

DBD::AnyData has the same properties as AnyData (which is a multi-level tied-hash interface to the data), both accept embedded newlines only if something other than newline is used as the record separator

DBD::CSV is actually built on top of Text::CSV_XS but since it uses specific flags for Text::CSV_XS, its parsing properties are somewhat different.

update added readmore tags update2 added null handling

Comment on Comparison of the parsing features of CSV (and xSV) modules
Re: Comparison of the parsing features of CSV (and xSV) modules
by dragonchild (Archbishop) on Jun 14, 2004 at 16:41 UTC
    You missed a few features of Text::xSV
    • It grabs lines from the file based on $/. If you set that differently, it will set the record separator differently.
    • It will allow you to specify whether or not a record has the correct number of fields.
    • It will allow you to retrieve the data based on the column headings in a hash, instead of always in an array
    • It will allow you to write based on a hash, given the column headings, instead of always in an array

    Adding a few features shouldn't be difficult. Specifically:

    • Reject newlines
    • user-defined delimiter
    • user-defined escape
    • forced-delimiting

    tilly?

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      You missed a few features of Text::xSV
      It wasn't my intention to make a full comparison of the features, only of the CSV parsing. Like Text::xSV, AnyData and the DBDs also have many options for handling the resulting data structures, it would take a much lengthier meditation to compare all of those.
      Adding a few features shouldn't be difficult.
      Those features may or may not be worthwhile in Text::xSV. I didn't intend the minus signs in the chart to necessarily indicate something the modules should support. Then again if Tilly does add the user-defined features, I can use Text::xSV as a backend for DBD::AnyData so that there would be a DBI/SQL interface to his excellent module.
        The problem is not adding the features, it is figuring out the API for offering them, and seeing the need.

        Which of the above user-defined features does DBD::AnyData need? All of them? Some? We probably should offline a discussion of how to provide them in a sane manner.

        Some are already present in some form. For instance Text::xSV offers the ability to pre-filter input as it comes in. There is nothing to stop such a filter from saying, "These characters are not allowed." That won't work for eliminating embedded tabs in tab-delimited data. But it would for NUL characters. And if you want to disallow tabs embedded in a tab-delimited file, you could just wrap Text::xSV with the necessary validation logic.

        Also as dragonchild points out, it reads based on $/ which can effectively allow you to change the record separator. Furthermore because I got tired of bug reports about invalid csv files, by default Text::xSV will treat either \n or \r\n as a newline, so files produced on Windows can be read on Unix and vice versa.

        Some of the other missing features can't presently be worked around. For me the barrier isn't adding them if I thought people would use them, it is figuring out what a reasonable API should be.
Re: Comparison of the parsing features of CSV (and xSV) modules
by hossman (Prior) on Jun 14, 2004 at 17:00 UTC

    Great chart, Three suggestions...

    • break out "accept newlines" into two seperate qulifications: "accept newlines with other record seperator" and "accept newlines with newline seperator" ... that will eliminate the need for the "*" and teh special footnote.
    • It might be usefull for people who aren't intimately familiar with *SV to see a simple sample set fo each of the cases described in that table.
    • Given a sample set for each case, a small sample program for each module showing how to parse that data would provide both a good example of how to use the module, as well as provide benchmark critera. You could then concievable replace those "+" signs with timing info.

    (All of which is just to say: I thought about doing this myself and posting as a followup, but then I remembered I'm at work, and I'm planning on leaving early today, so I really can't justify spending that kind of time ... but I thought I'd through it out in case anyone else thought it sounded like a good idea and wanted to take a stab at it.)

Re: Comparison of the parsing features of CSV (and xSV) modules
by perrin (Chancellor) on Jun 14, 2004 at 17:37 UTC
    Nice comparison. A small lowest-common-denominator benchmark would be cool too.
Re: Comparison of the parsing features of CSV (and xSV) modules
by bsb (Priest) on Jun 15, 2004 at 01:57 UTC
    I'd like to see an interface style row on your comparision.

    Text::CSV_XS has such a crummy interface for both input and output. All that error flags checking and unnecessary OO. Are the others any better?

    I want something as simple as split & join, that just dies on bad input and is configurable via package vars or a parser object. It seems like the Text::CSV_XS interface was not driven by "make simple things simple". It probably only needs a wrapper module.

    (Seems like AnyData and DBD provide useful interfaces for their respective domains though)

Re: Comparison of the parsing features of CSV (and xSV) modules
by dbush (Deacon) on Jun 15, 2004 at 06:45 UTC

    Great node jZed. Definitely a ++ required.

    One additional thing that would be useful, for me at least, is how each handles the third field for the following inputs:

    "Just","Another",,"Hacker"
    "Just","Another","","Hacker"
    

    My view of the standard is that the first example should give undef and the second should be an empty string.

    Regards,
    Dom.

Re: Comparison of the parsing features of CSV (and xSV) modules
by Wally Hartshorn (Friar) on Jun 15, 2004 at 14:12 UTC

    Great post! I obviously have an interest in this given my recent self-inflicted problems with processing a CSV file. :-)

    About the only thing missing from most (all?) of these modules is a way to handle embedded, unescaped delimiter characters. (Wouldn't clean input data be nice? *sigh*) Perhaps setting the delimiter character to the empty string would trigger a separate set of logic that would handle that case. (A delimiter character followed by a separator character or a newline would be a real closing delimiter character, while others would be ignored, perhaps.)

    Not that I'm volunteering, of course. :-)

    Wally Hartshorn

      What would be some example data, how it's currently being parsed, and how you'd like it to be parsed?

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        Here's an example:

        "Smith","John",12/31/1962,"Author of "How to Break Programs" and other books","Bugger"

        I'm using a series of (somewhat fragile) regexes to change that to:

        "Smith","John",12/31/1962,"Author of ""How to Break Programs"" and other books","Bugger"

        Wally Hartshorn

Re: Comparison of the parsing features of CSV (and xSV) modules
by Tux (Monsignor) on May 08, 2007 at 13:37 UTC

    Just for the record, Text::CSV_XS does know the difference between NULL/undef and empty strings. This will show when using always_quote => 1:

       my $csv = Text::CSV_XS->new ({ always_quote => 1});
       $csv->combine ("", 1, undef, "bar");
       print $csv->string, "\r\n";

    Will result in:

       "","1",,"bar"

    Note the missing quotes for the undef entry

    If requested, I can consider adding an option to do the reverse also: when always_quoted is true, return empty columns as undef in the $csv->fields call.


    Enjoy, Have FUN! H.Merijn
Re: Comparison of the parsing features of CSV (and xSV) modules
by Tux (Monsignor) on Feb 19, 2008 at 12:54 UTC
Re: Comparison of the parsing features of CSV (and xSV) modules
by Tux (Monsignor) on Mar 01, 2008 at 16:57 UTC

    With the release of Text-CSV_XS 0.35 (Valloire), I think I have addressed the undef question enough to be really useful:

    > perl -MData::Dumper -MText::CSV_XS -l \ -e 'my $csv=Text::CSV_XS->new ({ blank_is_undef => 1 });' \ -e '$csv->parse (q{1,,"", });' \ -e 'print Dumper$csv->fields' $VAR1 = '1'; $VAR2 = undef; $VAR3 = ''; $VAR4 = ' ';

    Enjoy, Have FUN! H.Merijn

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://366587]
Approved by thraxil
Front-paged by perrin
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (20)
As of 2014-09-16 14:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (24 votes), past polls