Comparison of the parsing features of CSV (and xSV) modules
by jZed (Prior)
on Jun 14, 2004 at 16:01 UTC
There are at least six CPAN modules that handle "text delimited formats", more properly called "text separated formats", such as CSV (Comma Separated Values). The chart below attempts to compare their parsing properties: how they handle and define the data format options. The chart does *not* show other differences between the modules (e.g. that Text::xSV has specialized formatting and printing routines, that Text::CSV_XS has routines related to data typing, that the DBDs support DBI/SQL access to the data formats, etc.).
This comparison covers
I am the maintainer of the last three modules. If I've inadvertently misrepresented any of the modules, it's out of ignorance, so please correct me. My congrats to Tilly, Alan Citterman, and Jochen Wiedmann, authors of the other excellent modules on the list.
Comma Separated Values is not a single standard; it refers to a number of slightly different ways to represent data. There is no "Correct CSV", only CSV that is correct according to the rules of a particular CSV style. "Classic" CSV, the kind many people think of when they talk about CSV, is a set of records separated by newlines, with the fields of each record separated by commas, the contents of the fields (in some cases) delimited with double quote marks, and a doubled double quote used as the escape character within fields. But there is, AFAIK, no ISO or ANSI or other international standard defining this "classic" CSV as the one true CSV. All of the CPAN modules which handle CSV formats allow redefinition of the separator character, so the format is really *SV: it includes "tab delimited" and "pipe delimited" formats, which simply use tabs or pipes in the place where CSV uses commas.
These words form a comma-SEPARATED, period-TERMINATED record with four quote-DELIMITED fields.
field separator: what goes between fields; a comma in classic CSV, but e.g. a tab or pipe in "tab delimited" or "pipe delimited" formats
field delimiter: what goes around fields; a pair of double quotes in classic CSV, but some modules allow it to be redefined
escape character: the character used to escape the delimiter when it occurs embedded in a field; a double quote in classic CSV (e.g. "this, ""is"" one field"), but some modules allow it to be redefined (e.g. to a backslash)
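To make the three terms above concrete, here is a minimal pure-Perl sketch of a one-record parser in which the separator, delimiter, and escape character are all options. The sub name parse_record and its option names (sep, quote, escape) are made up for this illustration and are not the API of any of the modules compared here; it also assumes each of the three is a single character and does not handle an escaped escape character.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from any CPAN module: split one record
# into fields using a configurable separator, delimiter and escape character.
sub parse_record {
    my ($line, %opt) = @_;
    my $sep   = defined $opt{sep}    ? $opt{sep}    : ',';
    my $quote = defined $opt{quote}  ? $opt{quote}  : '"';
    my $esc   = defined $opt{escape} ? $opt{escape} : $quote;

    my @fields;
    my $field     = '';
    my $in_quotes = 0;
    my @chars     = split //, $line;

    for (my $i = 0; $i < @chars; $i++) {
        my $c    = $chars[$i];
        my $next = $i < $#chars ? $chars[$i + 1] : '';

        if ($in_quotes) {
            if ($c eq $esc && $next eq $quote) {
                $field .= $quote;    # escaped delimiter: keep one literal copy
                $i++;                # and skip the character we just consumed
            }
            elsif ($c eq $quote) {
                $in_quotes = 0;      # closing delimiter
            }
            else {
                $field .= $c;        # ordinary character, separators included
            }
        }
        elsif ($c eq $quote) {
            $in_quotes = 1;          # opening delimiter
        }
        elsif ($c eq $sep) {
            push @fields, $field;    # separator ends the current field
            $field = '';
        }
        else {
            $field .= $c;
        }
    }
    die "unterminated delimited field\n" if $in_quotes;
    push @fields, $field;
    return @fields;
}

# Classic CSV, tab separated, and backslash-escaped variants.
print join(' | ', parse_record('foo,"bar,bop",7')), "\n";
print join(' | ', parse_record("a\tb\tc", sep => "\t")), "\n";
print join(' | ', parse_record(q{"say \"hi\"",2}, escape => '\\')), "\n";
```

For real data you would of course use one of the modules in the chart rather than a hand-rolled loop like this; the point is only to show which character plays which role.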
record separator: what goes between records; a newline in classic CSV, but some modules allow it to be redefined. This can be critical if you are mixing CSV files created on different operating systems without using something like dos2unix to convert them, since the newline is different on different OSs. Alternate record separators also allow data in "vertical" formats, e.g. where a newline is a field separator and a double newline is a record separator
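As a rough illustration of the "vertical" layout just described, here is a sketch that uses nothing but Perl's built-in input record separator, $/, on an in-memory filehandle. The sample data is invented, and this is plain Perl rather than the way any of the modules in the chart implement their record separator option.

```perl
use strict;
use warnings;

# Made-up sample data in a "vertical" layout: a newline separates fields,
# a double newline separates records.
my $data = "alice\n30\nOslo\n\nbob\n25\nLima\n";

open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n\n";                 # record separator: double newline
    while (my $record = <$fh>) {
        chomp $record;                 # chomp strips whatever $/ currently holds
        push @records, [ split /\n/, $record ];   # field separator: newline
    }
}
close $fh;

print join(',', @$_), "\n" for @records;
```

The same idea covers the mixed-OS case: setting $/ to "\r\n" before reading a DOS-style file lets chomp remove the whole line ending instead of leaving a stray carriage return on the last field.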
embedded newlines: the ability to use the newline character inside a field, obviously critical if your data has newlines
rejecting embedded newlines: sometimes this is the desired behaviour, e.g. if you are prepping data for another program which won't accept embedded newlines
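Both behaviours can be sketched with one small reader that either keeps joining physical lines while a delimited field is still open, or dies as soon as it sees one. The sub name read_record and its flag are hypothetical, and the sketch assumes classic CSV, where escaped quotes are doubled, so a complete record always contains an even number of double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the next logical record from a filehandle.
# Assumes classic CSV (escaped quote = doubled quote), so an odd number of
# double quotes means a delimited field is still open.
sub read_record {
    my ($fh, $allow_embedded_newlines) = @_;
    my $record = <$fh>;
    return unless defined $record;

    while (($record =~ tr/"//) % 2) {      # tr with no replacement just counts
        die "embedded newline in record\n" unless $allow_embedded_newlines;
        my $more = <$fh>;
        die "unterminated delimited field at end of input\n"
            unless defined $more;
        $record .= $more;                  # keep the newline inside the field
    }
    chomp $record;
    return $record;
}

my $data = qq{1,"two\nlines",3\n4,5,6\n};
open my $fh, '<', \$data or die "cannot open in-memory file: $!";

my @records;
while (defined(my $r = read_record($fh, 1))) {
    push @records, $r;
}
close $fh;

print scalar(@records), " records read\n";
```

Calling read_record with a false second argument gives the strict behaviour: the first record above would make it die instead of silently swallowing a second physical line.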
embedded binary data: the ability to use binary data (e.g. NUL characters or ^L) embedded in fields
rejecting binary data: again, sometimes this is the desired behaviour; if you are prepping for a program that won't accept binary data, you want the parser to fail on parsing
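If your module of choice accepts binary data but the downstream program does not, a guard along these lines can do the failing for you. The sub name assert_no_binary is made up, and what counts as "binary" here (any control character except tab, CR and LF, plus DEL) is my assumption, not the definition used by any of the modules in the chart.

```perl
use strict;
use warnings;

# Hypothetical check. The character class is an assumption: control
# characters other than tab, CR and LF, plus DEL, are treated as binary.
sub assert_no_binary {
    my ($field) = @_;
    if ($field =~ /([\x00-\x08\x0B\x0C\x0E-\x1F\x7F])/) {
        die sprintf "binary character 0x%02X in field\n", ord $1;
    }
    return $field;
}

my @fields = ("plain text", "form\x0Cfeed");   # \x0C is ^L
for my $f (@fields) {
    my $ok = eval { assert_no_binary($f); 1 };
    print $ok ? "accepted\n" : "rejected: $@";
}
```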
sparse delimiting: classic CSV puts delimiters only around fields that need them, e.g. those fields that have embedded commas, newlines, or quotes; with sparse delimiting this is a valid 3-field record: foo,"bar,bop",7
forced delimiting: some CSV styles always use delimiters for all fields, so some modules support forcing delimiters onto all fields or onto all non-numeric fields
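On the output side, the difference between the delimiting styles comes down to one decision per field. The sketch below is a made-up formatter (format_record, with the mode names 'sparse', 'all' and 'non_numeric' chosen for this example only); it assumes classic CSV with a comma separator and doubled quotes as the escape, and its idea of "numeric" is a simple decimal-number regex.

```perl
use strict;
use warnings;

# Hypothetical formatter: $mode is 'sparse' (the default), 'all', or
# 'non_numeric'. Assumes comma separator and doubled-quote escaping.
sub format_record {
    my ($fields, $mode) = @_;
    $mode = 'sparse' unless defined $mode;

    my @out;
    for my $f (@$fields) {
        my $needs_quotes
            = $mode eq 'all'         ? 1
            : $mode eq 'non_numeric' ? $f !~ /^-?\d+(?:\.\d+)?$/
            :                          $f =~ /[",\r\n]/;

        if ($needs_quotes) {
            (my $escaped = $f) =~ s/"/""/g;   # double any embedded quote
            push @out, qq{"$escaped"};
        }
        else {
            push @out, $f;
        }
    }
    return join ',', @out;
}

my @record = ('foo', 'bar,bop', 7);
print format_record(\@record), "\n";                  # sparse
print format_record(\@record, 'all'), "\n";           # every field delimited
print format_record(\@record, 'non_numeric'), "\n";   # numbers left bare
```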
pure Perl vs. C/XS: some of the modules are pure Perl and therefore can be installed without compilation; others have C/XS components and require compilation on a specific platform. The C/XS modules are generally faster than the pure-Perl modules
A plus mark indicates the presence of a feabugir (feature or bug or irrelevant, depending on the context), not necessarily that it is "better'' than a minus mark.
DBD::AnyData has the same properties as AnyData (which is a multi-level tied-hash interface to the data); both accept embedded newlines only if something other than newline is used as the record separator.
DBD::CSV is actually built on top of Text::CSV_XS, but since it uses specific flags for Text::CSV_XS, its parsing properties are somewhat different.
Update: added readmore tags.
Update 2: added null handling.