Comparison of the parsing features of CSV (and xSV) modules

There are at least six CPAN modules that handle "text delimited formats", more properly called "text separated formats", such as CSV (Comma Separated Values). The chart below compares their parsing properties: how they define and handle the data format options. The chart does *not* show other differences between the modules (e.g. that Text::xSV has specialized formatting and printing routines, that Text::CSV_XS has routines related to data typing, that the DBDs support DBI/SQL access to the data formats, etc.).

modules

This comparison covers

 Text::CSV
 Text::xSV
 Text::CSV_XS
 DBD::CSV
 DBD::AnyData
 AnyData

disclaimer

I am the maintainer of the last three modules. If I've inadvertently misrepresented any of the modules, it's out of ignorance; please correct me. My congrats to Tilly, Alan Citterman, and Jochen Wiedmann, authors of the other excellent modules on the list.

First some definitions:

CSV

Comma Separated Values is not a single standard; it refers to a number of slightly different ways to represent data. There is no "Correct CSV", only CSV that is correct according to the rules of a particular CSV style. "Classic" CSV, the kind that many people think of when they talk about CSV, is a set of records separated by newlines, with the fields of each record separated by commas, the contents of the fields (in some cases) delimited with double quote marks, and a doubled double quote as the escape sequence within fields. But there is, AFAIK, no ISO or ANSI or other international standard defining this "classic" CSV as the one true CSV. All of the modules compared here except Text::CSV allow redefinition of the separator character, so the format is really *SV, as it includes "tab delimited" and "pipe delimited" formats which simply use tabs or pipes in the place where CSV uses commas.

These words form a comma-SEPARATED, period-TERMINATED record with four quote-DELIMITED fields.

 "Just","Another","CSV","Hacker".

field separator

what goes between fields: a comma in classic CSV, but e.g. a tab or pipe in "tab delimited" or "pipe delimited" formats
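
As a rough illustration of a redefinable field separator, here is a minimal sketch using Python's standard csv module as a neutral stand-in. It is not one of the Perl modules compared here, and the helper name parse_line is hypothetical:

```python
import csv
import io

def parse_line(line, sep):
    """Parse one separated-values line using the given field separator."""
    return next(csv.reader(io.StringIO(line), delimiter=sep))

# The same four fields in three "xSV" flavours.
comma_fields = parse_line('Just,Another,CSV,Hacker', ',')
tab_fields = parse_line('Just\tAnother\tCSV\tHacker', '\t')
pipe_fields = parse_line('Just|Another|CSV|Hacker', '|')

print(comma_fields == tab_fields == pipe_fields)  # True
```

Only the separator changes between the three calls; the quoting and escaping rules stay the same.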

delimiter char

what goes around fields: a pair of double quotes in classic CSV, but some modules allow it to be redefined

escape char

the character used to escape the delimiter when it occurs embedded in a field: a double quote in classic CSV (e.g. "this, ""is"" one field"), but some modules allow it to be redefined (e.g. to a backslash)
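
To make the two escaping styles concrete, here is a small sketch, again using Python's standard csv module purely as an illustration rather than any of the Perl modules. The sample strings are made up for the example:

```python
import csv
import io

# Classic style: a delimiter inside a field is escaped by doubling it.
doubled = '"this, ""is"" one field",second'
# Alternate style: a backslash escapes the embedded delimiter.
backslashed = '"this, \\"is\\" one field",second'

row_doubled = next(csv.reader(io.StringIO(doubled)))
row_backslashed = next(
    csv.reader(io.StringIO(backslashed), doublequote=False, escapechar='\\')
)

print(row_doubled)      # ['this, "is" one field', 'second']
print(row_backslashed)  # ['this, "is" one field', 'second']
```

Both inputs decode to the same two fields; only the escape convention differs.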

record separator

what goes between records: a newline in classic CSV, but some modules allow it to be redefined; this can be critical if you are mixing CSV files created on different operating systems without using something like dos2unix to convert them, since the newline differs between OSs; alternate record separators also allow data in "vertical" formats, e.g. where a newline is the field separator and a double newline is the record separator
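
The "vertical" format can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (no delimiters, no empty fields), and the function name parse_vertical is hypothetical:

```python
def parse_vertical(text):
    """Parse a 'vertical' layout: newline separates fields,
    a blank line (double newline) separates records."""
    # Normalize DOS and old Mac line endings first, much as dos2unix would.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    records = []
    for chunk in text.strip('\n').split('\n\n'):
        records.append(chunk.split('\n'))
    return records

unix_data = "Just\nAnother\n\nCSV\nHacker\n"
dos_data = "Just\r\nAnother\r\n\r\nCSV\r\nHacker\r\n"

print(parse_vertical(unix_data))  # [['Just', 'Another'], ['CSV', 'Hacker']]
print(parse_vertical(unix_data) == parse_vertical(dos_data))  # True
```

Without the normalization step, the DOS input would not contain a double "\n" at all and would be read as a single record, which is the cross-platform trap described above.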

ability to accept embedded newlines

the ability to use the newline character inside a field, obviously critical if your data has newlines

ability to reject embedded newlines

sometimes this is the desired behaviour, e.g. if you are prepping data for another program which won't accept embedded newlines
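
The accept and reject behaviours can be sketched as follows, using Python's standard csv module as a stand-in for the Perl parsers. The parse_rows helper and its allow_embedded_newlines flag are hypothetical names invented for this example:

```python
import csv
import io

def parse_rows(data, allow_embedded_newlines=True):
    """Parse CSV text; optionally fail if any field contains a newline."""
    rows = list(csv.reader(io.StringIO(data, newline='')))
    if not allow_embedded_newlines:
        for row in rows:
            for field in row:
                if '\n' in field or '\r' in field:
                    raise ValueError('embedded newline in field: %r' % field)
    return rows

data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Accepting: the quoted field keeps its newline and stays one field.
print(parse_rows(data)[1])  # ['1', 'first line\nsecond line']

# Rejecting: the same input is treated as an error.
try:
    parse_rows(data, allow_embedded_newlines=False)
except ValueError as err:
    print('rejected:', err)
```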

ability to accept embedded binary data

the ability to use binary data (e.g. NUL chars or ^L) embedded in fields

ability to reject embedded binary data

again, sometimes this is the desired behaviour - if you are prepping for a program that won't accept binary data, you want the parser to fail on parsing

ability to allow sparse delimiting

classic CSV uses sparse delimiting - it uses delimiters only around fields that need them, e.g. those fields that have embedded commas, newlines, or quotes; with sparse delimiting this is a valid 3-field record: foo,"bar,bop",7

support for forced delimited writes

some CSV styles always use delimiters for all fields, so some modules support forcing delimiters onto all fields or onto all non-numeric fields
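
The difference between sparse and forced delimiting on output can be sketched with Python's standard csv module (used here only for illustration, not as a statement about how the Perl modules implement it). The write_record helper is a hypothetical name:

```python
import csv
import io

def write_record(fields, quoting):
    """Format one record as a CSV line using the given quoting policy."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator='\n').writerow(fields)
    return buf.getvalue()

record = ['foo', 'bar,bop', 7]

sparse = write_record(record, csv.QUOTE_MINIMAL)          # only where needed
forced = write_record(record, csv.QUOTE_ALL)              # every field
non_numeric = write_record(record, csv.QUOTE_NONNUMERIC)  # all but numbers

print(sparse, end='')       # foo,"bar,bop",7
print(forced, end='')       # "foo","bar,bop","7"
print(non_numeric, end='')  # "foo","bar,bop",7
```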

null differentiated from empty

Text::xSV differentiates between null (undefined values) and an empty string. The other modules treat them the same.
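
One way such a distinction could be represented on output is sketched below in Python. The convention used (nothing at all for a null, an empty pair of delimiters for an empty string) is an assumption made for this example, not something taken from Text::xSV's documentation, and the function names are hypothetical:

```python
def format_field(value):
    """Format one field; None (null) and '' (empty) get different output.

    Assumed convention: null -> nothing between separators,
    empty string -> an empty pair of double quotes.
    """
    if value is None:
        return ''
    text = str(value)
    if text == '' or any(ch in text for ch in ',"\n\r'):
        return '"' + text.replace('"', '""') + '"'
    return text

def format_record(values):
    """Join formatted fields with commas into one CSV line (no newline)."""
    return ','.join(format_field(v) for v in values)

line = format_record(['a', None, '', 'b,c'])
print(line)  # a,,"","b,c"
```

A parser that treats both of the middle fields as the same value loses that information, which is what the other modules do.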

pure perl

some of the modules are pure-perl and therefore can be installed without compilation; others have C/XS components and require compilation on a specific platform; the C/XS modules are generally faster than the pure perl modules

The Comparison Chart

A plus mark indicates the presence of a feabugir (feature or bug or irrelevant, depending on the context), not necessarily that it is "better" than a minus mark.

	Text::CSV	Text::xSV	Text::CSV_XS	DBD::CSV	AnyData
accept newlines	-	+	+	+	*
reject newlines	+	-	+	-	+
accept embedded binary	-	+	+	+	+
reject embedded binary	+	-	+	-	-
forced delimiting	+	-	+	-	-
sparse delimiting	-	+	+	+	+
user-defined field sep	-	+	+	+	+
user-defined delimiter	-	-	+	+	+
user-defined escape	-	-	+	+	+
user-defined record sep	-	-	+	+	+
pure perl	+	+	-	-	+
null handling	-	+	-	-	-

Notes
Some of the modules accept flags which can change their default behaviour, e.g. Text::CSV_XS defaults to rejecting newlines but can easily be set to accept them by passing the "binary" flag. In these cases, they are shown with plus marks for all possible settings.

DBD::AnyData has the same properties as AnyData (which is a multi-level tied-hash interface to the data). Both accept embedded newlines only if something other than newline is used as the record separator, hence the asterisk in the chart.

DBD::CSV is actually built on top of Text::CSV_XS but since it uses specific flags for Text::CSV_XS, its parsing properties are somewhat different.

Update: added readmore tags. Update 2: added null handling.
