Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

DBD::CSV - how to I coax it to read BOM prefixed files?

by ELISHEVA (Prior)
on May 28, 2017 at 21:06 UTC ( #1191448=perlquestion: print w/replies, xml ) Need Help??

ELISHEVA has asked for the wisdom of the Perl Monks concerning the following question:

I have a bom prefixed file that I'd like DBD::CVS (v. 0.49) to read and query, but when I tried I kept getting the error message Execution ERROR: Missing first row due to EIF - Loose unescaped quote . But there are no unquoted fields and the only thing I can think is happening is that it sees the BOM and thinks the first field of the first row begins without a quote. (For those not familiar with the DBD::CSV documentation "Loose unescaped quotes" are quotes embedded in a field that begins without a quote).

I'm fairly certain the problem is the BOM because if I remove the BOM, the file can be read and loaded without error

What's puzzling is that the underlying Text::CSV_XS module is BOM friendly. I've tried a variety of things to no avail, among them:

  • setting f_encoding => 'utf8'
  • setting csv_detect_bom => 1 and detect_bom => 1 in hopes that it would be passed onto the underlying Text::CSV_XS module
  • using an explicit column list (via col_names) and skipping the first line via csv_skip_first_row
  • setting csv_allow_loose_quotes => 1. That suppresses the EIF error message but replaces it with complaints that the column name stored in the first field in the first line doesn't exist - adding to my suspicions that it thinks the BOM is part of the field name

I had considered just removing the BOM and ignoring the problem, but this isn't a one-off. A data source I don't control is producing these BOM prefixed files. I'd rather not be manually munging files (even with a script), every few months. Furthermore the source, or one of them, is the OECD. They produce a lot of data and I'd like that data to be accessible to other Perl programmers in CSV format, either through the solutions we discuss on this thread or because DBD::CSV has been enhanced to be BOM friendly.

Many thanks in advance

Replies are listed 'Best First'.
Re: DBD::CSV - how to I coax it to read BOM prefixed files?
by afoken (Canon) on May 28, 2017 at 21:54 UTC

    As I wrote in the CB, I think File::BOM might be the way to do it. I'm still searching, but at least I've a working example that shows the error:

    #!/usr/bin/perl use v5.12; use warnings; use autodie qw( :all ); use File::BOM; use DBI; open my $h,'>:encoding(utf-8)','test.csv'; say $h qq<\x{FEFF}"foo","bar","baz">; say $h qq<"1","2","3">; say $h qq<"4","5","6">; close $h; my $dbh=DBI->connect('dbi:CSV:',undef,undef,{ RaiseError => 1, PrintEr +ror => 0, f_ext => '.csv'}); my $sth=$dbh->prepare('select * from test'); $sth->execute(); while (my @a=$sth->fetchrow_array()) { say join(",",@a); } $sth->finish();

    Update:

    It seems that only the header() method in Text::CSV_XS is able to handle the BOM, and DBD::CSV does not call that method. Instead, it calls getline().

    #!/usr/bin/perl use v5.12; use warnings; use autodie qw( :all ); use File::BOM; use DBI; use Data::Dumper; package My::Text::CSV_XS { use parent "Text::CSV_XS"; sub DUMP { return Data::Dumper->new([\@_],['*_'])->Sortkeys(1)->I +ndent(1)->Useqq(1)->Dump(); } sub new { my $proto=shift; say "$proto -> new(",DUMP(@_),")"; $proto->SUPER::new(@_); } sub header { my $self=shift; say "$self -> header(",DUMP(@_),")"; $self->SUPER::header(@_); } sub getline { my $self=shift; say "$self -> getline(",DUMP(@_),")"; $self->SUPER::getline(@_); } sub getline_hr { my $self=shift; say "$self -> getline_hr(",DUMP(@_),")"; $self->SUPER::getline_hr(@_); } sub getline_all { my $self=shift; say "$self -> getline(",DUMP(@_),")"; $self->SUPER::getline_all(@_); } sub getline_hr_all { my $self=shift; say "$self -> getline_hr(",DUMP(@_),")"; $self->SUPER::getline_hr_all(@_); } } open my $h,'>:encoding(utf-8)','test.csv'; say $h qq<\x{FEFF}"foo","bar","baz">; say $h qq<"1","2","3">; say $h qq<"4","5","6">; close $h; my $dbh=DBI->connect('dbi:CSV:',undef,undef,{ RaiseError => 1, PrintEr +ror => 0, f_ext => '.csv', csv_class => 'My::Text::CSV_XS'}); my $sth=$dbh->prepare('select * from test'); $sth->execute(); while (my @a=$sth->fetchrow_array()) { say join(",",@a); } $sth->finish();
    >perl test2.pl My::Text::CSV_XS -> new(@_ = ( { "auto_diag" => 1, "binary" => 1, "escape_char" => "\"", "quote_char" => "\"", "sep_char" => "," } ); ) My::Text::CSV_XS -> new(@_ = ( { "auto_diag" => 1, "binary" => 1, "eol" => "\r\n", "escape_char" => "\"", "quote_char" => "\"", "sep_char" => "," } ); ) My::Text::CSV_XS=HASH(0x23e04e8) -> getline(@_ = ( bless( \*Symbol::GEN1, 'IO::File' ) ); ) DBD::CSV::st execute failed: Execution ERROR: Missing first row due to EIF - Loose unescaped quote +at /usr/lib64/perl5/vendor_perl/DBI/DBD/SqlEngine.pm line 1480. . at /usr/lib64/perl5/vendor_perl/DBI/DBD/SqlEngine.pm line 1271. [for Statement "select * from test"] at test2.pl line 71, <GEN1> line + 1. >

    Second Update:

    There is an ugly way to get File::BOM into DBD::CSV. DBD::CSV inherits from DBD::File, and DBD::File has an f_encoding attribute that is simply wrapped in :encoding(...), without further processing. That string is then passed to binmode.

    This is the relevant part of DBD::File:

    sub apply_encoding { my ($self, $meta, $fn) = @_; defined $fn or $fn = "file handle " . fileno ($meta->{fh}); if (my $enc = $meta->{f_encoding}) { binmode $meta->{fh}, ":encoding($enc)" or croak "Failed to set encoding layer '$enc' on $fn: $!"; } else { binmode $meta->{fh} or croak "Failed to set binary mode on $fn: $! +"; } } # apply_encoding

    So, with a carefully crafted encoding string, it works at least on my system:

    #!/usr/bin/perl use v5.12; use warnings; use autodie qw( :all ); use File::BOM; use DBI; open my $h,'>:encoding(utf-8)','test.csv'; say $h qq<\x{FEFF}"foo","bar","baz">; say $h qq<"1","2","3">; say $h qq<"4","5","6">; close $h; my $dbh=DBI->connect( 'dbi:CSV:', undef, undef, { RaiseError => 1, PrintError => 0, f_ext => '.csv', f_encoding => 'utf-8):via(File::BOM' # <-- this is the + evil trick! } ); my $sth=$dbh->prepare('select * from test'); $sth->execute(); while (my @a=$sth->fetchrow_array()) { say join(",",@a); } $sth->finish();

    Effectively, this calls binmode $meta->{fh}, ":encoding(utf-8):via(File::BOM)", and so, File::BOM can hide the BOM from Text::CSV_XS and DBD::CSV.

    >perl test3.pl 1,2,3 4,5,6 >

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Worked like a charm. Thank-you

        Worked like a charm. Thank-you

        Glad I could help.

        BUT:

        Until Tux' change is released on CPAN, please do yourself a favour and add an explaining comment how that trick works. This trick is everything but obvious, and I doubt that it will work for any combination of the relevant modules. Don't make your future self invent a time machine to come back and hurt you for playing dirty tricks.

        When the change is released, test it, get rid of my dirty trick, and explicitly require the new version of DBD::CSV.

        If your code has to work with "any" version of DBD::CSV, check its version, prefer to use the Tux' official csv_bom attribute, and only if it is not supported, fall back to use the File::BOM trick. Something like this:

        use DBD::CSV (); use if $DBD::CSV::VERSION < 0.50, File::BOM => (); # ^---- adjust to working version number my $dbh=DBI->connect( 'dbi:CSV:', undef, undef, { RaiseError => 1, PrintError => 0, f_ext => '.csv', ( # v---- adjust to working version numbe +r $DBD::CSV::VERSION < 0.50 ? (f_encoding => 'utf-8):via(File::BOM') : (csv_bom => 1) ) } );

        Alternatively, patch DBD::CSV at runtime if it is too old, and use Tux' official attribute csv_bom even with old versions:

        #!/usr/bin/perl use v5.12; use warnings; use autodie qw( :all ); use DBI; use DBD::CSV (); # v----- adjust to working version if ($DBD::CSV::VERSION < 0.50) { require File::BOM; my $old_connect=\&DBD::CSV::dr::connect; my $new_connect=sub { my ($drh, $dbname, $user, $auth, $attr) = @_; if ($attr && exists($attr->{'csv_bom'}) && $attr->{'cs +v_bom'}) { delete $attr->{'csv_bom'}; $attr->{'f_encoding'}='utf-8):via(File::BOM'; } goto $old_connect; }; do { no strict 'refs'; no warnings 'redefine'; *DBD::CSV::dr::connect=$new_connect; }; } open my $h,'>:encoding(utf-8)','test.csv'; say $h qq<\x{FEFF}"foo","bar","baz">; say $h qq<"1","2","3">; say $h qq<"4","5","6">; close $h; my $dbh=DBI->connect( 'dbi:CSV:', undef, undef, { RaiseError => 1, PrintError => 0, f_ext => '.csv', csv_bom => 1, } ); my $sth=$dbh->prepare('select * from test'); $sth->execute(); while (my @a=$sth->fetchrow_array()) { say join(",",@a); } $sth->finish();

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: DBD::CSV - how to I coax it to read BOM prefixed files?
by Tux (Abbot) on May 29, 2017 at 06:23 UTC

    Work in progress, edited on 2017-02-28, not yet on CPAN, not even committed:

    use v5.12; use warnings; use autodie qw( :all ); use Data::Peek; use DBI; open my $fh, ">:encoding(utf-8)", "test.csv"; say $fh qq<\x{FEFF}"foo","bar","baz">; say $fh qq<"1","2","3">; say $fh qq<"4","5","6">; close $fh; my $dbh=DBI->connect ("dbi:CSV:", undef, undef, { RaiseError => 1, PrintError => 0, f_ext => ".csv/r", csv_bom => 1, }); my $sth=$dbh->prepare ("select * from test"); $sth->execute (); while (my @a = $sth->fetchrow_array ()) { DDumper \@a; } $sth->finish ();
    $ perl test.pl [ 1, 2, 3 ] [ 4, 5, 6 ]

    I can commit and push, so you can play. There are no tests yet and no documentation, so that will delay a real release.

    update: 2017-05-29 09:00, pushed with new documentation to github, get the tgz here to test. If all works out as you want, I'll make a release


    Enjoy, Have FUN! H.Merijn

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1191448]
Approved by cguevara
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (7)
As of 2020-07-10 09:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?