http://www.perlmonks.org?node_id=321627


in reply to Re: DBD::CSV limitation versus aging/unmaintained modules
in thread DBD::CSV limitation versus aging/unmaintained modules

Try taking the CSV file and cutting it in halves

Gah! One could probably find the problem with much, much less effort if one weren't pathologically opposed to the use of debuggers. q-:

But seriously, in this case I'd build a hash of record IDs returned by Text::CSV and then use the method that "works" to report the records where Text::CSV starts/stops seeing records:

#!/usr/bin/perl
use strict;
use DBI;

# Collect the record IDs that DBD::CSV actually returns.
my $tdbh = DBI->connect("DBI:CSV:")
    or die "Cannot connect: " . $DBI::errstr;
my $sth = $tdbh->prepare("select * from ofphl");
$sth->execute();

my %dbi;
while ( my $rec = $sth->fetchrow_hashref() ) {
    $dbi{ $rec->{id_field_name} }++;    # replace with the real ID column name
}

# Walk the raw file and report where DBI stopped/started seeing records.
open FILE, "< file.csv" or die "Can't read file.csv: $!\n";
my $has = 1;
$| = 1;
while ( <FILE> ) {
    my $id = ( split /,/ )[0];    # Assuming ID is the first field
    if ( !$has != !$dbi{$id} ) {
        print "DBI ", ( $has ? "stopped" : "started" ), " at record $id.\n";
        $has = !$has;
    }
}

Note that you might need to concatenate more than one field if there isn't a unique ID field.
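
If, say, two columns together are unique, the hash key can just be both of them joined on a separator that can't appear in the data. A tiny sketch with made-up column names:

    # Sketch only: composite key from two hypothetical columns instead of a single ID.
    my %dbi;
    my $rec = { cust_id => 1001, order_date => "2004-01-15" };   # stands in for a fetched row
    $dbi{ join "\0", @{$rec}{qw(cust_id order_date)} }++;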

                - tye

Re: Re: DBD::CSV limitation versus aging/unmaintained modules (lazy)
by tilly (Archbishop) on Jan 15, 2004 at 21:59 UTC
    For those who don't know what tye is joking about, see Are debuggers good?.

    On the debugging suggestion, you're right that that is a faster approach. It wasn't the one that immediately came to mind for me, but that's life.

    Of course I still suspect that running through the file with Text::xSV, once, will find your error pretty fast if there is an error in the file.
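
    Something along these lines (assuming the file is file.csv and its first line is a header row) would be enough to surface its warnings:

        use strict;
        use Text::xSV;

        my $csv = Text::xSV->new();
        $csv->open_file("file.csv");   # assumed filename, as in the script above
        $csv->read_header();           # assumes the first line holds the column names

        my $rows = 0;
        while ( $csv->get_row() ) {    # Text::xSV should warn as it hits suspect rows
            $rows++;
        }
        print "Read $rows data rows.\n";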

Re: Re: DBD::CSV limitation versus aging/unmaintained modules (lazy)
by Eyck (Priest) on Jan 16, 2004 at 10:53 UTC

    Hmm, I thought I had already found where the problem is: record 3964 is the one after which DBD::CSV stops noticing more records.

    I just can't find what exactly DBD::CSV finds wrong about that record/line, and, more importantly, why it doesn't emit any kind of warning when it finds those 'corrupted' lines.

      It is possible that Text::xSV can point it out for you. The most likely culprit is an unmatched " causing it to read the entire rest of the file as one really, really long line. (It keeps switching from quoted to non-quoted and back each time it hits a ", and always hits the end of line while inside quotes, so it includes the newline in a field.)

      If you post 3 lines from the file (before that line, on that line, and the next line) I should be able to visually spot it. But before you post, verify that DBD::CSV thinks that those 3 lines are only 2 rows.

      OK, thanks tilly. After reading your reply I finally started to see what is wrong with that line; it contains something like this:

      ,"Description description "hi world" rest of description",
      And overly-smart modules fail to parse that (not surprisingly).
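
      Just to see the failure in isolation, something like this sketch (plain Text::CSV; error_input() hands back the line it choked on) refuses that fragment:

          use Text::CSV;

          # The offending fragment: a quoted string nested inside a quoted field.
          my $line = q{,"Description description "hi world" rest of description",};

          my $csv = Text::CSV->new();
          if ( $csv->parse($line) ) {
              print join(" | ", $csv->fields()), "\n";
          }
          else {
              print "Parse failed on: ", $csv->error_input(), "\n";   # prints the rejected line
          }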

      While it's easy to state that such a file is badly formatted, it was emitted from a large Oracle-based system and there's nothing I can do about it (not that I would pursue such a noble cause now that I've solved the problem on my side).

        But you have not actually solved the problem from your side. You have just hidden it: you have guaranteed that if any field anywhere has a comma in it, you will silently get wrong results.

        I would suggest having your code at least put in some highly visible check, for instance for an unexpected number of fields. And escalate the formatting issue a level or two, because if their output doesn't correctly format CSV, then at some point there is nothing you can do to work around the breakage.
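
        A rough sketch of the kind of check I mean, assuming you are reading the file yourself and splitting on commas (the expected column count is a placeholder):

            use strict;
            use warnings;

            my $expected = 12;    # placeholder: the real number of columns
            open my $fh, '<', 'file.csv' or die "Can't read file.csv: $!\n";
            while ( my $line = <$fh> ) {
                chomp $line;
                # The -1 keeps trailing empty fields; the ()= idiom counts the list.
                my $got = () = split /,/, $line, -1;
                warn "Line $.: $got fields, expected $expected\n"
                    if $got != $expected;
            }

        A line with a comma embedded inside a quoted field will trip this check too, which in your situation is exactly the kind of thing you want to hear about rather than silently mis-parse.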