Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

DBD::CSV and embedded CR-LFs

by Wally Hartshorn (Hermit)
on Jun 11, 2004 at 21:15 UTC ( [id://363524]=note: print w/replies, xml ) Need Help??


in reply to Re: Finding CR-LFs within quoted CSV fields
in thread Finding CR-LFs within quoted CSV fields

Hmm... That's odd. This is an existing Perl program that is using DBD::CSV and was working fine for months until last week. The user brought up the web-based upload form, uploaded their CSV file (exported from FoxPro), and got 0 records imported. We looked at the file they were uploading and discovered there were a few records that had embedded CR-LFs. Adding a regex to pre-process the uploaded file and remove the embedded CR-LFs solved the problem and allowed DBD::CSV to process the file.

If DBD::CSV can already handle that on its own, then I'm mystified as to why our regex would solve the problem. It looks like we'll need to investigate further. We'll double-check things on Monday to make sure we got the results we thought and will get back you. Thanks!

Wally Hartshorn

Replies are listed 'Best First'.
Re: DBD::CSV and embedded CR-LFs
by jZed (Prior) on Jun 11, 2004 at 21:25 UTC
    If DBD::CSV can already handle that on its own, then I'm mystified as to why our regex would solve the problem
    Me too. Since diotalevi seems to have had the same problem, it's possible there's a bug somewhere. I'd really appreciate help in tracking it down. Could you and diotalevi both please let me know which versions of SQL::Statement and DBD::File you're using, it's more likely to be in them.

      Ah HA! (*smack* -> head) I've figured out the problem, and it has nothing to do with DBD::CSV.

      The CSV file that we're processing is exported in an unusual manner. Embedded quotes aren't escaped in any way. We have a routine that pre-processes the CSV to find those embedded quotes and escapes them (such that a " becomes a "" pair). The processed CSV file is then handed off to DBD::CSV for normal processing.

      Recently, the user added a field to the CSV file, which promptly broke our program. In the process of trying to figure out the problem (because, of course, they didn't tell us they had added a field, only that the program had stopped working), we discovered that CR-LFs were embedded in some fields. We then leaped to the (incorrect) conclusion that this was a recent occurence and the cause of our problems.

      We later learned about the addition of the new field. However, it turned out that the new field also triggered a bug in our pre-processing code, but we didn't know this at the time.

      In trying to fix the supposed problem with the embedded URLs, we had replaced the buggy pre-processing code. So, when we ran the program with the (correct) CR-LF replacement code, it worked. When we reverted to the (buggy) old pre-processing code, it stopped working. That led us to the incorrect conclusion that the CR-LF replacement code was needed.

      So, in summary, DBD::CSV handles embedded CR-LFs fine. Mystery solved! (Of course, if there's a nifty setting to handle unescaped quotes, I'd be glad to learn of it!)

      Wally Hartshorn

        I've figured out the problem,
        I'm glad for you! :-)
        and it has nothing to do with DBD::CSV.
        I'm glad for me! :-)
        Of course, if there's a nifty setting to handle unescaped quotes, I'd be glad to learn of it!)
        Are the fields already quote delimited? In other words do they resemble number 1 or number 2?
           1. "foo","bad " bad","bar"
           2. foo,bad " bad,bar   
        
        If you have records of type #1, I don't have much to suggest beyond the pre-processing you are already doing. If your records are of type #2, however, you ought to be able to handle that all within DBD::CSV. Set csv_delim_char to undef, in which case the stray quote will not be an embedded delimiter, it will just be a quote because you will have set the delimiter character to be undef rather than double-quote. If that won't work, show us some data and maybe someone will have a suggestion.

        Thanks for following up to let us know how it turned out.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://363524]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (6)
As of 2024-04-18 07:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found