Re^3: Comparison of the parsing features of CSV (and xSV) modules

by Wally Hartshorn (Hermit)
on Jun 15, 2004 at 21:14 UTC


in reply to Re^2: Comparison of the parsing features of CSV (and xSV) modules
in thread Comparison of the parsing features of CSV (and xSV) modules

Here's an example:

"Smith","John",12/31/1962,"Author of "How to Break Programs" and other books","Bugger"

I'm using a series of (somewhat fragile) regexes to change that to:

"Smith","John",12/31/1962,"Author of ""How to Break Programs"" and other books","Bugger"

Wally Hartshorn

Re^4: Comparison of the parsing features of CSV (and xSV) modules
by tilly (Archbishop) on Jun 18, 2004 at 02:01 UTC
    There are, of course, going to be boundary cases that don't work as expected as soon as you start allowing undoubled double-quotes inside a format that expects them doubled. However, Text::xSV lets you define an arbitrary filter that it uses to preprocess the text, and it should do a reasonable job on the above with the following filter:
    sub {
      my $line = shift;
      $line =~ s/\r$//;
      $line =~ s/"(.)/""$1/g;
      $line =~ s/"?,"?/,/g;
      return $line;
    }
    Yes, there is some fragility, but it should be at least moderately hard to trigger.
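As an illustration of the kind of line filter discussed here, the following is a minimal sketch, not the regexes Wally uses and not the filter quoted above. It assumes that a double-quote is a field delimiter only when it sits at the start of the line, at the end of the line, or next to a comma, and it doubles every other double-quote. The name `repair_quotes` is hypothetical, and the commented-out `filter` option for Text::xSV is an assumption that should be checked against the module's documentation.

```perl
use strict;
use warnings;

# Hypothetical line filter: takes one raw line, returns the repaired line.
sub repair_quotes {
    my ($line) = @_;

    # Strip a trailing carriage return (Windows line endings).
    $line =~ s/\r$//;

    # Double every quote that is NOT at a field boundary:
    #   (?<!^)   not at the start of the line
    #   (?<!,)   not directly after a comma (opening quote of a field)
    #   (?!,|$)  not directly before a comma or the end of the line
    $line =~ s/(?<!^)(?<!,)"(?!,|$)/""/g;

    return $line;
}

# Assumed wiring, to be verified against the Text::xSV documentation:
# my $csv = Text::xSV->new(filter => \&repair_quotes);
```

On the example line from the top of the thread this yields the doubled-quote form shown there. It is not idempotent: a line whose embedded quotes are already doubled would have them doubled again, so it only suits data known to contain undoubled quotes. It also leaves a quote next to an embedded comma untouched, because that quote looks like a delimiter.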
Re^4: Comparison of the parsing features of CSV (and xSV) modules
by dragonchild (Archbishop) on Jun 15, 2004 at 23:17 UTC
    And, what should the parser do with the following:
    "Smith","John",12/31/1962,"Author of "How to Break Programs" and other books,"Bugger"
    "Smith","John",12/31/1962,Author of "How to Break Programs" and other books,"Bugger"
    "Smith","John",12/31/1962,'Author of "How to Break Programs" and other books,"Bugger"
    "Smith","John",12/31/1962,'Author of "How to Break Programs" and other books',"Bugger"

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      And, what should the parser do with the following:
      "Smith","John",12/31/1962,"Author of "How to Break Programs" and other books,"Bugger"
      "Smith","John",12/31/1962,"Author of ""How to Break Programs"" and other books,"Bugger"

      "Smith","John",12/31/1962,Author of "How to Break Programs" and other books,"Bugger"
      "Smith","John",12/31/1962,Author of ""How to Break Programs"" and other books,"Bugger"

      "Smith","John",12/31/1962,'Author of "How to Break Programs" and other books,"Bugger"
      (Reject?)

      "Smith","John",12/31/1962,'Author of "How to Break Programs" and other books',"Bugger"
      (Reject?)

      (I haven't encountered any improperly quoted data, just data that doesn't escape embedded delimiters.)

      Wally Hartshorn

        What about the following:
        abcd,"efgh,"ijkl,"mnop",qrst
        Is that malformed or is that meant to be
        abcd,"efgh,""ijkl,""mnop",qrst

        The issue is that there are too many edge cases for a general-purpose parser to handle. I'm coming up with a bunch and I'm not even trying hard.
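One way to make these edge cases visible, rather than guess at them, is to check each line against a strict grammar and reject anything that does not fit. The sketch below is only an assumption about what "strict" means here: a field is either fully quoted with every inner double-quote doubled, or unquoted with no double-quotes or commas at all. The names `$FIELD` and `is_strict_csv_line` are hypothetical, and the line is expected to be chomped first.

```perl
use strict;
use warnings;

# One field: a quoted field whose inner quotes are all doubled,
# or an unquoted run containing neither quotes nor commas.
my $FIELD = qr/"(?:[^"]|"")*"|[^",]*/;

# Returns 1 if the whole (chomped) line is a comma-separated list of
# such fields, 0 otherwise.
sub is_strict_csv_line {
    my ($line) = @_;
    return $line =~ /\A$FIELD(?:,$FIELD)*\z/ ? 1 : 0;
}
```

Under this check the line `abcd,"efgh,"ijkl,"mnop",qrst` is rejected and `abcd,"efgh,""ijkl,""mnop",qrst` is accepted. The check cannot say which repair was intended; it only flags the line so that a person or a source-specific rule can decide.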

