Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Handling an embedded newline in an unquoted CSV field

by olivierp (Hermit)
on Dec 10, 2011 at 10:20 UTC ( #942780=perlquestion: print w/ replies, xml ) Need Help??
olivierp has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a Microsoft application generated CSV file (encoded with MS's variant of Unicode), where certain fields can contain a newline.
To my demise, they are not quoted as they neither contain the field separator OR the Windows eol marker. E.g it looks like this:

Field1,Field2,Field3\r\n
text\ncontinued,"I have a comma, so I'm quoted",I don't\r\n

I've tried Text::xSV & Text::CSV, and both get trapped on the embedded newline.
Is there something I'm missing to coax either module (or another suggestion) to handle this cleanly ?


--
Olivier

Comment on Handling an embedded newline in an unquoted CSV field
Re: Handling an embedded newline in an unquoted CSV field
by Tux (Monsignor) on Dec 10, 2011 at 10:30 UTC

    Not that I am aware off. You're trapped.

    If you know in advance how many fields each line should have, you could create your own parser based on this example.


    Enjoy, Have FUN! H.Merijn
      DBD::CSV has always worked best for me with CSVs and lets me suck it all into an HoH, but not sure it would work with a file like the one you're describing, albeit what you're describing doesn't sound like a CSV in the conventional sense...
Re: Handling an embedded newline in an unquoted CSV field
by TJPride (Pilgrim) on Dec 10, 2011 at 14:15 UTC
    Yes, about all you can do is check if a line is short some fields, and if it is and the next line is also short, put them together. But this is further complicated by the possibility that the newlines could be inside a quoted field, in which case the CSV parsing will choke and you have to figure out a much more complicated way of counting fields. And the file could theoretically not have a header, or the header might not have names for all the fields, in which case figuring out how many fields the file -should- have is difficult as well. And what if the field with the newline is at the very end of the record and unquoted? It'll look like that record is fine and the problem starts on the next record.

    This is one of those problems I wish I had an easy answer for as well. I've run into it a number of times while doing data imports for one of my clients.

Re: Handling an embedded newline in an unquoted CSV field
by bluescreen (Friar) on Dec 10, 2011 at 14:19 UTC

    You have to change the record separator to "\r\n"

    use strict; use warnings; use 5.010; my $text = "field\n1,field2,field3\r\nfield4,field\n5,field6\n\r\n"; open my $fh, '<', \$text; local $/ = "\r\n"; while (<$fh>) { chomp; my ( $col1, $col2, $col3 ) = split ','; say "|$col1 $col2 $col3|"; }

    Of course when you print it you'll see the new lines, if you don't want that either you can replace them.

Re: Handling an embedded newline in an unquoted CSV field
by tobyink (Abbot) on Dec 10, 2011 at 18:30 UTC

    bluescreen's suggestion is a good one if your CSV is otherwise uncomplicated - i.e. no quoted fields.

    But for complicated CSV, you really do want to use something like Text::CSV. It is a shame that, even setting Text::CSV's eol option, it can't handle this situation.

    Best solution would probably be to pre-process each line, replacing the bare "\n" with another character, parse it with Text::CSV, and then replace it back again. Something like:

    use Data::Dumper; use IO::Scalar; use Text::CSV; my $data = "Alice,123\r\n" . "Bob\nSmith,456\r\n"; my $io = IO::Scalar->new(\$data); $/ = "\r\n"; my $CSV = Text::CSV->new({ eol => $/, sep_char => q{,}, binary => 1, }); while (defined(my $line = <$io>)) { $line =~ s/\n(.)/\xFF$1/g; $CSV->parse($line); my @row = map { s/\xFF/\n/g; $_ } $CSV->fields; print Dumper(\@row); }

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://942780]
Approved by sundialsvc4
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (5)
As of 2014-12-27 05:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (176 votes), past polls