Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Converting CSV to tab-delimited

by PhilHibbs (Hermit)
on Apr 14, 2008 at 13:48 UTC ( #680281=snippet: print w/ replies, xml ) Need Help??

Description: A colleague is having problems loading CSV (comma-separated variable length) data into Microsoft SSIS, so I told him I'd write a script to help him. I saw Text::CSV, but I don't want to have to go through module installation with him as he's a bit scared of Perl already. So I wrote this little script - and I didn't want to have to teach him about < and > on the command line, so the script automatically generates a file name based on the input with ".tab" on the end.

The rules are the standard Excel-style CSV rules, any embedded '"' characters get doubled up, and any value that contains a ',' character must have '"' characters delimiting the value, but other values don't need to have delimiters.

The code is a simple state machine processing one character at a time and storing two state variables based on whether an opening " has been detected and when a " has been encountered within a quoted value.

Shortcomings:

Doesn't handle newlines in quoted values

use strict;
use warnings;
# Note: doesn't handle newlines in quoted values

my $out = $ARGV[0].".tab";
open OUT,">$out" or die "Can't open output $out\n";
while (<>) {
   my $tab = "";
   my $qv=0; # Quoted value indicator
   my $dq=0; # Double quote flag indicates the previous character was 
+a "
   for (split //) {
      # Start of a quoted value
      if (not $qv and $_ eq '"') { $qv=1; next; }
      
      # Double quotes within or at the end of a quoted value
      if ($qv and $_ eq '"') { $dq=1; next; }

      # If last char was a double quotes OR we're not within a quoted 
+value, comma = tab
      if (($dq or not $qv) and $_ eq ',' ) { $dq=0; $qv=0; $_="\t"; } 
+# End of field

      # Two consecutive double-quote characters within a quoted value
      elsif ($dq and $_ eq '"') { $dq=0; } # Double double quotes
      $tab .= $_;
   }
   print OUT $tab;
}
Comment on Converting CSV to tab-delimited
Download Code
Re: Converting CSV to tab-delimited
by Tux (Monsignor) on Apr 14, 2008 at 13:59 UTC

    And you're sure that M$ doesn't export embedded new-lines, carriage-returns or other binary or special characters?

    There is a very good reason for Text::CSV (and the undelying Text::CSV_XS and Text::CSV_PP) modules to be around, and installing isn't that hard.

    cpan Text::CSV
    use strict; use warnings; my $if = shift; my ($of = $if) =~ s/\.csv$/.tab/ or die "usage: csv2tab file.csv"; open my $fh_i, "<", $if or die "$if: $!"; open my $fh_o, ">", $of or die "$of: $!"; my $csv = Text::CSV->new ({ binary => 1 }); my $tsv = Text::CSV->new ({ binary => 1, sep_char => "\t" }); while (my $row = $csv->getline ($fh_i)) { $tsv->print ($fh_o, $row); } close $fh_i or die "$if: $!"; close $fh_o or die "$of: $!"

    Enjoy, Have FUN! H.Merijn
      Sure, I know that installing Perl modules isn't that hard (once you've tracked down a make program for Windows - and once you've learned to avoid anything that needs a C compiler) but some people just don't want to know. All you have to do is accidentally say "non-standard" and a seemingly-sane business analyst starts to have nightmares about viruses or hackers. I have learned to avoid anything that doesn't come bundled with ActiveState Perl (and thank the g0ds, they've started including Term::ReadKey!!!)

      Most people here are Unix hackers, believe me working with Windows - and habitual Windows users - is really, really frustrating.

      And there are no newlines in the files - I know, it's our software that's creating them.

        Text::CVS can be installed without a C compiler, and even without resorting to the command line, via ActiveState's ppm application.

        And there are no newlines in the files - I know, it's our software that's creating them.

        I suggest amending the introduction to the code and mentioning this, in case someone would want to use the code you posted and isn't sure whether there are newlines in the code.

        Software speaks in tongues of man.
        Stop saying 'script'. Stop saying 'line-noise'.
        We have nothing to lose but our metaphors.

        All well and fine. I have a situation again where ROOTman (as we will call him) does not like anyone adding Perl modules to the production server. Though he's a Perlist himself, requests for additional modules are rebuffed as he cannot risk this machine having a hiccup. We also may be tied to an older release for quite some time, until the new hardware arrives and then we get whatever Perl comes on the RH Enterprise install.

        So, a hand-wired CSV solution is sought by those of us not in a position to "simply ppm or CPAN Text::CSV into place". Good material is sparse - even the CookBook example isn't all that great. I did track down a regex which I have needed to follow up with several checks and edits to patch things up...

        This then is a starting point (ugly/rough code):

        my @inList = split /,(?!(?:[^",]|[^"],[^"])+")/; # and further on a bit of a mess: my @outList = (); for (my $i=0; $i<$flds; $i++) { if (! defined $inList[$i] ) { $inList[$i] = ""; } if ($inList[$i] =~ m/\D/) { $inList[$i] = '"'.$inList[$i].'"'; } $inList[$i] =~ s/^""/"/; $inList[$i] =~ s/""$/"/; $inList[$i] =~ s/^"$/""/; push @outList, $inList[$i]; }

        I eventually got to a point with my data that I simply sanitize all the crap in a field like ",", "'" and """ in self defense, straight after dealing with any nulls.

        I hope this is useful for someone.

      Strawberry perl comes with a shipload of useful bundled modules and a working cpan.bat. I've been playing with it over the weekend, and the only problems I had with installing new modules from CPAN is the SSL related modules and OS specific modules like BSDresource.

      ActivePerl comes with ppm, which has most used modules available in a few keystrokes. Nothing is holding you from increasing your possibilities here. Try to imagine the time you will have to waste explaning the end-user why this oh so simple script suddenly stops working. I can assure you it is more than the time you need to convince him/her to install something good.

      We've entered an era where updating or installing basic modules that have a proven value, is made very very easy, and will pay off over writing code, as simple as it may seem, that will provide you with headaches in the future.


      Enjoy, Have FUN! H.Merijn
      And you're sure that M$ doesn't export embedded new-lines, carriage-returns or other binary or special characters?

      Fyi, line breaks in cells, which are the most common case, are exported as an LF character whereas rows are separated by CRLF. (This of course might not apply to all versions of excel.)

Re: Converting CSV to tab-delimited
by Teva (Initiate) on Jun 03, 2008 at 11:24 UTC
    Hi, I am trying to convert a text file with delimited data into a fully annotated XML SEPA (single Euro payment area) format. I have never worked with XML before and would appreciate any help. thanks,
Re: Converting CSV to tab-delimited
by ReedMeyer (Initiate) on May 24, 2010 at 19:31 UTC

    Hello,

    I just wanted to point out a bug in the original post by "PhilHibbs". That code mostly works, except for the handling of embedded double quotes. To fix this, change the line beginning with "if ($qv ..." to:

    if (not $dq and $_ eq '"') { $dq=1; next; }

    For convenience, here is a complete script that includes the bug fix and also writes the output to standard output, as opposed to writing to a file:

    #!/usr/bin/perl ## Converts an Excel-style CSV-formatted file to TAB-separated format ## Note: doesn't handle newlines in quoted values use strict; use warnings; while (<>) { my $tab = ""; my $qv=0; # Quoted value indicator my $dq=0; # Double quote flag indicates the previous character was +a " for (split //) { # Start of a quoted value if (not $qv and $_ eq '"') { $qv=1; next; } # Double quotes within or at the end of a quoted value if (not $dq and $_ eq '"') { $dq=1; next; } # If last char was a double quotes OR we're not within a quoted +value, comma = tab if (($dq or not $qv) and $_ eq ',' ) { $dq=0; $qv=0; $_="\t"; } +# End of field # Two consecutive double-quote characters within a quoted value elsif ($dq and $_ eq '"') { $dq=0; } # Double double quotes $tab .= $_; } print $tab; }

    Cheers,
    ---Reed

Back to Snippets Section

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: snippet [id://680281]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (9)
As of 2014-08-28 15:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (263 votes), past polls