http://www.perlmonks.org?node_id=1080161

ricardo_sdl has asked for the wisdom of the Perl Monks concerning the following question:

I've read about the Text File Format(https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text) and decided to create a perl script to convert a csv file to this format.

Instead of using one of the available modules(like Text::CSV) to parse the csv I decided to do it myself as matter of practice.

I'm starting to learn perl and I ask for your wisdom on how to make this code better.

Updated code after some suggestions.

#!/usr/bin/perl use 5.010; use strict; use warnings; use utf8; use open qw(:std :utf8); #Convert a csv file to text file format #https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii +-delimited-text-not-csv-or-tab-delimited-text #the csv file is expected to use double quotes as the delimiter charac +ter and utf-8 encoded sub is_complete_line { my ($line) = @_; my $inside_double_quotes; my $opening_double_quotes = 0; my $closing_double_quotes = 0; foreach(split('', $line)) { next if $_ ne '"'; if ($inside_double_quotes) { $inside_double_quotes = ! $inside_double_quotes; $closing_double_quotes++; next; } $opening_double_quotes++; $inside_double_quotes = ! $inside_double_quotes; } $opening_double_quotes == $closing_double_quotes; } sub convert_to_ttf { my ($line) = @_; my $inside_double_quotes; my $converted_line = ''; foreach(split('', $line)) { if ($_ eq '"' && $inside_double_quotes) { $inside_double_quotes = 0; next; } if ($_ eq '"' && ! $inside_double_quotes) { $inside_double_quotes = 1; next; } if ($_ eq "," && ! $inside_double_quotes) { $converted_line .= "\x{1F}";#unit separator next; } if ($_ eq "\n" && ! $inside_double_quotes) { $converted_line .= "\x{1E}";#record separator next; } $converted_line .= $_; } $converted_line; } open my $fh, '>', "output.txt" or die "Couldn't create the output file +.\n"; my $line = ''; while (<>) { $line .= $_; if (is_complete_line($line)) { print $fh convert_to_ttf($line); $line = ''; } } close $fh; print "Conversion done.\n";

Here is a link to a sample csv file that you can convert(https://db.tt/GWrID417).

Replies are listed 'Best First'.
Re: Suggestions to make this code more Perlish
by kcott (Archbishop) on Mar 30, 2014 at 02:40 UTC

    G'day ricardo_sdl,

    Welcome to the monastery.

    Here's a shorter way to do it. It's 14 lines (yours is 75 lines). It's definitely Perlish; I'll leave others to comment on whether it's more Perlish. :-)

    #!/usr/bin/env perl use 5.010; use strict; use warnings; use autodie; local $\ = chr 30; open my $csv_fh, '<', 'pm_1080161_input.csv'; open my $tff_fh, '>', 'pm_1080161_output.tff'; while (<$csv_fh>) { chomp; s/(?:"(?<a>[^"]*)"|(?<a>[^,]*)),?/$+{a}\037/g; s/[\037]+$//; print $tff_fh $_; }

    Update: Just to summarise the discussion from the six nodes that follow this, here's an improved solution:

    #!/usr/bin/env perl use 5.010; use strict; use warnings; use autodie; local $\ = chr 30; open my $csv_fh, '<', 'pm_1080161_input.csv'; open my $tff_fh, '>', 'pm_1080161_output.tff'; my $re = qr{ " (?<field> [^"]* ) " | (?<field> [^,]* ) }x; print $tff_fh $_ for map { chomp; s/$re,/$+{field}\037/g; $_ } <$csv_f +h>;

    End update.

    See perlre for documentation on any of regex code. Ask if there's anything else you don't understand.

    And here's a shorter way to view the output. It's one line vs. your 29 line script (in Re^2: Suggestions to make this code more Perlish).

    $ perl -pe 'y/\036\037/\012|/' pm_1080161_output.tff Country|Name|Address|Age Andorra|Aileen, Daquan, Hammett, Malachi|17014|69 Solomon Islands|Cyrus, Giacomo, Gretchen, Curran|76935|26 Czech Republic|Briar, Larissa, Sybil, Colin|29565|88 Japan|Wyatt, Gavin, Derek, Coby|10734|52 Gabon|Tobias, Maya, April, Quintessa|77397|95

    The rest of the output is in the spoiler:

    -- Ken

      Just for interest, here's the same approach in Perl 6:
      use v6; my regex fields { \" <( .*? )> \" | <-[,"]>* } my $input = open 'input.csv', :r; my $output = open 'output.tff', :w; for $input.lines { next unless /<fields>* % ','/; $output.print: @<fields>.join(chr 31) ~ chr 30; } $output.close;
      That's not the shortest way to do it in Perl 6, but it's closest to the Perl 5 example above.

      A shorter and more Perl6-ish version might be:

      my regex fields { \" <( .*? )> \" | <-[,"]>* } lines open 'input.csv' ==> map({@<fields>.join(chr 31) ~ chr 30 if /<fields>* % ','/}) ==> spurt('output.tff');

      Damian

        Thanks for that Damian. I'm not really across Perl6 syntax. I looked in Perl6 Regexes documentation; unfortunately, there's several sections with nothing more than "TODO", including "Alternation" and "Grouping and Capturing", so I pretty much gave up at that point. Can you suggest a better source of documentation?

        Anyway, inspired by your "shorter and more Perl6-ish version", here's a shorter and more Perl5-ish version of my original (this replaces the while loop, everything else remains the same):

        my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*))(?:,|\000)}; print $tff_fh $_ for map { chomp; s/$re/$+{a}\037/g; $_ } <$csv_fh>;

        Due to the issue described in "Repeated Patterns Matching a Zero-length Substring", I was getting '\037\037' (at the end of $_) after each 's///g': hence the 's/[\037]+$//;' to remove them.

        I found that by replacing ',?' with '(?:,|\000)', I got zero '\037' characters after the 's///g' (so the 's/[\037]+$//;' wasn't needed at all). [Note: '(?:,|)', '(?:,|$)', '(?:,|\z)' and '(?:,|\Z)' all produced '\037\037' after each 's///g'.]

        While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening. As it could be a side effect that might behave differently in another Perl version (I'm using v5.18.1), and not being able to answer the inevitable "How does this work?" question, I left it out of my original solution.

        You, or someone else, may have a quick answer. If not, I was planning to spend a bit more time looking into this and, in the absence of finding a solution, post a more generalised example with a question later in the week.

        -- Ken

      Ken, your original solution handles lines like this properly:
      Bangladesh,"Claire, Laura, Phillip, Pascale",70016,"5,6,7,8"
      whereas s/$re,/$+{field}\037/g will never dequote the last field.

      One other minor observation: \037 is an atom, so no brackets are needed in s/[\037]+$//;

        Thanks for taking the time to study my code and provide feedback.

        "your original solution handles lines like this properly:

            Bangladesh,"Claire, Laura, Phillip, Pascale",70016,"5,6,7,8"

        whereas s/$re,/$+{field}\037/g will never dequote the last field."

        I made reference to the fact that we're dealing with "very specific input" in "Re^4: Suggestions to make this code more Perlish". I also alluded to the fact that different code would be required if the input changed.

        You're correct in saying that it "will never dequote the last field"; however, the last field (in this "very specific input") is an unquoted integer, so no dequoting is required.

        "One other minor observation: \037 is an atom, ..."

        Yes, that's how I used \037 (along with \036 and \012) in:

        $ perl -pe 'y/\036\037/\012|/' pm_1080161_output.tff
        "... so no brackets are needed in s/[\037]+$//;"

        The use of brackets here was a conscious decision to make it more obvious what the '+' quantifier referred to.

        You're correct in saying that 's/\037+$//' would have worked as well as 's/[\037]+$//'.

        -- Ken

      Wow! There is a lot that I need to learn about regular expressions to even begin to understand the code. Time to hit the docs! Anyway thanks kcott, TheDamian and the others for your suggestions.

      One more question. When dealing with utf-8 files, is there no need to indicate this to the Perl interpreter? Reading the file, writing the file contents and matching using regular expressions will just work out of the box with utf-8 encoded strings?

        Probably the first thing to note is that the numeric value of 7-bit ASCII characters is the same as the UTF-8 code points for the same Unicode characters. The ASCII character "A" has the hexidecimal value 41; The UTF-8 character "A" has the hexidecimal value 41.

        The term Unicode is often used in a sense to indicate characters outside the range of 7-bit ASCII characters. This often degenerates into arguments over what was said, what was meant, what's techinically correct and so on. For the remainder of this node, assume ASCII refers to 7-bit ASCII characters and Unicode refers to UTF-8 characters outside the range of 7-bit ASCII characters.

        In the simplest case, if your Perl source code, input data and output data contain only ASCII characters, there's no need to do anything special. This was the case with your code and data here: ASCII characters were used throughout so no special pragmata or encoding directives were required.

        [For the following examples, note that the letter A has a numerical value of 65 decimal (41 hexidecimal) and the smiley face character has a numerical value of 9786 decimal (263a hexidecimal).]

        Here's a short piece of Perl code with just ASCII characters:

        $ perl -E 'say ord "A"; say sprintf "%x", ord "A"'
        65
        41
        

        Here's a similar piece of Perl code but this also includes Unicode characters:

        $ perl -E 'say ord "☺"; say sprintf "%x", ord "☺"'
        226
        e2
        

        As you can see, that second example didn't work very well: it produced unexpected results. Because the source code contained Unicode characters, you need to tell Perl this with the utf8 pragma:

        $ perl -E 'use utf8; say ord "☺"; say sprintf "%x", ord "☺"'
        9786
        263a
        

        Here's a short piece of Perl code which outputs ASCII characters:

        $ perl -E 'say chr 65; say "\x{41}"'
        A
        A
        

        Here's a similar piece of Perl code which outputs Unicode characters:

        $ perl -E 'say chr 9786; say "\x{263a}"'
        Wide character in say at -e line 1.
        ☺
        Wide character in say at -e line 1.
        ☺
        

        As you can see, that produced warnings; however, if we let Perl know to expect Unicode output with binmode, we get a better result:

        $ perl -E 'binmode STDOUT => ":utf8"; say chr 9786; say "\x{263a}"'
        ☺
        ☺
        

        Those were just trivial examples. See the documentation for details. I suggest you start with:

        And then move on to:

        [Note: In order to display Unicode characters exactly as coded or output, I've used <pre>...</pre> and <tt>...</tt> tags instead of <code>...</code> and <c>...</c> tags.]

        -- Ken

Re: Suggestions to make this code more Perlish
by NetWallah (Canon) on Mar 29, 2014 at 06:18 UTC
    In addition to the Regex suggestion - I'd suggest avoiding the @lines altogether.

    Instead of "push @lines, $line (Storing it), why don't you call convert_to_ttf($line) at that point - this way, you can avoid the store .. and .. process .

    Similarly, instead of storing @converted lines, then printing them, do that at the same place - so your code looks like:

    open my $fh ... etc while (<>) { $line .= $_; if (is_complete_line($line)) { print $fh convert_to_ttf ($line); $line = ''; } } close $fh;

            What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
                  -Larry Wall, 1992

      Thanks. I updated the code.

Re: Suggestions to make this code more Perlish
by Laurent_R (Canon) on Mar 29, 2014 at 22:27 UTC

    Alright, I sort of understand now.

    This is my "more Perlish" proposal:

    #!/usr/bin/perl use strict; use warnings; my $first_line = <DATA>; chomp $first_line; $first_line =~ s/,/chr(31)/eg; print $first_line, chr(30); while (<DATA>) { chomp; my @fields = split /,"|",/; $fields[2] =~ s/,/chr(31)/eg; print join chr(31), @fields, chr(30); } __DATA__ Country,Name,Address,Age Andorra,"Aileen, Daquan, Hammett, Malachi",17014,69 Solomon Islands,"Cyrus, Giacomo, Gretchen, Curran",76935,26 Czech Republic,"Briar, Larissa, Sybil, Colin",29565,88 (...)
    As such, the program produces the same strange looking output as the one I've shown before:
    CountryNameAddressAgeAndorraAileen, Daquan, Hammett, Malachi1701469Solomon IslandsCyrus, Giacomo, Gretchen, Curran7693526Czech RepublicBriar,
    But once piped to the other program that you have shown, this produces, I believe, the same output as your program piped into the same second program:
    Country|Name|Address|Age Andorra|Aileen, Daquan, Hammett, Malachi|17014|69| Solomon Islands|Cyrus, Giacomo, Gretchen, Curran|76935|26| Czech Republic|Briar, Larissa, Sybil, Colin|29565|88| Japan|Wyatt, Gavin, Derek, Coby|10734|52| (...)
    The full output is displayed below:

    As you can see, "more Perlish" often means much shorter code (although I really did not try to make it particularly concise and I decided, for simplicity, to process the header line separately. If you ignore that and forget about boiler plate code at the beginning, the real program holds in 5 lines). I also think that, on very large input, my version is most probably considerably faster.

    I'll be glad to give more information if there is anything you don't understand.

      That's a good solution but it has a couple of rough edges.

      I don't think the first line needs to be treated separately. You could remove these four lines:

      my $first_line = <DATA>; chomp $first_line; $first_line =~ s/,/chr(31)/eg; print $first_line, chr(30);

      Your join is joining the fields and the record separator. Just add parentheses to

      print join chr(31), @fields, chr(30);

      like this

      print join(chr(31), @fields), chr(30);

      That way, the field separators will just separate the fields. :-)

      You're also not adequately handling a quoted field at the start of a record or two quoted fields adjacent to each other. Here's two examples from your output (I think these are the only ones):

      ... "Bonaire, Sint Eustatius and Saba|"Charissa, Lana, Liberty, Quail|4451 +7|18| ... "Virgin Islands, British|"Otto, Macon, Caldwell, Sasha|87676|49| ...

      Here's the matching lines from the input:

      ... "Bonaire, Sint Eustatius and Saba","Charissa, Lana, Liberty, Quail",44 +517,18 ... "Virgin Islands, British","Otto, Macon, Caldwell, Sasha",87676,49 ...

      -- Ken

        Thanks for your comments, Ken. Yes, you are right, this is a 10-minutes solution, certainly not a polished one.

        I needed to process the first line differently because of the way I chose to process the other lines, which would not work for the first one, but it is certainly possible to find another way to process the lines that would also work for the first one. However, when I have a header line that needs to be processed differently than the rest of the file, I often prefer to process it before starting to loop on the rest of the file, because the algorithm is then simpler (and often faster, which matters if the file is large).

        You are right on the "join" line, it adds a field separator at the end of the records. That did not shock me, but it is indeed different from the output produced by the code in the original post. Adding parens (as per your proposal) solves the issue.

        I had not even seen that there were two "irregular" lines with quoted fields at the beginning of the records in the input data, and that of course is a serious problem because it probably means that the whole algorithm has to be modified. BTW, this is a good example of why using a module such as Text::CSV is often better than doing one's own solution.

Re: Suggestions to make this code more Perlish
by Laurent_R (Canon) on Mar 29, 2014 at 16:24 UTC
    Hi, I think that what you are trying to do could be done in a dozen lines of code, or so (or even less), and I would gladly suggest a short "Perlish" solution, but I can't make sense of the messy output format generated by your code with your input data:
    CountryNameAddressAgeAndorraAileen, Daquan, Hammett, Malachi1701469Solomon IslandsCyrus, Giacomo, Gretchen, Curran7693526Czech RepublicBriar, (...)
    (Carriage returns added to the generated output to avoid one very very long line.) This is surely not the output that you want. Please show the output that you really need.

      That is the intended output. The ascii character 30(Record Separator) is used as the separator between each line. The ascii character 31(Unit Separator) is used as the separator between each column.

      You can use this script to show the output using new lines and pipe characters which make it easier to read.

      #!/usr/bin/perl use 5.010; use strict; use warnings; use utf8; use open qw(:std :utf8); #Print a text file format to standard output. #https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii +-delimited-text-not-csv-or-tab-delimited-text sub print_line { my ($line) = @_; foreach(split('', $line)) { if ($_ eq "\x{1F}") { print "|"; next; } elsif($_ eq "\x{1E}") { print "\n"; next; } print $_; } } while (<>) { print_line($_); print "\n"; }
Re: Suggestions to make this code more Perlish
by Anonymous Monk on Mar 28, 2014 at 21:41 UTC

      Thank you. I'll take a look at the regular expressions tutorials and try to change my code.