http://www.perlmonks.org?node_id=1080299


in reply to Re^2: Suggestions to make this code more Perlish
in thread Suggestions to make this code more Perlish

Thanks for that, Damian. I'm not really across Perl6 syntax. I looked in the Perl6 Regexes documentation; unfortunately, there are several sections with nothing more than "TODO", including "Alternation" and "Grouping and Capturing", so I pretty much gave up at that point. Can you suggest a better source of documentation?

Anyway, inspired by your "shorter and more Perl6-ish version", here's a shorter and more Perl5-ish version of my original (this replaces the while loop, everything else remains the same):

    my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*))(?:,|\000)};
    print $tff_fh $_ for map { chomp; s/$re/$+{a}\037/g; $_ } <$csv_fh>;

Due to the issue described in "Repeated Patterns Matching a Zero-length Substring", I was getting '\037\037' (at the end of $_) after each 's///g': hence the 's/[\037]+$//;' to remove them.

I found that by replacing ',?' with '(?:,|\000)', I got zero '\037' characters after the 's///g' (so the 's/[\037]+$//;' wasn't needed at all). [Note: '(?:,|)', '(?:,|$)', '(?:,|\z)' and '(?:,|\Z)' all produced '\037\037' after each 's///g'.]

While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening. Since it could be a side effect that might behave differently in another Perl version (I'm using v5.18.1), and since I couldn't answer the inevitable "How does this work?" question, I left it out of my original solution.

You, or someone else, may have a quick answer. If not, I was planning to spend a bit more time looking into this and, in the absence of finding a solution, post a more generalised example with a question later in the week.

-- Ken

Re^4: Suggestions to make this code more Perlish
by TheDamian (Vicar) on Mar 30, 2014 at 19:43 UTC
    Hi Ken,

    The best place to read up about Perl 6 regexes is the specification itself.

    You mused:

    While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening.

    No, it's not anything to do with C string terminators.

    The problem with your previous version was that you were matching an optional comma at the end of each field and then replacing it with a definite "\037" every time. So, for the last field in each record (which, of course, isn't followed by a comma), you were nevertheless appending an unwanted "\037".

    The global substitution would then loop one last time, matching a final zero-character field (because of the (?<a>[^,]*) alternative, which can match nothing). The substitution on that empty field then causes a second unnecessary "\037" to be appended.
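The two-step effect described above can be reproduced on a single string. This is only a sketch: the `,?` pattern is reconstructed from the discussion rather than copied from the original post, and the sub name `convert_optional_comma` and the sample line are hypothetical.

```perl
use strict;
use warnings;

# Assumed reconstruction of the earlier pattern: a field followed by an
# OPTIONAL comma, so the whole pattern is able to match the empty string.
my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*)),?};

# Hypothetical helper: apply the substitution to one (already chomped) line.
sub convert_optional_comma {
    my ($line) = @_;
    $line =~ s/$re/$+{a}\037/g;
    return $line;
}

# Make the unit-separator characters visible before printing.
(my $visible = convert_optional_comma('a,"b,c",d')) =~ s/\037/<US>/g;
print "$visible\n";    # prints: a<US>b,c<US>d<US><US>
```

The first trailing `<US>` comes from the last real field `d` (no comma, but "\037" is appended anyway); the second comes from the extra zero-length match at the end of the string.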

    You could fix that by rewriting your original version something like this:

        open my $csv_fh, '<', 'input.csv';
        open my $tff_fh, '>', 'output.tff';

        my $field = qr{ " (?<field> [^"]* ) " | (?<field> [^,"]* ) }x;

        while (my $line = <$csv_fh>) {
            $line =~ s{ $field (?<comma> ,?) }
                      { $+{field} . ($+{comma} && chr 31) }gxe;
            $line =~ s{\n}{chr 30}xe;
            print {$tff_fh} $line;
        }

    This version still matches the optional comma each time, but now only appends a "\037" if there actually was a comma. Which means there are no extras to remove, once the line is complete.

    Note that I also removed the chomp and replaced it with an explicit substitution of the trailing newline. I felt that this highlights the transformation more clearly than did your clever (but subtle and "at-a-distance") use of $\.
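To try the two substitutions without any files, they can be wrapped in a small sub and run on an in-memory string. This is a sketch only: the sub name `csv_line_to_tff` and the sample line are assumptions added for illustration; the pattern and both substitutions are taken from the code above.

```perl
use strict;
use warnings;

my $field = qr{ " (?<field> [^"]* ) " | (?<field> [^,"]* ) }x;

# Hypothetical wrapper around the two substitutions from the loop body.
sub csv_line_to_tff {
    my ($line) = @_;

    # Append chr 31 only when a comma was actually matched; when the comma
    # capture is empty, the && expression yields the empty string.
    $line =~ s{ $field (?<comma> ,?) }
              { $+{field} . ($+{comma} && chr 31) }gxe;

    # Replace the trailing newline with the record separator, chr 30.
    $line =~ s{\n}{chr 30}xe;
    return $line;
}

my $out = csv_line_to_tff(qq{a,"b,c",d\n});
$out =~ s/\037/<US>/g;    # make the control characters visible
$out =~ s/\036/<RS>/g;
print "$out\n";           # prints: a<US>b,c<US>d<RS>
```

The final zero-length match still happens, but it now contributes an empty field and an empty comma, so nothing is appended.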

    Damian

      Thanks for the documentation link. That certainly has a lot more text than I found where I was previously looking: and no "TODO"s in sight. I have a bit of reading ahead of me.

      Yes, I was aware of why I was getting two \037 characters at the end (in the first solution). What I haven't figured out yet is why I was getting zero \037 characters at the end (when I changed ',?' to '(?:,|\000)' — in the second solution).

      Thanks also for the additional feedback: much appreciated.

      -- Ken

        What I haven't figured out yet is why I was getting zero \037 characters at the end (when I changed ',?' to '(?:,|\000)' — in the second solution).

        My apologies for misinterpreting your implied question.

        The reason your second solution produces zero trailing "\037" characters is that (?:,|\000) can never match nothing. It matches either a trailing comma or a trailing null character. So on the very last field (which has neither a trailing comma nor a trailing null byte), your pattern didn't match at all, so the last field was never rewritten, and hence no extra "\037" was added after it.

        And, because that final field failed to match, the global matching sequence was terminated at that point, so the regex didn't do that one extra "match an empty field at the end" iteration, which was previously adding the second "\037".
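A small sketch of this behaviour, using the pattern from the second solution on an in-memory string. The sub name `convert_required_separator` and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Pattern from the second solution: the separator must be a comma or a
# NUL, so the pattern as a whole can never match the empty string.
my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*))(?:,|\000)};

# Hypothetical helper: apply the substitution to one (already chomped) line.
sub convert_required_separator {
    my ($line) = @_;
    $line =~ s/$re/$+{a}\037/g;
    return $line;
}

(my $visible = convert_required_separator('a,"b,c",d')) =~ s/\037/<US>/g;
print "$visible\n";    # prints: a<US>b,c<US>d
```

One consequence worth checking on real data: because the last field is never rewritten, a quoted last field appears to keep its quotes (for example, `a,"b"` becomes `a<US>"b"` with this helper).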

        Technically, the use of '(?:,|\000)' introduced a bug, as it could then treat an embedded null as a field separator (for instance, one inside the final field of a line). Granted, it is quite unlikely to find an embedded null in a CSV file, but not impossible.

        If you wanted to keep using this approach, you could avoid that nasty edge-case by replacing the (?:,|\000) subpattern with a simple comma:

            my $re = qr{ (?: "(?<a>[^"]*)" | (?<a>[^,]*) ), }x;
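The difference between the two patterns can be seen on a line with a null byte in its last field. This is a sketch under assumptions: the helper names `convert_with` and `visible` and the sample input are hypothetical, and only this one input shape has been checked.

```perl
use strict;
use warnings;

my $with_nul   = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*))(?:,|\000)};
my $comma_only = qr{ (?: "(?<a>[^"]*)" | (?<a>[^,]*) ), }x;

# Hypothetical helper: run the same substitution with a given pattern.
sub convert_with {
    my ($re, $line) = @_;
    $line =~ s/$re/$+{a}\037/g;
    return $line;
}

# Hypothetical helper: make the control characters visible.
sub visible {
    my ($text) = @_;
    $text =~ s/\037/<US>/g;
    $text =~ s/\000/<NUL>/g;
    return $text;
}

my $line = "a,b\0c";    # a null byte embedded in the last field

my $old = visible(convert_with($with_nul,   $line));
my $new = visible(convert_with($comma_only, $line));
print "$old\n";    # prints: a<US>b<US>c
print "$new\n";    # prints: a<US>b<NUL>c
```

With the comma-only pattern the null byte is passed through as ordinary field data instead of being consumed as a separator.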

        Damian