Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

Re^2: Suggestions to make this code more Perlish

by TheDamian (Vicar)
on Mar 30, 2014 at 11:25 UTC ( [id://1080278]=note: print w/replies, xml ) Need Help??


in reply to Re: Suggestions to make this code more Perlish
in thread Suggestions to make this code more Perlish

Just for interest, here's the same approach in Perl 6:
use v6; my regex fields { \" <( .*? )> \" | <-[,"]>* } my $input = open 'input.csv', :r; my $output = open 'output.tff', :w; for $input.lines { next unless /<fields>* % ','/; $output.print: @<fields>.join(chr 31) ~ chr 30; } $output.close;
That's not the shortest way to do it in Perl 6, but it's closest to the Perl 5 example above.

A shorter and more Perl6-ish version might be:

my regex fields { \" <( .*? )> \" | <-[,"]>* } lines open 'input.csv' ==> map({@<fields>.join(chr 31) ~ chr 30 if /<fields>* % ','/}) ==> spurt('output.tff');

Damian

Replies are listed 'Best First'.
Re^3: Suggestions to make this code more Perlish
by kcott (Archbishop) on Mar 30, 2014 at 15:08 UTC

    Thanks for that Damian. I'm not really across Perl6 syntax. I looked in Perl6 Regexes documentation; unfortunately, there's several sections with nothing more than "TODO", including "Alternation" and "Grouping and Capturing", so I pretty much gave up at that point. Can you suggest a better source of documentation?

    Anyway, inspired by your "shorter and more Perl6-ish version", here's a shorter and more Perl5-ish version of my original (this replaces the while loop, everything else remains the same):

    my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*))(?:,|\000)}; print $tff_fh $_ for map { chomp; s/$re/$+{a}\037/g; $_ } <$csv_fh>;

    Due to the issue described in "Repeated Patterns Matching a Zero-length Substring", I was getting '\037\037' (at the end of $_) after each 's///g': hence the 's/[\037]+$//;' to remove them.

    I found that by replacing ',?' with '(?:,|\000)', I got zero '\037' characters after the 's///g' (so the 's/[\037]+$//;' wasn't needed at all). [Note: '(?:,|)', '(?:,|$)', '(?:,|\z)' and '(?:,|\Z)' all produced '\037\037' after each 's///g'.]

    While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening. As it could be a side effect that might behave differently in another Perl version (I'm using v5.18.1), and not being able to answer the inevitable "How does this work?" question, I left it out of my original solution.

    You, or someone else, may have a quick answer. If not, I was planning to spend a bit more time looking into this and, in the absence of finding a solution, post a more generalised example with a question later in the week.

    -- Ken

      Hi Ken,

      The best place to read up about Perl 6 regexes is the specification itself.

      You mused:

      While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening.

      No, it's not anything to do with C string terminators.

      The problem with your previous version was that you were matching an optional comma at the end of each field and then replacing it with a definite "\037" every time. So, for the last field in each record (which, of course, isn't followed by a comma), your were nevertheless appending an unwanted "\037".

      The global substitution would then loop one last time, matching a final zero-character field (because of the (?<a>[^,]*) alternative, which can match nothing). The substitution on that empty field then causes a second unnecessary "\037" to be appended.

      You could fix that by rewriting your original version something like this:

      open my $csv_fh, '<', 'input.csv'; open my $tff_fh, '>', 'output.tff'; my $field = qr{ " (?<field> [^"]* ) " | (?<field> [^,"]* ) }x; while (my $line = <$csv_fh>) { $line =~ s{ $field (?<comma> ,?) } { $+{field} . ($+{comma} && chr 31) }gxe; $line =~ s{\n}{chr 30}xe; print {$tff_fh} $line; }

      This version still matches the optional comma each time, but now only appends a "\037" if there actually was a comma. Which means there are no extras to remove, once the line is complete.

      Note that I also removed the chomp and replaced it with an explicit substitution of the trailing newline. I felt that this highlights the transformation more clearly than did your clever (but subtle and "at-a-distance") use of $\.

      Damian

        Thanks for the documentation link. That certainly has a lot more text than I found where I was previously looking: and no "TODO"s in sight. I have a bit of reading ahead of me.

        Yes, I was aware of why I was getting two \037 characters at the end (in the first solution). What I haven't figured out yet is why I was getting zero \037 characters at the end (when I changed ',?' to '(?:,|\000)' — in the second solution).

        Thanks also for the additional feedback: much appreciated.

        -- Ken

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1080278]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (4)
As of 2025-05-23 01:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.