in reply to Suggestions to make this code more Perlish
G'day ricardo_sdl,
Welcome to the monastery.
Here's a shorter way to do it. It's 14 lines (yours is 75 lines).
It's definitely Perlish;
I'll leave others to comment on whether it's more Perlish. :-)
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use autodie;
local $\ = chr 30;
open my $csv_fh, '<', 'pm_1080161_input.csv';
open my $tff_fh, '>', 'pm_1080161_output.tff';
while (<$csv_fh>) {
chomp;
s/(?:"(?<a>[^"]*)"|(?<a>[^,]*)),?/$+{a}\037/g;
s/[\037]+$//;
print $tff_fh $_;
}
Update:
Just to summarise the discussion from the six nodes that follow this, here's an improved solution:
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use autodie;
local $\ = chr 30;
open my $csv_fh, '<', 'pm_1080161_input.csv';
open my $tff_fh, '>', 'pm_1080161_output.tff';
my $re = qr{ " (?<field> [^"]* ) " | (?<field> [^,]* ) }x;
print $tff_fh $_ for map { chomp; s/$re,/$+{field}\037/g; $_ } <$csv_f
+h>;
End update.
See perlre for documentation on any of regex code.
Ask if there's anything else you don't understand.
And here's a shorter way to view the output.
It's one line vs. your 29 line script (in Re^2: Suggestions to make this code more Perlish).
$ perl -pe 'y/\036\037/\012|/' pm_1080161_output.tff
Country|Name|Address|Age
Andorra|Aileen, Daquan, Hammett, Malachi|17014|69
Solomon Islands|Cyrus, Giacomo, Gretchen, Curran|76935|26
Czech Republic|Briar, Larissa, Sybil, Colin|29565|88
Japan|Wyatt, Gavin, Derek, Coby|10734|52
Gabon|Tobias, Maya, April, Quintessa|77397|95
The rest of the output is in the spoiler:
Re^2: Suggestions to make this code more Perlish
by TheDamian (Vicar) on Mar 30, 2014 at 11:25 UTC
|
Just for interest, here's the same approach in Perl 6:
use v6;
my regex fields { \" <( .*? )> \" | <-[,"]>* }
my $input = open 'input.csv', :r;
my $output = open 'output.tff', :w;
for $input.lines {
next unless /<fields>* % ','/;
$output.print: @<fields>.join(chr 31) ~ chr 30;
}
$output.close;
That's not the shortest way to do it in Perl 6, but it's closest to the Perl 5 example above.
A shorter and more Perl6-ish version might be:
my regex fields { \" <( .*? )> \" | <-[,"]>* }
lines open 'input.csv'
==> map({@<fields>.join(chr 31) ~ chr 30 if /<fields>* % ','/})
==> spurt('output.tff');
Damian
| [reply] [d/l] [select] |
|
Thanks for that Damian.
I'm not really across Perl6 syntax.
I looked in Perl6 Regexes documentation; unfortunately, there's several sections with nothing more than "TODO", including "Alternation" and "Grouping and Capturing", so I pretty much gave up at that point.
Can you suggest a better source of documentation?
Anyway, inspired by your "shorter and more Perl6-ish version", here's a shorter and more Perl5-ish version of my original (this replaces the while loop, everything else remains the same):
my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*))(?:,|\000)};
print $tff_fh $_ for map { chomp; s/$re/$+{a}\037/g; $_ } <$csv_fh>;
Due to the issue described in "Repeated Patterns Matching a Zero-length Substring", I was getting '\037\037' (at the end of $_) after each 's///g': hence the 's/[\037]+$//;' to remove them.
I found that by replacing ',?' with '(?:,|\000)', I got zero '\037' characters after the 's///g' (so the 's/[\037]+$//;' wasn't needed at all).
[Note: '(?:,|)', '(?:,|$)', '(?:,|\z)' and '(?:,|\Z)' all produced '\037\037' after each 's///g'.]
While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening.
As it could be a side effect that might behave differently in another Perl version (I'm using v5.18.1), and not being able to answer the inevitable "How does this work?" question, I left it out of my original solution.
You, or someone else, may have a quick answer.
If not, I was planning to spend a bit more time looking into this and, in the absence of finding a solution, post a more generalised example with a question later in the week.
| [reply] [d/l] [select] |
|
Hi Ken,
The best place to read up about Perl 6 regexes is the specification itself.
You mused:
While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening.
No, it's not anything to do with C string terminators.
The problem with your previous version was that you were matching an optional comma at the end of each field and then replacing it with a definite "\037" every time. So, for the last field in each record (which, of course, isn't followed by a comma), your were nevertheless appending an unwanted "\037".
The global substitution would then loop one last time, matching a final zero-character field (because of the (?<a>[^,]*) alternative, which can match nothing). The substitution on that empty field then causes a second unnecessary "\037" to be appended.
You could fix that by rewriting your original version something like this:
open my $csv_fh, '<', 'input.csv';
open my $tff_fh, '>', 'output.tff';
my $field = qr{ " (?<field> [^"]* ) " | (?<field> [^,"]* ) }x;
while (my $line = <$csv_fh>) {
$line =~ s{ $field (?<comma> ,?) }
{ $+{field} . ($+{comma} && chr 31) }gxe;
$line =~ s{\n}{chr 30}xe;
print {$tff_fh} $line;
}
This version still matches the optional comma each time, but now only appends a "\037" if there actually was a comma. Which means there are no extras to remove, once the line is complete.
Note that I also removed the chomp and replaced it with an explicit substitution of the trailing newline. I felt that this highlights the transformation more clearly than did your clever (but subtle and "at-a-distance") use of $\.
Damian
| [reply] [d/l] [select] |
|
|
|
Re^2: Suggestions to make this code more Perlish
by hazylife (Monk) on Mar 31, 2014 at 10:50 UTC
|
Ken, your original solution handles lines like this properly:
Bangladesh,"Claire, Laura, Phillip, Pascale",70016,"5,6,7,8"
whereas s/$re,/$+{field}\037/g will never dequote the last field.
One other minor observation: \037 is an atom, so no brackets are needed in s/[\037]+$//; | [reply] [d/l] [select] |
|
Thanks for taking the time to study my code and provide feedback.
"your original solution handles lines like this properly:
Bangladesh,"Claire, Laura, Phillip, Pascale",70016,"5,6,7,8"
whereas s/$re,/$+{field}\037/g will never dequote the last field."
I made reference to the fact that we're dealing with "very specific input" in "Re^4: Suggestions to make this code more Perlish".
I also alluded to the fact that different code would be required if the input changed.
You're correct in saying that it "will never dequote the last field"; however, the last field (in this "very specific input") is an unquoted integer, so no dequoting is required.
"One other minor observation: \037 is an atom, ..."
Yes, that's how I used \037 (along with \036 and \012) in:
$ perl -pe 'y/\036\037/\012|/' pm_1080161_output.tff
"... so no brackets are needed in s/[\037]+$//;"
The use of brackets here was a conscious decision to make it more obvious what the '+' quantifier referred to.
You're correct in saying that 's/\037+$//' would have worked as well as 's/[\037]+$//'.
| [reply] [d/l] [select] |
Re^2: Suggestions to make this code more Perlish
by ricardo_sdl (Novice) on Mar 31, 2014 at 16:04 UTC
|
Wow! There is a lot that I need to learn about regular expressions to even begin to understand the code. Time to hit the docs! Anyway thanks kcott, TheDamian and the others for your suggestions.
| [reply] |
Re^2: Suggestions to make this code more Perlish
by ricardo_sdl (Novice) on Mar 31, 2014 at 19:53 UTC
|
One more question. When dealing with utf-8 files, is there no need to indicate this to the Perl interpreter? Reading the file, writing the file contents and matching using regular expressions will just work out of the box with utf-8 encoded strings?
| [reply] |
|
Probably the first thing to note is that the numeric value of 7-bit ASCII characters is the same as the UTF-8 code points for the same Unicode characters.
The ASCII character "A" has the hexidecimal value 41; The UTF-8 character "A" has the hexidecimal value 41.
The term Unicode is often used in a sense to indicate characters outside the range of 7-bit ASCII characters.
This often degenerates into arguments over what was said, what was meant, what's techinically correct and so on.
For the remainder of this node, assume ASCII refers to 7-bit ASCII characters and Unicode refers to UTF-8 characters outside the range of 7-bit ASCII characters.
In the simplest case, if your Perl source code, input data and output data contain only ASCII characters, there's no need to do anything special.
This was the case with your code and data here: ASCII characters were used throughout so no special pragmata or encoding directives were required.
[For the following examples, note that the letter A has a numerical value of 65 decimal (41 hexidecimal) and the smiley face character ☺ has a numerical value of 9786 decimal (263a hexidecimal).]
Here's a short piece of Perl code with just ASCII characters:
$ perl -E 'say ord "A"; say sprintf "%x", ord "A"'
65
41
Here's a similar piece of Perl code but this also includes Unicode characters:
$ perl -E 'say ord "☺"; say sprintf "%x", ord "☺"'
226
e2
As you can see, that second example didn't work very well: it produced unexpected results.
Because the source code contained Unicode characters, you need to tell Perl this with the utf8 pragma:
$ perl -E 'use utf8; say ord "☺"; say sprintf "%x", ord "☺"'
9786
263a
Here's a short piece of Perl code which outputs ASCII characters:
$ perl -E 'say chr 65; say "\x{41}"'
A
A
Here's a similar piece of Perl code which outputs Unicode characters:
$ perl -E 'say chr 9786; say "\x{263a}"'
Wide character in say at -e line 1.
☺
Wide character in say at -e line 1.
☺
As you can see, that produced warnings; however, if we let Perl know to expect Unicode output with binmode, we get a better result:
$ perl -E 'binmode STDOUT => ":utf8"; say chr 9786; say "\x{263a}"'
☺
☺
Those were just trivial examples.
See the documentation for details.
I suggest you start with:
And then move on to:
[Note: In order to display Unicode characters exactly as coded or output, I've used <pre>...</pre> and <tt>...</tt> tags instead of <code>...</code> and <c>...</c> tags.]
| [reply] [d/l] [select] |
|
|