Re: Suggestions to make this code more Perlish
by kcott (Archbishop) on Mar 30, 2014 at 02:40 UTC
|
G'day ricardo_sdl,
Welcome to the monastery.
Here's a shorter way to do it. It's 14 lines (yours is 75 lines).
It's definitely Perlish;
I'll leave others to comment on whether it's more Perlish. :-)
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use autodie;
local $\ = chr 30;
open my $csv_fh, '<', 'pm_1080161_input.csv';
open my $tff_fh, '>', 'pm_1080161_output.tff';
while (<$csv_fh>) {
chomp;
s/(?:"(?<a>[^"]*)"|(?<a>[^,]*)),?/$+{a}\037/g;
s/[\037]+$//;
print $tff_fh $_;
}
Update:
Just to summarise the discussion from the six nodes that follow this, here's an improved solution:
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use autodie;
local $\ = chr 30;
open my $csv_fh, '<', 'pm_1080161_input.csv';
open my $tff_fh, '>', 'pm_1080161_output.tff';
my $re = qr{ " (?<field> [^"]* ) " | (?<field> [^,]* ) }x;
print $tff_fh $_ for map { chomp; s/$re,/$+{field}\037/g; $_ } <$csv_f
+h>;
End update.
See perlre for documentation on any of regex code.
Ask if there's anything else you don't understand.
And here's a shorter way to view the output.
It's one line vs. your 29 line script (in Re^2: Suggestions to make this code more Perlish).
$ perl -pe 'y/\036\037/\012|/' pm_1080161_output.tff
Country|Name|Address|Age
Andorra|Aileen, Daquan, Hammett, Malachi|17014|69
Solomon Islands|Cyrus, Giacomo, Gretchen, Curran|76935|26
Czech Republic|Briar, Larissa, Sybil, Colin|29565|88
Japan|Wyatt, Gavin, Derek, Coby|10734|52
Gabon|Tobias, Maya, April, Quintessa|77397|95
The rest of the output is in the spoiler:
| [reply] [d/l] [select] |
|
Just for interest, here's the same approach in Perl 6:
use v6;
my regex fields { \" <( .*? )> \" | <-[,"]>* }
my $input = open 'input.csv', :r;
my $output = open 'output.tff', :w;
for $input.lines {
next unless /<fields>* % ','/;
$output.print: @<fields>.join(chr 31) ~ chr 30;
}
$output.close;
That's not the shortest way to do it in Perl 6, but it's closest to the Perl 5 example above.
A shorter and more Perl6-ish version might be:
my regex fields { \" <( .*? )> \" | <-[,"]>* }
lines open 'input.csv'
==> map({@<fields>.join(chr 31) ~ chr 30 if /<fields>* % ','/})
==> spurt('output.tff');
Damian
| [reply] [d/l] [select] |
|
Thanks for that Damian.
I'm not really across Perl6 syntax.
I looked in Perl6 Regexes documentation; unfortunately, there's several sections with nothing more than "TODO", including "Alternation" and "Grouping and Capturing", so I pretty much gave up at that point.
Can you suggest a better source of documentation?
Anyway, inspired by your "shorter and more Perl6-ish version", here's a shorter and more Perl5-ish version of my original (this replaces the while loop, everything else remains the same):
my $re = qr{(?:"(?<a>[^"]*)"|(?<a>[^,]*))(?:,|\000)};
print $tff_fh $_ for map { chomp; s/$re/$+{a}\037/g; $_ } <$csv_fh>;
Due to the issue described in "Repeated Patterns Matching a Zero-length Substring", I was getting '\037\037' (at the end of $_) after each 's///g': hence the 's/[\037]+$//;' to remove them.
I found that by replacing ',?' with '(?:,|\000)', I got zero '\037' characters after the 's///g' (so the 's/[\037]+$//;' wasn't needed at all).
[Note: '(?:,|)', '(?:,|$)', '(?:,|\z)' and '(?:,|\Z)' all produced '\037\037' after each 's///g'.]
While I suspect this has something to do with '\0' terminated strings in C, I don't fully understand what's happening.
As it could be a side effect that might behave differently in another Perl version (I'm using v5.18.1), and not being able to answer the inevitable "How does this work?" question, I left it out of my original solution.
You, or someone else, may have a quick answer.
If not, I was planning to spend a bit more time looking into this and, in the absence of finding a solution, post a more generalised example with a question later in the week.
| [reply] [d/l] [select] |
|
|
|
|
Ken, your original solution handles lines like this properly:
Bangladesh,"Claire, Laura, Phillip, Pascale",70016,"5,6,7,8"
whereas s/$re,/$+{field}\037/g will never dequote the last field.
One other minor observation: \037 is an atom, so no brackets are needed in s/[\037]+$//; | [reply] [d/l] [select] |
|
Thanks for taking the time to study my code and provide feedback.
"your original solution handles lines like this properly:
Bangladesh,"Claire, Laura, Phillip, Pascale",70016,"5,6,7,8"
whereas s/$re,/$+{field}\037/g will never dequote the last field."
I made reference to the fact that we're dealing with "very specific input" in "Re^4: Suggestions to make this code more Perlish".
I also alluded to the fact that different code would be required if the input changed.
You're correct in saying that it "will never dequote the last field"; however, the last field (in this "very specific input") is an unquoted integer, so no dequoting is required.
"One other minor observation: \037 is an atom, ..."
Yes, that's how I used \037 (along with \036 and \012) in:
$ perl -pe 'y/\036\037/\012|/' pm_1080161_output.tff
"... so no brackets are needed in s/[\037]+$//;"
The use of brackets here was a conscious decision to make it more obvious what the '+' quantifier referred to.
You're correct in saying that 's/\037+$//' would have worked as well as 's/[\037]+$//'.
| [reply] [d/l] [select] |
|
| [reply] |
|
One more question. When dealing with utf-8 files, is there no need to indicate this to the Perl interpreter? Reading the file, writing the file contents and matching using regular expressions will just work out of the box with utf-8 encoded strings?
| [reply] |
|
Probably the first thing to note is that the numeric value of 7-bit ASCII characters is the same as the UTF-8 code points for the same Unicode characters.
The ASCII character "A" has the hexidecimal value 41; The UTF-8 character "A" has the hexidecimal value 41.
The term Unicode is often used in a sense to indicate characters outside the range of 7-bit ASCII characters.
This often degenerates into arguments over what was said, what was meant, what's techinically correct and so on.
For the remainder of this node, assume ASCII refers to 7-bit ASCII characters and Unicode refers to UTF-8 characters outside the range of 7-bit ASCII characters.
In the simplest case, if your Perl source code, input data and output data contain only ASCII characters, there's no need to do anything special.
This was the case with your code and data here: ASCII characters were used throughout so no special pragmata or encoding directives were required.
[For the following examples, note that the letter A has a numerical value of 65 decimal (41 hexidecimal) and the smiley face character ☺ has a numerical value of 9786 decimal (263a hexidecimal).]
Here's a short piece of Perl code with just ASCII characters:
$ perl -E 'say ord "A"; say sprintf "%x", ord "A"'
65
41
Here's a similar piece of Perl code but this also includes Unicode characters:
$ perl -E 'say ord "☺"; say sprintf "%x", ord "☺"'
226
e2
As you can see, that second example didn't work very well: it produced unexpected results.
Because the source code contained Unicode characters, you need to tell Perl this with the utf8 pragma:
$ perl -E 'use utf8; say ord "☺"; say sprintf "%x", ord "☺"'
9786
263a
Here's a short piece of Perl code which outputs ASCII characters:
$ perl -E 'say chr 65; say "\x{41}"'
A
A
Here's a similar piece of Perl code which outputs Unicode characters:
$ perl -E 'say chr 9786; say "\x{263a}"'
Wide character in say at -e line 1.
☺
Wide character in say at -e line 1.
☺
As you can see, that produced warnings; however, if we let Perl know to expect Unicode output with binmode, we get a better result:
$ perl -E 'binmode STDOUT => ":utf8"; say chr 9786; say "\x{263a}"'
☺
☺
Those were just trivial examples.
See the documentation for details.
I suggest you start with:
And then move on to:
[Note: In order to display Unicode characters exactly as coded or output, I've used <pre>...</pre> and <tt>...</tt> tags instead of <code>...</code> and <c>...</c> tags.]
| [reply] [d/l] [select] |
Re: Suggestions to make this code more Perlish
by NetWallah (Canon) on Mar 29, 2014 at 06:18 UTC
|
In addition to the Regex suggestion - I'd suggest avoiding the @lines altogether.
Instead of "push @lines, $line (Storing it), why don't you call convert_to_ttf($line) at that point - this way, you can avoid the store .. and .. process .
Similarly, instead of storing @converted lines, then printing them, do that at the same place - so your code looks like:
open my $fh ... etc
while (<>) {
$line .= $_;
if (is_complete_line($line)) {
print $fh convert_to_ttf ($line);
$line = '';
}
}
close $fh;
What is the sound of Perl? Is it not the sound of a wall that people have stopped banging their heads against?
-Larry Wall, 1992
| [reply] [d/l] |
|
| [reply] |
Re: Suggestions to make this code more Perlish
by Laurent_R (Canon) on Mar 29, 2014 at 22:27 UTC
|
Alright, I sort of understand now.
This is my "more Perlish" proposal:
#!/usr/bin/perl
use strict;
use warnings;
my $first_line = <DATA>;
chomp $first_line;
$first_line =~ s/,/chr(31)/eg;
print $first_line, chr(30);
while (<DATA>) {
chomp;
my @fields = split /,"|",/;
$fields[2] =~ s/,/chr(31)/eg;
print join chr(31), @fields, chr(30);
}
__DATA__
Country,Name,Address,Age
Andorra,"Aileen, Daquan, Hammett, Malachi",17014,69
Solomon Islands,"Cyrus, Giacomo, Gretchen, Curran",76935,26
Czech Republic,"Briar, Larissa, Sybil, Colin",29565,88
(...)
As such, the program produces the same strange looking output as the one I've shown before:
CountryNameAddressAgeAndorraAileen, Daquan, Hammett,
Malachi1701469Solomon IslandsCyrus, Giacomo, Gretchen,
Curran7693526Czech RepublicBriar,
But once piped to the other program that you have shown, this produces, I believe, the same output as your program piped into the same second program:
Country|Name|Address|Age
Andorra|Aileen, Daquan, Hammett, Malachi|17014|69|
Solomon Islands|Cyrus, Giacomo, Gretchen, Curran|76935|26|
Czech Republic|Briar, Larissa, Sybil, Colin|29565|88|
Japan|Wyatt, Gavin, Derek, Coby|10734|52|
(...)
The full output is displayed below:
As you can see, "more Perlish" often means much shorter code (although I really did not try to make it particularly concise and I decided, for simplicity, to process the header line separately. If you ignore that and forget about boiler plate code at the beginning, the real program holds in 5 lines). I also think that, on very large input, my version is most probably considerably faster.
I'll be glad to give more information if there is anything you don't understand.
| [reply] [d/l] [select] |
|
my $first_line = <DATA>;
chomp $first_line;
$first_line =~ s/,/chr(31)/eg;
print $first_line, chr(30);
Your join is joining the fields and the record separator. Just add parentheses to
print join chr(31), @fields, chr(30);
like this
print join(chr(31), @fields), chr(30);
That way, the field separators will just separate the fields. :-)
You're also not adequately handling a quoted field at the start of a record or two quoted fields adjacent to each other.
Here's two examples from your output (I think these are the only ones):
...
"Bonaire, Sint Eustatius and Saba|"Charissa, Lana, Liberty, Quail|4451
+7|18|
...
"Virgin Islands, British|"Otto, Macon, Caldwell, Sasha|87676|49|
...
Here's the matching lines from the input:
...
"Bonaire, Sint Eustatius and Saba","Charissa, Lana, Liberty, Quail",44
+517,18
...
"Virgin Islands, British","Otto, Macon, Caldwell, Sasha",87676,49
...
| [reply] [d/l] [select] |
|
Thanks for your comments, Ken. Yes, you are right, this is a 10-minutes solution, certainly not a polished one.
I needed to process the first line differently because of the way I chose to process the other lines, which would not work for the first one, but it is certainly possible to find another way to process the lines that would also work for the first one. However, when I have a header line that needs to be processed differently than the rest of the file, I often prefer to process it before starting to loop on the rest of the file, because the algorithm is then simpler (and often faster, which matters if the file is large).
You are right on the "join" line, it adds a field separator at the end of the records. That did not shock me, but it is indeed different from the output produced by the code in the original post. Adding parens (as per your proposal) solves the issue.
I had not even seen that there were two "irregular" lines with quoted fields at the beginning of the records in the input data, and that of course is a serious problem because it probably means that the whole algorithm has to be modified. BTW, this is a good example of why using a module such as Text::CSV is often better than doing one's own solution.
| [reply] |
|
Re: Suggestions to make this code more Perlish
by Laurent_R (Canon) on Mar 29, 2014 at 16:24 UTC
|
Hi, I think that what you are trying to do could be done in a dozen lines of code, or so (or even less), and I would gladly suggest a short "Perlish" solution, but I can't make sense of the messy output format generated by your code with your input data:
CountryNameAddressAgeAndorraAileen, Daquan, Hammett,
Malachi1701469Solomon IslandsCyrus, Giacomo, Gretchen,
Curran7693526Czech RepublicBriar, (...)
(Carriage returns added to the generated output to avoid one very very long line.) This is surely not the output that you want. Please show the output that you really need.
| [reply] [d/l] |
|
That is the intended output. The ascii character 30(Record Separator) is used as the separator between each line. The ascii character 31(Unit Separator) is used as the separator between each column.
You can use this script to show the output using new lines and pipe characters which make it easier to read.
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
#Print a text file format to standard output.
#https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii
+-delimited-text-not-csv-or-tab-delimited-text
sub print_line {
my ($line) = @_;
foreach(split('', $line)) {
if ($_ eq "\x{1F}") {
print "|";
next;
}
elsif($_ eq "\x{1E}") {
print "\n";
next;
}
print $_;
}
}
while (<>) {
print_line($_);
print "\n";
}
| [reply] [d/l] |
Re: Suggestions to make this code more Perlish
by Anonymous Monk on Mar 28, 2014 at 21:41 UTC
|
| [reply] |
|
| [reply] |