perldigious has asked for the wisdom of the Perl Monks concerning the following question:
Hi, I'm seeing some unexpected/undesired behavior from some "sort" code I wrote, and I was hoping the Perl Monks could help me understand/solve my problem.
Sparing what I'm hoping are unnecessary details, I basically have a large tab delimited spreadsheet and I'm trying to sort the rows according to the data in a couple of the columns. Here is the relevant code snippet.
use utf8; use 5.022; use strict;
# Open the file for input, discard all the header lines, but stop on a
+nd save the column names in an array.
# Read all the remaining lines in to an array and close the file.
open($fh, "<", $filename) or die "Cannot open \"$filename\" for input:
+ $!\n";
my $column_line = "";
$column_line = <$fh> while !($column_line =~ /ABC RBC Name/i);
chomp(my @columns = split /\t/, $column_line);
chomp(my @data_lines = <$fh>);
close $fh;
# Sort the data lines according to the "Company Name" field and then t
+he "Invoice ID" field.
my ($company_name_index) = grep { $columns[$_] eq "Company Name" } (0.
+.$#columns);
my ($invoice_ID_index) = grep { $columns[$_] eq "Invoice ID" } (0..$#c
+olumns);
@data_lines = sort
{
my($company_name_a, $invoice_ID_a) = (split /\t/, $a)[$company_nam
+e_index, $invoice_ID_index];
my($company_name_b, $invoice_ID_b) = (split /\t/, $b)[$company_nam
+e_index, $invoice_ID_index];
fc($company_name_a) cmp fc($company_name_b) or
$invoice_ID_a <=> $invoice_ID_b
} @data_lines;
# Open the file for output, print a new header and the column line to
+it, then print the now sorted data to it and close the file.
open($fh, ">", $filename) or die "Cannot open \"$filename\" for output
+: $!\n";
print $fh "Replacement Header Text Here\n\n$column_line";
print $fh "$_\n" foreach @data_lines;
The unexpected/undesired behavior is that the attempted first level sort by company name puts any company with a comma in its name at the top of the list. So for the example of the following data set of company names:
SEALEVEL SYSTEMS
SEALEVEL SYSTEMS, INC.
SEBASTIAN COMMUNICATIONS
MASQUE SOUND
MASSTECH, INC
MASTERBILT
SE INTERNATIONAL
The sort will give:
MASSTECH, INC
SEALEVEL SYSTEMS, INC.
MASQUE SOUND
MASTERBILT
SE INTERNATIONAL
SEALEVEL SYSTEMS
SEBASTIAN COMMUNICATIONS
When what's actually desired would be this...
MASQUE SOUND
MASSTECH, INC
MASTERBILT
SE INTERNATIONAL
SEALEVEL SYSTEMS
SEALEVEL SYSTEMS, INC.
SEBASTIAN COMMUNICATIONS
I'm hoping I'm just missing something obvious, but why does it always put the ones with commas at the top? Is there a simple way to get it to sort the way I'd like instead?
UPDATE: Ha, well I found the problem after wasting a day on this. Sorry guys, I appreciate all the replies, but this turned out to be a case of Microsoft Excel being the bane of my existence once again. Turns out the database tool that originally compiles all this data puts double quotes around all those names that have a comma in them. But good ol' Excel naturally decides to remove them when you open up a tab delimited file in it. Due to the sheer amount of the data, I'd only been viewing it in Excel to make it easier to sift through and make sure my program was doing the things I thought it was doing. When I copied and pasted certain cells form Excel to notepad to then post here in the forum, there were no double quotes because Excel had already removed them. However, when I just open up the tab delimited text file in a notepad without Excel's interference, "hey, look at that, double quotes are around all those things... Does a double quote come first in asciibetical order? Yup, sure does."
UPDATE: Sorry again guys, I hope I didn't waste too much of your time. Not a very promising first post for me, I know.
Re: Sorting an array of strings when some of the strings have commas in them?
by GrandFather (Saint) on Dec 11, 2015 at 23:24 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use 5.016;
my $csv = Text::CSV->new({binary => 1, sep_char => "\t"});
my $row;
1 while $row = $csv->getline(*DATA) and $row->[0] !~ /Invoice ID/;
my @columns = @$row;
my @data_lines;
push @data_lines, $row while $row = $csv->getline(*DATA);
# Sort data according to "Company Name" field then "Invoice ID" field.
my ($company_name_index) =
grep {$columns[$_] eq "Company Name"} (0 .. $#columns);
my ($invoice_ID_index) = grep {$columns[$_] eq "Invoice ID"} (0 .. $#c
+olumns);
@data_lines = sort {
fc($a->[$company_name_index]) cmp fc($b->[$company_name_index])
or $a->[$invoice_ID_index] <=> $a->[$invoice_ID_index]
} @data_lines;
# Print a new header and the column line
print "Replacement Header Text Here\n\n", join ("\t", @columns), "\n";
# then print the sorted data
print join ("\t", @$_), "\n" foreach @data_lines;
__DATA__
Original Header Text
Invoice ID Company Name
1 SEBASTIAN COMMUNICATIONS
2 MASQUE SOUND
3 SEALEVEL SYSTEMS
4 "MASSTECH, INC"
5 SE INTERNATIONAL
6 "SOUTHEAST SERVO, LLC"
7 "SEALEVEL SYSTEMS, INC."
8 MASTERBILT
Prints:
Replacement Header Text Here
Invoice ID Company Name
2 MASQUE SOUND
4 MASSTECH, INC
8 MASTERBILT
5 SE INTERNATIONAL
3 SEALEVEL SYSTEMS
7 SEALEVEL SYSTEMS, INC.
1 SEBASTIAN COMMUNICATIONS
6 SOUTHEAST SERVO, LLC
Update: and of course Text::CSV should be used to correctly output the data too:
# Print a new header and the column line to it
print "Replacement Header Text Here\n\n";
# print the sorted data.
print render($csv, @columns), "\n";
print render($csv, @$_), "\n" foreach @data_lines;
sub render {
my ($csv, @cells) = @_;
$csv->combine(@cells);
return $csv->string();
}
with modiffied output:
Replacement Header Text Here
"Invoice ID" "Company Name"
2 "MASQUE SOUND"
4 "MASSTECH, INC"
8 MASTERBILT
5 "SE INTERNATIONAL"
3 "SEALEVEL SYSTEMS"
7 "SEALEVEL SYSTEMS, INC."
1 "SEBASTIAN COMMUNICATIONS"
6 "SOUTHEAST SERVO, LLC"
Premature optimization is the root of all job security
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Sorting an array of strings when some of the strings have commas in them?
by GrandFather (Saint) on Dec 11, 2015 at 22:27 UTC
|
If I turn your code into something to exercise your sort the first thing I find is that fc() is missing. Removing that the sort performs as expected:
#!/usr/bin/perl
use strict;
use warnings;
chomp (my @data_lines = <DATA>);
# Sort the data lines according to the "Company Name" field and then t
+he "Invoice ID" field.
@data_lines = sort {
my ($company_name_a, $invoice_ID_a) = (split /~/, $a)[0, 1];
my ($company_name_b, $invoice_ID_b) = (split /~/, $b)[0, 1];
($company_name_a) cmp ($company_name_b)
or $invoice_ID_a <=> $invoice_ID_b
} @data_lines;
print "$_\n" for @data_lines;
__DATA__
SEALEVEL SYSTEMS~1
SEALEVEL SYSTEMS, INC.~2
SEBASTIAN COMMUNICATIONS~3
MASQUE SOUND~4
MASSTECH, INC~5
MASTERBILT~6
SE INTERNATIONAL~7
Prints:
MASQUE SOUND~4
MASSTECH, INC~5
MASTERBILT~6
SE INTERNATIONAL~7
SEALEVEL SYSTEMS~1
SEALEVEL SYSTEMS, INC.~2
SEBASTIAN COMMUNICATIONS~3
You will notice that I replaced tabs with ~ to make the column separators easier to see.
So maybe you'd like to update my sample code to show your actual problem?
Update: ah, I see fc is "new" as of about Perl 5.16 and I hadn't "turned it on" even though using Perl 5.20 :(. Restoring fc to the sort doesn't alter the result.
Premature optimization is the root of all job security
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Sorting an array of strings when some of the strings have commas in them?
by GrandFather (Saint) on Dec 11, 2015 at 22:50 UTC
|
#!/usr/bin/perl
use strict;
use warnings;
use 5.016;
chomp (my @data_lines = <DATA>);
# Sort the data lines according to the "Company Name" field and then t
+he "Invoice ID" field.
@data_lines = sort {
my ($company_name_a, $invoice_ID_a) = (split /\t/, $a)[1, 0];
my ($company_name_b, $invoice_ID_b) = (split /\t/, $b)[1, 0];
fc($company_name_a) cmp fc($company_name_b)
or $invoice_ID_a <=> $invoice_ID_b
} @data_lines;
print "$_\n" for @data_lines;
__DATA__
1 SEBASTIAN COMMUNICATIONS
2 MASQUE SOUND
3 SEALEVEL SYSTEMS
4 MASSTECH, INC
5 SE INTERNATIONAL
6 SOUTHEAST SERVO, LLC
7 SEALEVEL SYSTEMS, INC.
8 MASTERBILT
Prints:
2 MASQUE SOUND
4 MASSTECH, INC
8 MASTERBILT
5 SE INTERNATIONAL
3 SEALEVEL SYSTEMS
7 SEALEVEL SYSTEMS, INC.
1 SEBASTIAN COMMUNICATIONS
6 SOUTHEAST SERVO, LLC
It ain't the sort. In fact making minimal changes to your sample code to use internal data:
#!/usr/bin/perl
use strict;
use warnings;
use 5.016;
# Open the file for input, discard all the header lines, but stop on a
+nd save the column names in an array.
# Read all the remaining lines in to an array and close the file.
my $column_line = "";
$column_line = <DATA> while defined $column_line && !($column_line =~
+/Invoice ID/i);
chomp(my @columns = split /\t/, $column_line);
chomp(my @data_lines = <DATA>);
# Sort the data lines according to the "Company Name" field and then t
+he "Invoice ID" field.
my ($company_name_index) = grep { $columns[$_] eq "Company Name" } (0.
+.$#columns);
my ($invoice_ID_index) = grep { $columns[$_] eq "Invoice ID" } (0..$#c
+olumns);
@data_lines = sort
{
my($company_name_a, $invoice_ID_a) = (split /\t/, $a)[$company_nam
+e_index, $invoice_ID_index];
my($company_name_b, $invoice_ID_b) = (split /\t/, $b)[$company_nam
+e_index, $invoice_ID_index];
fc($company_name_a) cmp fc($company_name_b) or
$invoice_ID_a <=> $invoice_ID_b
} @data_lines;
# Open the file for output, print a new header and the column line to
+it, then print the now sorted data to it and close the file.
print "Replacement Header Text Here\n\n$column_line";
print "$_\n" foreach @data_lines;
__DATA__
Original Header Text
Invoice ID Company Name
1 SEBASTIAN COMMUNICATIONS
2 MASQUE SOUND
3 SEALEVEL SYSTEMS
4 MASSTECH, INC
5 SE INTERNATIONAL
6 SOUTHEAST SERVO, LLC
7 SEALEVEL SYSTEMS, INC.
8 MASTERBILT
prints:
Replacement Header Text Here
Invoice ID Company Name
2 MASQUE SOUND
4 MASSTECH, INC
8 MASTERBILT
5 SE INTERNATIONAL
3 SEALEVEL SYSTEMS
7 SEALEVEL SYSTEMS, INC.
1 SEBASTIAN COMMUNICATIONS
6 SOUTHEAST SERVO, LLC
so the issue isn't in any of the code you've supplied or the sample data you have supplied.
Your ball ...
Premature optimization is the root of all job security
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Thanks, I figured it out and it was a rather dumb problem not even Perl related. I updated my original post to explain. I appreciate you taking the time to try and help me out, it was actually when I was trying to test your sample code that I randomly stumbled on to my answer. I opened up my little test file to change the tabs to tildes and saw the mysterious double quotes that had been previously missing when viewed in Excel.
| [reply] [Watch: Dir/Any] |
Re: Sorting an array of strings when some of the strings have commas in them?
by choroba (Cardinal) on Dec 11, 2015 at 21:20 UTC
|
Could you show the full data? It seems to sort correctly for me on the data I created from your example:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say fc };
my $company_name_index = 0;
my $invoice_ID_index = 1;
say for sort {
my ($company_name_a, $invoice_ID_a) = (split /\t/, $a)[$company_na
+me_index, $invoice_ID_index];
my ($company_name_b, $invoice_ID_b) = (split /\t/, $b)[$company_na
+me_index, $invoice_ID_index];
fc($company_name_a) cmp fc($company_name_b) or
$invoice_ID_a <=> $invoice_ID_b
} map "$_\t" . rand, (
'SEALEVEL SYSTEMS',
'SEALEVEL SYSTEMS, INC.',
'SEBASTIAN COMMUNICATIONS',
'MASQUE SOUND',
'MASSTECH, INC',
'MASTERBILT',
'SE INTERNATIONAL',
);
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Sorry, I can't provide the full data due to that pesky confidentiality stuff, and I know that's not very helpful. Hopefully this will be just as good. Here is the same code tweaked a little to be more stand alone.
#!/usr/bin/perl
use utf8; use 5.022; use strict;
# Open the file for input, discard all the header lines, but stop on a
+nd save the column names in an array.
# Then read all the remaining lines in to an array and close the file.
my $filename = "test.txt";
my $fh;
open($fh, "<", $filename) or die "Cannot open \"$filename\" for input:
+ $!\n";
my $column_line = "";
$column_line = <$fh> while !($column_line =~ /Invoice ID/i);
chomp(my @columns = split /\t/, $column_line);
chomp(my @data_lines = <$fh>);
close $fh;
# Sort the data lines according to the "Company Name" field, and then
+the "Invoice ID" field.
my ($company_name_index) = grep { $columns[$_] eq "Company Name" } (0.
+.$#columns);
my ($invoice_ID_index) = grep { $columns[$_] eq "Invoice ID" } (0..$#c
+olumns);
@data_lines = sort
{
my($company_name_a, $invoice_ID_a) = (split /\t/, $a)[$company_nam
+e_index, $invoice_ID_index];
my($company_name_b, $invoice_ID_b) = (split /\t/, $b)[$company_nam
+e_index, $invoice_ID_index];
fc($company_name_a) cmp fc($company_name_b) or
$invoice_ID_a <=> $invoice_ID_b
} @data_lines;
# Open the file for output, print the standard header to it, print the
+ now sorted data to it, then close the file.
open($fh, ">", $filename) or die "Cannot open \"$filename\" for output
+: $!\n";
print $fh "Replacement Header Text Here\n\n$column_line";
print $fh "$_\n" foreach @data_lines;
And when I run that with the exact original contents of the "test.txt" file as follows:
Original Header Text
Invoice ID Company Name
1 SEBASTIAN COMMUNICATIONS
2 MASQUE SOUND
3 SEALEVEL SYSTEMS
4 MASSTECH, INC
5 SE INTERNATIONAL
6 SOUTHEAST SERVO, LLC
7 SEALEVEL SYSTEMS, INC.
8 MASTERBILT
Then the resulting modified "test.txt" file is:
Replacement Header Text Here
Invoice ID Company Name
4 MASSTECH, INC
7 SEALEVEL SYSTEMS, INC.
6 SOUTHEAST SERVO, LLC
2 MASQUE SOUND
8 MASTERBILT
5 SE INTERNATIONAL
3 SEALEVEL SYSTEMS
1 SEBASTIAN COMMUNICATIONS
Does that help? Or rather, help you help me? | [reply] [Watch: Dir/Any] [d/l] [select] |
|
No, it doesn't help. It still sorts correctly:
My guess: the indices are wrong. Your code sorts the file by a different column.
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Sorting an array of strings when some of the strings have commas in them?
by Cristoforo (Curate) on Dec 11, 2015 at 22:10 UTC
|
Without a sample file with starting headers, column headers and data lines, I'm just making a guess, but I think you will get your data sorted if you add a line.
tr/,.//d for $company_name_a, $company_name_b;
fc($company_name_a) cmp fc($company_name_b) or
$invoice_ID_a <=> $invoice_ID_b
This transliteration would delete commas and periods from the company name before the sort comparisons.
Hope this helps. | [reply] [Watch: Dir/Any] [d/l] |
|
Thanks for the reply. It had occurred to me to remove the commas prior to doing the sort, or rather to make local copies of them to remove commas from and then sort by those modified local copies without changing the original data. I guess I was just hoping for an explanation to why it sorts the way it does for me as is, and for a fix that was more straightforward.
| [reply] [Watch: Dir/Any] |
|
|