Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

Sorting an array of strings when some of the strings have commas in them?

by perldigious (Priest)
on Dec 11, 2015 at 20:27 UTC ( [id://1150070]=perlquestion: print w/replies, xml ) Need Help??

perldigious has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm seeing some unexpected/undesired behavior from some "sort" code I wrote, and I was hoping the Perl Monks could help me understand/solve my problem.

Sparing what I'm hoping are unnecessary details, I basically have a large tab delimited spreadsheet and I'm trying to sort the rows according to the data in a couple of the columns. Here is the relevant code snippet.

use utf8; use 5.022; use strict; # Open the file for input, discard all the header lines, but stop on a +nd save the column names in an array. # Read all the remaining lines in to an array and close the file. open($fh, "<", $filename) or die "Cannot open \"$filename\" for input: + $!\n"; my $column_line = ""; $column_line = <$fh> while !($column_line =~ /ABC RBC Name/i); chomp(my @columns = split /\t/, $column_line); chomp(my @data_lines = <$fh>); close $fh; # Sort the data lines according to the "Company Name" field and then t +he "Invoice ID" field. my ($company_name_index) = grep { $columns[$_] eq "Company Name" } (0. +.$#columns); my ($invoice_ID_index) = grep { $columns[$_] eq "Invoice ID" } (0..$#c +olumns); @data_lines = sort { my($company_name_a, $invoice_ID_a) = (split /\t/, $a)[$company_nam +e_index, $invoice_ID_index]; my($company_name_b, $invoice_ID_b) = (split /\t/, $b)[$company_nam +e_index, $invoice_ID_index]; fc($company_name_a) cmp fc($company_name_b) or $invoice_ID_a <=> $invoice_ID_b } @data_lines; # Open the file for output, print a new header and the column line to +it, then print the now sorted data to it and close the file. open($fh, ">", $filename) or die "Cannot open \"$filename\" for output +: $!\n"; print $fh "Replacement Header Text Here\n\n$column_line"; print $fh "$_\n" foreach @data_lines;

The unexpected/undesired behavior is that the attempted first level sort by company name puts any company with a comma in its name at the top of the list. So for the example of the following data set of company names:

SEALEVEL SYSTEMS SEALEVEL SYSTEMS, INC. SEBASTIAN COMMUNICATIONS MASQUE SOUND MASSTECH, INC MASTERBILT SE INTERNATIONAL

The sort will give:

MASSTECH, INC SEALEVEL SYSTEMS, INC. MASQUE SOUND MASTERBILT SE INTERNATIONAL SEALEVEL SYSTEMS SEBASTIAN COMMUNICATIONS

When what's actually desired would be this...

MASQUE SOUND MASSTECH, INC MASTERBILT SE INTERNATIONAL SEALEVEL SYSTEMS SEALEVEL SYSTEMS, INC. SEBASTIAN COMMUNICATIONS

I'm hoping I'm just missing something obvious, but why does it always put the ones with commas at the top? Is there a simple way to get it to sort the way I'd like instead?

UPDATE: Ha, well I found the problem after wasting a day on this. Sorry guys, I appreciate all the replies, but this turned out to be a case of Microsoft Excel being the bane of my existence once again. Turns out the database tool that originally compiles all this data puts double quotes around all those names that have a comma in them. But good ol' Excel naturally decides to remove them when you open up a tab delimited file in it. Due to the sheer amount of the data, I'd only been viewing it in Excel to make it easier to sift through and make sure my program was doing the things I thought it was doing. When I copied and pasted certain cells form Excel to notepad to then post here in the forum, there were no double quotes because Excel had already removed them. However, when I just open up the tab delimited text file in a notepad without Excel's interference, "hey, look at that, double quotes are around all those things... Does a double quote come first in asciibetical order? Yup, sure does."

UPDATE: Sorry again guys, I hope I didn't waste too much of your time. Not a very promising first post for me, I know.

Replies are listed 'Best First'.
Re: Sorting an array of strings when some of the strings have commas in them?
by GrandFather (Saint) on Dec 11, 2015 at 23:24 UTC

    Of course the correct answer is "use Text::CSV":

    #!/usr/bin/perl use strict; use warnings; use Text::CSV; use 5.016; my $csv = Text::CSV->new({binary => 1, sep_char => "\t"}); my $row; 1 while $row = $csv->getline(*DATA) and $row->[0] !~ /Invoice ID/; my @columns = @$row; my @data_lines; push @data_lines, $row while $row = $csv->getline(*DATA); # Sort data according to "Company Name" field then "Invoice ID" field. my ($company_name_index) = grep {$columns[$_] eq "Company Name"} (0 .. $#columns); my ($invoice_ID_index) = grep {$columns[$_] eq "Invoice ID"} (0 .. $#c +olumns); @data_lines = sort { fc($a->[$company_name_index]) cmp fc($b->[$company_name_index]) or $a->[$invoice_ID_index] <=> $a->[$invoice_ID_index] } @data_lines; # Print a new header and the column line print "Replacement Header Text Here\n\n", join ("\t", @columns), "\n"; # then print the sorted data print join ("\t", @$_), "\n" foreach @data_lines; __DATA__ Original Header Text Invoice ID Company Name 1 SEBASTIAN COMMUNICATIONS 2 MASQUE SOUND 3 SEALEVEL SYSTEMS 4 "MASSTECH, INC" 5 SE INTERNATIONAL 6 "SOUTHEAST SERVO, LLC" 7 "SEALEVEL SYSTEMS, INC." 8 MASTERBILT

    Prints:

    Replacement Header Text Here Invoice ID Company Name 2 MASQUE SOUND 4 MASSTECH, INC 8 MASTERBILT 5 SE INTERNATIONAL 3 SEALEVEL SYSTEMS 7 SEALEVEL SYSTEMS, INC. 1 SEBASTIAN COMMUNICATIONS 6 SOUTHEAST SERVO, LLC

    Update: and of course Text::CSV should be used to correctly output the data too:

    # Print a new header and the column line to it print "Replacement Header Text Here\n\n"; # print the sorted data. print render($csv, @columns), "\n"; print render($csv, @$_), "\n" foreach @data_lines; sub render { my ($csv, @cells) = @_; $csv->combine(@cells); return $csv->string(); }

    with modiffied output:

    Replacement Header Text Here "Invoice ID" "Company Name" 2 "MASQUE SOUND" 4 "MASSTECH, INC" 8 MASTERBILT 5 "SE INTERNATIONAL" 3 "SEALEVEL SYSTEMS" 7 "SEALEVEL SYSTEMS, INC." 1 "SEBASTIAN COMMUNICATIONS" 6 "SOUTHEAST SERVO, LLC"
    Premature optimization is the root of all job security
Re: Sorting an array of strings when some of the strings have commas in them?
by GrandFather (Saint) on Dec 11, 2015 at 22:27 UTC

    If I turn your code into something to exercise your sort the first thing I find is that fc() is missing. Removing that the sort performs as expected:

    #!/usr/bin/perl use strict; use warnings; chomp (my @data_lines = <DATA>); # Sort the data lines according to the "Company Name" field and then t +he "Invoice ID" field. @data_lines = sort { my ($company_name_a, $invoice_ID_a) = (split /~/, $a)[0, 1]; my ($company_name_b, $invoice_ID_b) = (split /~/, $b)[0, 1]; ($company_name_a) cmp ($company_name_b) or $invoice_ID_a <=> $invoice_ID_b } @data_lines; print "$_\n" for @data_lines; __DATA__ SEALEVEL SYSTEMS~1 SEALEVEL SYSTEMS, INC.~2 SEBASTIAN COMMUNICATIONS~3 MASQUE SOUND~4 MASSTECH, INC~5 MASTERBILT~6 SE INTERNATIONAL~7

    Prints:

    MASQUE SOUND~4 MASSTECH, INC~5 MASTERBILT~6 SE INTERNATIONAL~7 SEALEVEL SYSTEMS~1 SEALEVEL SYSTEMS, INC.~2 SEBASTIAN COMMUNICATIONS~3

    You will notice that I replaced tabs with ~ to make the column separators easier to see.

    So maybe you'd like to update my sample code to show your actual problem?

    Update: ah, I see fc is "new" as of about Perl 5.16 and I hadn't "turned it on" even though using Perl 5.20 :(. Restoring fc to the sort doesn't alter the result.

    Premature optimization is the root of all job security
Re: Sorting an array of strings when some of the strings have commas in them?
by GrandFather (Saint) on Dec 11, 2015 at 22:50 UTC

    Test code updated to use sample data supplied:

    #!/usr/bin/perl use strict; use warnings; use 5.016; chomp (my @data_lines = <DATA>); # Sort the data lines according to the "Company Name" field and then t +he "Invoice ID" field. @data_lines = sort { my ($company_name_a, $invoice_ID_a) = (split /\t/, $a)[1, 0]; my ($company_name_b, $invoice_ID_b) = (split /\t/, $b)[1, 0]; fc($company_name_a) cmp fc($company_name_b) or $invoice_ID_a <=> $invoice_ID_b } @data_lines; print "$_\n" for @data_lines; __DATA__ 1 SEBASTIAN COMMUNICATIONS 2 MASQUE SOUND 3 SEALEVEL SYSTEMS 4 MASSTECH, INC 5 SE INTERNATIONAL 6 SOUTHEAST SERVO, LLC 7 SEALEVEL SYSTEMS, INC. 8 MASTERBILT

    Prints:

    2 MASQUE SOUND 4 MASSTECH, INC 8 MASTERBILT 5 SE INTERNATIONAL 3 SEALEVEL SYSTEMS 7 SEALEVEL SYSTEMS, INC. 1 SEBASTIAN COMMUNICATIONS 6 SOUTHEAST SERVO, LLC

    It ain't the sort. In fact making minimal changes to your sample code to use internal data:

    #!/usr/bin/perl use strict; use warnings; use 5.016; # Open the file for input, discard all the header lines, but stop on a +nd save the column names in an array. # Read all the remaining lines in to an array and close the file. my $column_line = ""; $column_line = <DATA> while defined $column_line && !($column_line =~ +/Invoice ID/i); chomp(my @columns = split /\t/, $column_line); chomp(my @data_lines = <DATA>); # Sort the data lines according to the "Company Name" field and then t +he "Invoice ID" field. my ($company_name_index) = grep { $columns[$_] eq "Company Name" } (0. +.$#columns); my ($invoice_ID_index) = grep { $columns[$_] eq "Invoice ID" } (0..$#c +olumns); @data_lines = sort { my($company_name_a, $invoice_ID_a) = (split /\t/, $a)[$company_nam +e_index, $invoice_ID_index]; my($company_name_b, $invoice_ID_b) = (split /\t/, $b)[$company_nam +e_index, $invoice_ID_index]; fc($company_name_a) cmp fc($company_name_b) or $invoice_ID_a <=> $invoice_ID_b } @data_lines; # Open the file for output, print a new header and the column line to +it, then print the now sorted data to it and close the file. print "Replacement Header Text Here\n\n$column_line"; print "$_\n" foreach @data_lines; __DATA__ Original Header Text Invoice ID Company Name 1 SEBASTIAN COMMUNICATIONS 2 MASQUE SOUND 3 SEALEVEL SYSTEMS 4 MASSTECH, INC 5 SE INTERNATIONAL 6 SOUTHEAST SERVO, LLC 7 SEALEVEL SYSTEMS, INC. 8 MASTERBILT

    prints:

    Replacement Header Text Here Invoice ID Company Name 2 MASQUE SOUND 4 MASSTECH, INC 8 MASTERBILT 5 SE INTERNATIONAL 3 SEALEVEL SYSTEMS 7 SEALEVEL SYSTEMS, INC. 1 SEBASTIAN COMMUNICATIONS 6 SOUTHEAST SERVO, LLC

    so the issue isn't in any of the code you've supplied or the sample data you have supplied.

    Your ball ...

    Premature optimization is the root of all job security

      Thanks, I figured it out and it was a rather dumb problem not even Perl related. I updated my original post to explain. I appreciate you taking the time to try and help me out, it was actually when I was trying to test your sample code that I randomly stumbled on to my answer. I opened up my little test file to change the tabs to tildes and saw the mysterious double quotes that had been previously missing when viewed in Excel.

Re: Sorting an array of strings when some of the strings have commas in them?
by choroba (Cardinal) on Dec 11, 2015 at 21:20 UTC
    Could you show the full data? It seems to sort correctly for me on the data I created from your example:
    #!/usr/bin/perl use warnings; use strict; use feature qw{ say fc }; my $company_name_index = 0; my $invoice_ID_index = 1; say for sort { my ($company_name_a, $invoice_ID_a) = (split /\t/, $a)[$company_na +me_index, $invoice_ID_index]; my ($company_name_b, $invoice_ID_b) = (split /\t/, $b)[$company_na +me_index, $invoice_ID_index]; fc($company_name_a) cmp fc($company_name_b) or $invoice_ID_a <=> $invoice_ID_b } map "$_\t" . rand, ( 'SEALEVEL SYSTEMS', 'SEALEVEL SYSTEMS, INC.', 'SEBASTIAN COMMUNICATIONS', 'MASQUE SOUND', 'MASSTECH, INC', 'MASTERBILT', 'SE INTERNATIONAL', );
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Sorry, I can't provide the full data due to that pesky confidentiality stuff, and I know that's not very helpful. Hopefully this will be just as good. Here is the same code tweaked a little to be more stand alone.

      #!/usr/bin/perl use utf8; use 5.022; use strict; # Open the file for input, discard all the header lines, but stop on a +nd save the column names in an array. # Then read all the remaining lines in to an array and close the file. my $filename = "test.txt"; my $fh; open($fh, "<", $filename) or die "Cannot open \"$filename\" for input: + $!\n"; my $column_line = ""; $column_line = <$fh> while !($column_line =~ /Invoice ID/i); chomp(my @columns = split /\t/, $column_line); chomp(my @data_lines = <$fh>); close $fh; # Sort the data lines according to the "Company Name" field, and then +the "Invoice ID" field. my ($company_name_index) = grep { $columns[$_] eq "Company Name" } (0. +.$#columns); my ($invoice_ID_index) = grep { $columns[$_] eq "Invoice ID" } (0..$#c +olumns); @data_lines = sort { my($company_name_a, $invoice_ID_a) = (split /\t/, $a)[$company_nam +e_index, $invoice_ID_index]; my($company_name_b, $invoice_ID_b) = (split /\t/, $b)[$company_nam +e_index, $invoice_ID_index]; fc($company_name_a) cmp fc($company_name_b) or $invoice_ID_a <=> $invoice_ID_b } @data_lines; # Open the file for output, print the standard header to it, print the + now sorted data to it, then close the file. open($fh, ">", $filename) or die "Cannot open \"$filename\" for output +: $!\n"; print $fh "Replacement Header Text Here\n\n$column_line"; print $fh "$_\n" foreach @data_lines;

      And when I run that with the exact original contents of the "test.txt" file as follows:

      Original Header Text Invoice ID Company Name 1 SEBASTIAN COMMUNICATIONS 2 MASQUE SOUND 3 SEALEVEL SYSTEMS 4 MASSTECH, INC 5 SE INTERNATIONAL 6 SOUTHEAST SERVO, LLC 7 SEALEVEL SYSTEMS, INC. 8 MASTERBILT

      Then the resulting modified "test.txt" file is:

      Replacement Header Text Here Invoice ID Company Name 4 MASSTECH, INC 7 SEALEVEL SYSTEMS, INC. 6 SOUTHEAST SERVO, LLC 2 MASQUE SOUND 8 MASTERBILT 5 SE INTERNATIONAL 3 SEALEVEL SYSTEMS 1 SEBASTIAN COMMUNICATIONS

      Does that help? Or rather, help you help me?

        No, it doesn't help. It still sorts correctly: My guess: the indices are wrong. Your code sorts the file by a different column.
        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Sorting an array of strings when some of the strings have commas in them?
by Cristoforo (Curate) on Dec 11, 2015 at 22:10 UTC
    Without a sample file with starting headers, column headers and data lines, I'm just making a guess, but I think you will get your data sorted if you add a line.
    tr/,.//d for $company_name_a, $company_name_b; fc($company_name_a) cmp fc($company_name_b) or $invoice_ID_a <=> $invoice_ID_b
    This transliteration would delete commas and periods from the company name before the sort comparisons.

    Hope this helps.

      Thanks for the reply. It had occurred to me to remove the commas prior to doing the sort, or rather to make local copies of them to remove commas from and then sort by those modified local copies without changing the original data. I guess I was just hoping for an explanation to why it sorts the way it does for me as is, and for a fix that was more straightforward.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1150070]
Approved by stevieb
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (6)
As of 2024-03-28 21:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found