Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Need to sort comma delimited, double quoted file

by CSharma (Sexton)
on Jul 25, 2017 at 09:22 UTC ( [id://1195965]=perlquestion: print w/replies, xml ) Need Help??

CSharma has asked for the wisdom of the Perl Monks concerning the following question:

Hi PMs! I've CSV file, comma delimited, embedded (, & ") in double quotes; as you can see in row given below.
6450458,6011,"Urine - Culture & Sensitivity",1658,"Colony Count:","10^ +3 cfu/ml",,2016-10-26 09:55:34,0,"", ,"",,2016-10-26 09:55:34,SS00002 +03,6,"All Tests Done and Verified",,SCIN72669,2016-10-24 12:04:58,21, +"Max Smart Super, Speciality Hospital "O"",3445,"Bansidhar Tarai ",0, +"SAVITRI DEVI",False,"SAVITRI ","DEVI",SKCT,334,20905,"Anjani Kuma +r Agrawal",1957-01-01 00:00:00,0,NULL,"7838457000","INFO@MAXHEALTHCA +RE.COM",OP,NO,Verified,2016-10-24 12:04:58,2016-10-24 12:04:58,Lab,24 +3981,0,"",F 6444885,21732,"Blood - Culture & Sensitivity",3147,"Method BacT/ALERT3 +D & Vitek 2","SubHead",,2016-10-26 09:00:11,1,"min", ,"",0,2016-10-26 + 09:00:11,PM0004746,6,"All Tests Done and Verified",,PMIN4335,2016-10 +-21 19:07:36,25,"PMC",3445,"Bansidhar Tarai ",0,"SUSHILA DEVI 94907", +False,"SUSHILA DEVI","94907",PMCL,1861,69142,"Parkash Gera",1961-01 +-01 00:00:00,0,NULL,"0143000000","INFO@MAXHEALTHCARE.COM",OP,NO,Verif +ied,2016-10-21 19:07:36,2016-10-21 19:07:36,Lab,781642,0,"",F 6444891,21732,"Blood - Culture & Sensitivity",3147,"Method BacT/ALERT3 +D & Vitek 2","SubHead",,2016-10-26 09:00:36,1,"min", ,"",0,2016-10-26 + 09:00:36,PM0004748,6,"All Tests Done and Verified",,PMIN4337,2016-10 +-21 19:11:24,25,"PMC",3445,"Bansidhar Tarai ",0,"TUSHAR BHATIA 94916" +,False,"TUSHAR BHATIA","94916",PMCL,1876,69142,"Parkash Gera",1985- +01-01 00:00:00,0,NULL,"0143000000","INFO@MAXHEALTHCARE.COM",OP,NO,Ver +ified,2016-10-21 19:11:24,2016-10-21 19:11:24,Lab,773211,0,"",M

Requirement: Sort file based on 3 columns in file i.e. primary: 31, secondary: 1, tertiary: 2 i.e. sorted as below, 1st -> 2nd -> 3rd

1,2,3 1,2,4 1,3,4 2,1,5

What I'm doing: i.) converting file to pipe(|) separated from comma(,) -> using Text::CSV module ii.) Sorting using File::Sort Here is the code snippet:

&commaToPipeDelimiter($maxFile, $pipeMaxFile); sort_file({t => '|', k => ['31n', '1n', '2n'], I => $pipeMaxFile, o => + $sortedMaxFile}); sub commaToPipeDelimiter{ my ($file, $pfile) = @_; my $csv = Text::CSV->new({binary => 1, decode_utf8 => 1, auto_diag + => 1, allow_loose_quotes => 1}); open(my $data, '<:encoding(utf8)', $file) or die "Could not open ' +$file' $!\n"; open(W, ">".$pfile) || die "Could not open $pfile $!\n"; while(my $line = <$data>){ chomp $line; if($csv->parse($line)){ my @fields = $csv->fields(); print W join("|",@fields),"\n"; } else{ warn "Line could not be parsed: $!\n"; } }

Is there some other efficient way someone can suggest? Rather than converting file into pipe separated file then sort since files could be much larger. FYI -> Embedded commas need to be taken care of.

-Chetan

Replies are listed 'Best First'.
Re: Need to sort comma delimited, double quoted file
by hippo (Bishop) on Jul 25, 2017 at 09:33 UTC
    Is there some other efficient way someone can suggest?

    Since you are already familiar with Text::CSV, why not use that to read the file into an Array of Arrays (AoA), sort that structure and then write it out to an output file?

    A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Need to sort comma delimited, double quoted file
by Tux (Canon) on Jul 25, 2017 at 13:03 UTC

    The are multiple examples freely available to do exactly that. One of the tools that enables you to do this from the command-line is xlscat as available in the Spreadsheet::Read distribution. That will (also) use Text::CSV_XS to parse your CSV data

    $ xlscat --sort=31n,1n,2n file.csv

    As your CSV however is not valid CSV (see ,"Max Smart Super, Speciality Hospital "O"", where the double quotes inside the field are not escaped), you might need to roll your own script.

    A really quick braindump that works on the snippet you showed is:

    use 5.18.2; use warnings; use Data::Peek; use Text::CSV_XS qw( csv ); my $aoa = csv (in => "test.csv", escape => undef, allow_loose_quotes = +> 1); csv (in => [ sort { $a->[30] <=> $b->[30] || $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @$aoa ]);

    Enjoy, Have FUN! H.Merijn
      I tried the same logic to get my job done as found somewhere but used Text::CSV. Thanks for help Merijn!!
      my @sorted = sort {$a->[30] <=> $b->[30] || $a->[0] <=> $b->[0] || $a- +>[1] <=> $b->[1]} @$ref;

      Moreover, I'm stuck with other things now: i.) The script runs on my local machinewindows, strawberry perl, but fails on linux servers; looks like the files I'm getting are created on windows. Error captured is:: Line could not be parsed: Inappropriate ioctl for device2032EIF - CR char inside unquoted, not part of EOL12645948 which is because of Carriage return, can't see that with 'vi'. How can I handle this, any help?

      ii.) And yes file has inner double quotes in fields as given above; so get below error while executing this; can we handle this?: Line could not be parsed: Inappropriate ioctl for device2032EIF - CR char inside unquoted, not part of EOL12645948 at ./max_sort.pl line 69, <$data> line 60. # CSV_XS ERROR: 2023 - EIQ - QUO character not allowed

      my ($file, $sfile) = @_; my $csv = Text::CSV->new({binary => 1, decode_utf8 => 1, auto_ +diag => 1, allow_loose_quotes => 1}); open(my $data, '<:encoding(utf8)', $file) or die "Could not op +en '$file' $!\n"; my $i = 0; my $ref; my $lineCt = 1; my @header; while(my $line = <$data>){ chomp $line; if($lineCt == 1){ @header = $csv->fields(); $lineCt++; next; } if($csv->parse($line)){ my @fields = $csv->fields(); $ref->[$i] = \@fields; $i = $i+1; } else{ warn "Line could not be parsed: $!",$csv->erro +r_diag; } }

        That code would fail if *any* of the lines contains embedded newlines.

        I'd simplify the code to this, assuming the header has no embedded newlines:

        my $csv = Text::CSV_XS->new ({ binary => 1, decode_utf8 => 1, auto_diag => 1, allow_loose_quotes => 1, }); open my $data, "<:encoding(utf-8)", $file or die "$file: $!\n"; my $ref; my @header = $csv->header ($data); while (my $row = $csv->getline ($data)) { push @$ref, $row; }

        In CSV don't let perl deal with the EOL. In Text::CSV and Text::CSV_XS the differences between Windows EOL and Linux EOL aur fully automatically dealt with within the definition of legal CSV.

        I cannot tell anything about inappropriate ioctl calls regarding to CSV parsing. Neither on Linux nor on Windows. If the error is reproducable, I'd need the CSV file in full (preferably in a ZIP to ensure no binary conversions take place).


        Enjoy, Have FUN! H.Merijn
Re: Need to sort comma delimited, double quoted file
by salva (Canon) on Jul 25, 2017 at 11:00 UTC
    since files could be much larger

    Exactly, how large? The important thing here is to know if the data would fit into your computer RAM or not when expanded into Perl data structures.

    That would tell you if you need to use an internal or an external sorting algorithm.

    Update: BTW, use Text::CSV_XS instead of Text::CSV.

      My understanding is that Text::CSV is just a wrapper for Text::CSV_XS now so no need for people to worry about which one to use. If you are on a platform that makes XS modules difficult to install then you can install the pure Perl version and let Text::CSV fall back to that. The advantage is that using Text::CSV makes your program more portable.

      DESCRIPTION

      Text::CSV is a thin wrapper for Text::CSV_XS-compatible modules now. All the backend modules provide facilities for the composition and decomposition of comma-separated values. Text::CSV uses Text::CSV_XS by default, and when Text::CSV_XS is not available, falls back on Text::CSV_PP, which is bundled in the same distribution as this module.
Re: Need to sort comma delimited, double quoted file
by Lotus1 (Vicar) on Jul 25, 2017 at 13:05 UTC
A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1195965]
Approved by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (3)
As of 2024-04-25 17:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found