Sort CSV file within Excel based on specific column

by jaacmmason (Acolyte)
on Oct 11, 2012 at 19:57 UTC ( #998526=perlquestion )
jaacmmason has asked for the wisdom of the Perl Monks concerning the following question:

Good afternoon. I have a CSV file that I want to open in Excel, sort (anywhere from 1000 to 6000 lines of data) on column 'F', and then delete all rows with specific data in column 'F'. So far I can only open the file; it will not sort for me. My code is short and simple, but it does not work the way I want. Suggestions would be greatly appreciated. I will end up adding much more to this file after sorting, but I need to get past this point first. Here is my code:

    #!perl
    use strict;
    use warnings;
    use Win32::OLE;
    use Win32::OLE qw(in with);
    use Win32::OLE::Variant;
    use Win32::OLE::Const 'Microsoft Excel';
    use File::Copy;

    $Win32::OLE::Warn = 3; # Die on Errors.

    #Open the needed file from server location
    my $excelfile = '\\\\ServerName\\FolderName\\FolderName\\FolderName\\FileName.csv';

    my $Excel = Win32::OLE->GetActiveObject('Excel.Application')
        || Win32::OLE->new('Excel.Application', 'Quit');

    #Open the file needed and activate the worksheet for manipulation ~ Try to get a input box for entry of Branch name to open correct file...
    my $Book = $Excel->Workbooks->Open($excelfile);
    my $Sheet = $Book->Worksheets(1);
    $Sheet->Activate();

    #Sort the file based on "Call Type" Column (F is column number 5, as A is 0)
    $Excel->sort_data('WorksheetName',5,'DESC');

I have even attempted to use;

    #Find Last Column and Row
    my $LastRow = $Sheet->UsedRange->Find({What=>"*", SearchDirection=>xlPrevious, SearchOrder=>xlByRows})->{Row};
    my $LastCol = $Sheet->UsedRange->Find({What=>"*", SearchDirection=>xlPrevious, SearchOrder=>xlByColumns})->{Column};

to then try a loop type of statement. Yeah, that did not work either. Thank you for any suggestions offered. I have a LONG way to go to reach my end result, and if I am having this many issues with a sort, I am worried about the rest of the requirements.

Replies are listed 'Best First'.
Re: Sort CSV file within Excel based on specific column
by Tanktalus (Canon) on Oct 12, 2012 at 01:02 UTC

    When I see "CSV", I think DBD::CSV. Sorting? That's just ORDER BY. At that point, it's pretty much done.

    Mind you, I wouldn't normally sort a CSV file (at least on purpose, I've got many CSV files that are sorted naturally by date because they're inserted that way). I'd leave it as-is, and when I go to use it, I'd again just SELECT my,col,names FROM TABLE ORDER BY col - because next time I might want to ORDER BY names, and really, why waste time sorting it on disk when I'm going to be explicit each time anyway? (Ok, there may be performance reasons, but not usually an issue.)

    No reason to bring up Excel :-)

      My file looks similar to the following:

          Item,Call Start Time,Calling Number,Called Number,Call Length,Call Time,Billable,Dept,Time Zone
          1,Sat Sep 1 08:13:00 2012,(815)444-4444,(815)626-6262,0.4,N/A,Inbound,N, HR,GMT-5
          2,Sat Sep 1 08:13:00 2012,(815)626-6262,(815)950-0000,0.4,Intragroup,Outbound,N,HR,GMT-5
          3,Sat Sep 1 09:04:00 2012,(224)555-9999,(815)626-6262,4.3,N/A,Inbound,N,HR,GMT-5
          4,Sat Sep 1 09:04:00 2012,(815)626-6262,(815)950-0000,4.3,Intragroup,Outbound,N,HR,GMT-5
          5,Sat Sep 1 09:54:00 2012,(815)441-8383,(815)626-6262,0.5,N/A,Inbound,N,HR,GMT-5
          6,Sat Sep 1 09:54:00 2012,(815)626-6262,(815)950-0000,0.5,Intragroup,Outbound,N,HR,GMT-5

      Then I want to take this data (24 separate files with as many as 7000 lines of data in each file, each month) and remove all the "outbound" Call Direction rows, as this is essentially a duplicate field. This will cut our file in half. At this point I need to do more manipulation in order to break up the Call Start Time into 3 separate fields, and to delete three other columns.

      I thought I could manipulate the entire thing using Perl and the Win32::OLE options. Can you tell me whether all of this is possible using Text::CSV rather than Win32::OLE? This is the first time I have used Text::CSV, so I am not sure whether all the functionality I need can be achieved.

      Recommendations on the easiest way for me to achieve what my end goal is would be great! I will have to research either way, as I still classify myself as a Perl newbie.

        With Text::CSV_XS, you would do (roughly):

        1. Read the first line via Text::CSV_XS, ignore it, and write your new first line (with all the adjusted column names) to your output file, via a second Text::CSV_XS object (I think).
        2. Loop:
          1. Read next line via Text::CSV_XS.
          2. Discard "Called Number" (see splice). e.g., splice @row, 3, 1;
          3. Discard other three columns (more splice). (I can't tell you how to do this, I don't know which columns to remove - if they're all one after another, this could be one call, or if they're all separate, it may be multiple calls. If they're the three after called number, you could combine this with the above by splice @row, 3, 4;, but I don't expect that to be the case)
          4. Manipulate Call Start Time ($row[1]). Probably extract what you need via a regex or two. Use splice to put them back in place: splice @row, 1, 1, @new_call_start_time_columns; (this assumes you want the three new columns to be in the same place as the one old column)
          5. Write out the new @row to the output file via the second Text::CSV_XS object.
        3. ???
        4. Profit!
        This produces the output in the same order as the input, which is definitely the easiest. Alternatively, instead of writing out the new @row, just save it to another array, push @all_rows, \@row; (which means you must declare my @row inside the loop, not outside), and when you're done, sort them and then loop through that to spit everything out to your output file.
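
        The per-row steps above can be sketched with nothing but splice and a regex. This is only an illustration under stated assumptions: the row is written out as a plain array (standing in for what Text::CSV_XS would hand back for the first sample line), and the choice of weekday, date and time as the three new fields is a guess, since the thread never says how Call Start Time should be split.

```perl
use strict;
use warnings;

# One parsed row from the sample data, written as a plain array so the
# sketch needs no CSV module. In the real script it would come from getline.
my @row = ( 1, 'Sat Sep 1 08:13:00 2012', '(815)444-4444', '(815)626-6262',
            '0.4', 'N/A', 'Inbound', 'N', 'HR', 'GMT-5' );

# Discard "Called Number" (index 3).
splice @row, 3, 1;

# Break Call Start Time into pieces.
# ASSUMPTION: the three wanted fields are weekday, date and time.
my ( $wday, $mon, $mday, $time, $year ) =
    $row[1] =~ /^(\w+)\s+(\w+)\s+(\d+)\s+([\d:]+)\s+(\d{4})$/
    or die "Unexpected Call Start Time: $row[1]";
my @new_call_start_time_columns = ( $wday, "$mon $mday $year", $time );

# Put the three new columns where the single old column was.
splice @row, 1, 1, @new_call_start_time_columns;

print join( ',', @row ), "\n";
```

        Removing the column first and splitting the timestamp second keeps the indices simple: index 1 is still Call Start Time when the regex runs.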

        I also recommend that, if possible, and it isn't always possible, you have your input files and output files in separate directories. Makes it easier to wipe out all of the output files if there's a coding error and you want to modify and re-run.

        A second option is to use DBD::CSV as I said earlier. The challenge here is that you will be both reading and creating CSV files through a database interface. Definitely possible, but probably a bit more work to set up. As I mentioned in the linked-to article, I like this solution because it makes me think in SQL, where the "S" means "Structured". A side effect is that, as mentioned earlier, you can just ORDER BY on the initial query and let SQL::Statement and friends handle all the heavy work for you, and you can just deal with it at the other end.

        Tanktalus provided excellent directions on how to achieve your goal. With just a little more work, the script I've shown you can accomplish these items. For example:

            while ( my $row = $csv->getline($csvfh) ) {
                next if $row->[6] eq 'Outbound';
                push @csvLines, $row;
            }

        will skip all lines containing 'Outbound' in your shown data set. You can manipulate the row data right after next, to get what you need before pushing the line onto @csvLines.

        How do you want to separate Call Start Time and which three columns do you want to delete?

Re: Sort CSV file within Excel based on specific column
by Kenosis (Priest) on Oct 11, 2012 at 21:15 UTC

    I agree with runrig, that you don't need Excel to sort your csv file. Consider the following:

    FileName.csv file contents before (adapted from here):

        REVIEW_DATE,AUTHOR,ISBN,DISCOUNTED_PRICE
        1985/01/21,"Douglas Adams",0345391802,5.95
        1990/01/12,"Douglas Hofstadter",0465026567,9.95
        1998/07/15,"Timothy ""The Parser"" Campbell",0968411304,18.99
        1999/12/03,"Richard Friedman",0060630353,5.95
        2001/09/19,"Karen Armstrong",0345384563,9.95
        2002/06/23,"David Jones",0198504691,9.95
        2002/06/23,"Julian Jaynes",0618057072,12.50
        2003/09/30,"Scott Adams",0740721909,4.95
        2004/10/04,"Benjamin Radcliff",0804818088,4.95
        2004/10/04,"Randel Helms",0879755725,4.50

    The script which numerically sorts the above csv on the last field:

        use strict;
        use warnings;
        use Text::CSV;

        my ( $csvFileName, @csvLines ) = 'FileName.csv';

        my $csv = Text::CSV->new( { binary => 1, auto_diag => 2 } )
          or die "Cannot use CSV: " . Text::CSV->error_diag();

        open my $csvfh, '<', $csvFileName or die $!;

        # Get first line (array reference to parsed column names)
        my $columnNames = $csv->getline($csvfh);

        # $row contains an array reference to the parsed csv line
        while ( my $row = $csv->getline($csvfh) ) {
            push @csvLines, $row;
        }

        close $csvfh;

        # $a->[3] dereferences the array reference to get the fourth element (index 3)
        @csvLines = sort { $a->[3] <=> $b->[3] } @csvLines;

        # Add column names array reference to beginning of array
        unshift @csvLines, $columnNames;

        $csv->eol("\n");

        # Print the sorted csv lines to a file
        open $csvfh, '>', "sorted_$csvFileName" or die $!;
        $csv->print( $csvfh, $_ ) for @csvLines;
        close $csvfh;

    Results written to file sorted_FileName.csv:

        REVIEW_DATE,AUTHOR,ISBN,DISCOUNTED_PRICE
        2004/10/04,"Randel Helms",0879755725,4.50
        2003/09/30,"Scott Adams",0740721909,4.95
        2004/10/04,"Benjamin Radcliff",0804818088,4.95
        1985/01/21,"Douglas Adams",0345391802,5.95
        1999/12/03,"Richard Friedman",0060630353,5.95
        1990/01/12,"Douglas Hofstadter",0465026567,9.95
        2001/09/19,"Karen Armstrong",0345384563,9.95
        2002/06/23,"David Jones",0198504691,9.95
        2002/06/23,"Julian Jaynes",0618057072,12.50
        1998/07/15,"Timothy ""The Parser"" Campbell",0968411304,18.99

    You'll need to change the following:

    { $a->[3] <=> $b->[3] }

    to (at least):

    { $a->[5] <=> $b->[5] }

    If column F isn't numeric, change the <=> to cmp.
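
    As a small illustration of the two comparison operators, here is a sketch that uses three rows cut down from the call-log sample posted elsewhere in this thread. The rows are hard-coded array references (an assumption, to keep the example free of any CSV module); in the script above they would come from getline.

```perl
use strict;
use warnings;

# Three parsed rows, trimmed to the first six columns (A through F).
# Column F (index 5) holds text; column E (index 4) holds a number.
my @csvLines = (
    [ 1, 'Sat Sep 1 08:13:00 2012', '(815)444-4444', '(815)626-6262', '0.4', 'N/A' ],
    [ 2, 'Sat Sep 1 08:13:00 2012', '(815)626-6262', '(815)950-0000', '0.4', 'Intragroup' ],
    [ 3, 'Sat Sep 1 09:04:00 2012', '(224)555-9999', '(815)626-6262', '4.3', 'N/A' ],
);

# Text column: compare as strings with cmp.
my @by_text = sort { $a->[5] cmp $b->[5] } @csvLines;

# Numeric column, descending: swap $a and $b and keep <=>.
my @by_length_desc = sort { $b->[4] <=> $a->[4] } @csvLines;

print "First by column F: item $by_text[0][0]\n";
print "Longest call:      item $by_length_desc[0][0]\n";
```

    Using <=> on a text column such as this one would treat every value as 0 (and warn under use warnings), so the rows would come back in no useful order.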

    Hope this helps!

    Edit: Script modified to output to file. Also removed the or die $! from--and shortened--the $csv->print (thank you, Tux).

      When using auto_diag => 2, or die … is unneeded. It will happen before inside Text::CSV or Text::CSV_XS, as they use the underlying print.

          for my $csvLine (@csvLines) {
              $csv->print( $csvfh, $csvLine ) or die $!;
          }

      ->

          $csv->print ($csvfh, $_) for @csvLines;

      Enjoy, Have FUN! H.Merijn

        Thank you, Tux. Have modified the script.

      I was looking for this, and it does exactly what I need. Thank you so much!
Re: Sort CSV file within Excel based on specific column
by runrig (Abbot) on Oct 11, 2012 at 20:34 UTC
    No need to open a CSV file in Excel. Use Text::CSV/Text::CSV_XS. Read the lines, put them in a hash keyed by your sort column, sort the keys, and print the lines.

      ...put them in a hash keyed by your sort column...

      Done this way, wouldn't the OP lose a csv row if two entries from the sort column are the same?

        Yes, I suppose the hash is unnecessary.
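
        The row loss is easy to demonstrate. The sketch below uses three hard-coded rows taken from the book sample earlier in the thread (author and price only, which is an assumption made to keep it short) and compares the hash-keyed approach with sorting the rows directly.

```perl
use strict;
use warnings;

# Two of these rows share the same price, the column we sort on.
my @rows = (
    [ 'Scott Adams',       '4.95' ],
    [ 'Benjamin Radcliff', '4.95' ],
    [ 'Randel Helms',      '4.50' ],
);

# Hash keyed by the sort column: a later row with the same key
# silently replaces the earlier one.
my %by_price;
$by_price{ $_->[1] } = $_ for @rows;
my @from_hash = map { $by_price{$_} } sort { $a <=> $b } keys %by_price;

# Sorting the array of rows directly keeps every row.
my @sorted = sort { $a->[1] <=> $b->[1] } @rows;

printf "hash approach kept %d of %d rows; direct sort kept %d\n",
    scalar @from_hash, scalar @rows, scalar @sorted;
```

        If a hash is still wanted for some other reason, storing an array of rows under each key would avoid the loss, at the cost of a little more bookkeeping.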
Re: Sort CSV file within Excel based on specific column
by davies (Parson) on Oct 12, 2012 at 20:54 UTC

    As others have implied, Excel may not be the best tool for you. But assuming it is (say you want to send summaries to managers who won't look at anything else), the following code should do the job, subject to a few points:

    In Re^2: Sort CSV file within Excel based on specific column, you talk about removing Outbound calls, but in the OP you talk about sorting on column F, which contains the call time. As this is N/A in all cases in your example data, I have refrained from sorting. Please let us know if I have misunderstood.

    I have used Advanced Filter to do the deletion as it is much faster, but if you want to delete line by line for some other reason, please see RFC Tutorial - Deleting Excel Rows, Columns and Sheets.

        use strict;
        use warnings;
        use Win32::OLE;
        use Win32::OLE::Const 'Microsoft Excel';

        my $filename = 'Z:\Data\Perl\998526\998682.csv';
        my $xl = Win32::OLE->new('Excel.Application');
        $xl->{Visible} = 1;
        my $wb = $xl->Workbooks->Open($filename);
        my $sht = $wb->Sheets(1);

        $sht->Range('A1:A3')->EntireRow->Insert;
        $sht->Range('G1')->{Value} = $sht->Range('G4')->{Value};
        $sht->Range('G2')->{Value} = 'Outbound';
        my $lastcell = $xl->ActiveCell->SpecialCells(xlCellTypeLastCell)->Address;
        $sht->Range('A4:' . $lastcell)->AdvancedFilter ({Action        => xlFilterInPlace,
                                                         CriteriaRange => $sht->Range('G1:G2'),
                                                         Unique        => 0});
        $xl->{DisplayAlerts} = 0;
        $sht->Range('A5:' . $lastcell)->Delete;
        $xl->{DisplayAlerts} = 1;
        $sht->ShowAllData;
        $sht->Range('A1:A3')->EntireRow->Delete;

    A few points. I strongly advise against taking control of an existing instance of Excel. I have written a few posts here on the subject. It seems to be a technique widely copied from something I can't remember reading, but if there are many more examples, I'll put up a rant/meditation as a single point of reference.

    When using single quotes, you don't need to use multiple backslashes.
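
    A quick check of how single-quoted strings treat backslashes, using the placeholder names from the original post. One caveat that follows from Perl's quoting rules: inside single quotes a doubled backslash still collapses to one, so the two leading backslashes of a UNC path do have to be written as four; only the separators further along can be left single.

```perl
use strict;
use warnings;

# Every backslash doubled, as in the original post.
my $doubled = '\\\\ServerName\\FolderName\\FileName.csv';

# Only the leading UNC pair doubled; the inner separators left single.
my $single = '\\\\ServerName\FolderName\FileName.csv';

print "same path\n" if $doubled eq $single;
print "$single\n";
```

    Both variables hold the same string, which begins with exactly two backslashes.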

    Your technique to find the last row will work for your case, but is very inefficient. I can dream up some cases where it might not work (I haven't tried).

    Don't ->Select or ->Activate. See Excel’s Select and Activate considered harmful. It's fine to read these, but changing them is rarely necessary.


    John Davies
