Remove lines that contain matching values from csv.

by urbs33 (Novice)
on Oct 08, 2012 at 15:56 UTC
urbs33 has asked for the wisdom of the Perl Monks concerning the following question:

Perl Monks,

I am looking for some input on filtering a csv down to unique records based on a specified field. I have a list of files that may exist on multiple servers. I am generating a list of the files, with some additional information about them, including the server that they reside on. I only want to output one record per file, even if it is on multiple servers. Let's suppose that the field to match is field one. I will have something like this.


There are thousands of lines in this file. If one of the files from this report is copied to another server, the unique_names will match, but the server name will not. I only want one record of each unique_name considered in this report, and I do not care which server name it comes from. This is a unix OS, so if there is an easier way to do it with awk, sort|uniq, or other native commands, I am open to that. I'm just kinda stumped since the rest of the line will not match exactly.

Re: Remove lines that contain matching values from csv.
by Anonymous Monk on Oct 08, 2012 at 16:01 UTC
Re: Remove lines that contain matching values from csv.
by BrowserUk (Pope) on Oct 08, 2012 at 16:09 UTC

    perl -F, -anle '++$uniq{$F[0]} == 1 and print' infile > outfile
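For anyone who finds the switches opaque, the same idea can be written out as a short script. This is only a sketch: the sub name `first_seen_lines` and the sample rows are made up for illustration, and it assumes, as the one-liner does, that fields are separated by plain commas with no quoted commas inside a field.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines whose first comma-separated field
# has not been seen before, preserving input order.
sub first_seen_lines {
    my @lines = @_;
    my (%uniq, @keep);
    for my $line (@lines) {
        chomp(my $copy = $line);                  # what -l does on input
        my @F = split /,/, $copy;                 # what -a with -F, does
        my $key = defined $F[0] ? $F[0] : '';     # an empty line has no field
        push @keep, $copy if ++$uniq{$key} == 1;  # keep the first sighting only
    }
    return @keep;
}

# Hypothetical sample rows: name, server, size.
my @sample = (
    "fileA,server1,100",
    "fileB,server1,200",
    "fileA,server2,100",
);
print "$_\n" for first_seen_lines(@sample);
# prints:
# fileA,server1,100
# fileB,server1,200
```

The hash counts how often each first field has appeared, so only the first line for each name survives and the server on that line is simply whichever came first in the input.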

Re: Remove lines that contain matching values from csv.
by fluffyvoidwarrior (Monk) on Oct 09, 2012 at 15:52 UTC
    Perhaps I'm missing the point of your question, but...

    If the fields are ordered as you seem to suggest and the unique id is in position 1, can't you just treat the csv files as plain text files, i.e. as arrays of lines? With the id at the start of the line you can short-circuit for speed by anchoring at the start or using substr, rather than running a regex over the whole line, which is slower. Parse each file and compare position 1 (the id field) against a cumulative output array for uniqueness. So long as the output is less than about 100,000 not-huge lines, Perl should do this in a few seconds per input file (if you've optimised your code).

    (For obvious reasons, simple text file handling is a lot faster than using CSV libraries.)

    I wouldn't know how to do this with a one-liner, but I don't see why you would have to. Maybe it's not the neatest of solutions, but it is a guaranteed, self-contained solution for half an hour's work that other people will easily understand in the future.
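A minimal sketch of the plain-text approach described in this reply might look like the following. The helper names `leading_field` and `unique_records` and the file names in the usage comment are hypothetical. It assumes the id is everything before the first comma and that no field contains a quoted comma. One substitution to note: the reply suggests comparing against the cumulative output array, whereas this sketch keeps that array for the output but does the uniqueness lookup in a hash, which avoids rescanning the array for every line.

```perl
use strict;
use warnings;

# Hypothetical helper: take the text before the first comma as the key,
# without splitting or regex-matching the rest of the line.
sub leading_field {
    my ($line) = @_;
    my $comma = index($line, ',');
    return $comma < 0 ? $line : substr($line, 0, $comma);
}

# Hypothetical helper: read each file in turn and collect the first line
# seen for every key into one cumulative output array.
sub unique_records {
    my @files = @_;
    my (%seen, @output);
    for my $file (@files) {
        open my $in, '<', $file or die "Cannot open $file: $!";
        while (my $line = <$in>) {
            chomp $line;
            next if $line eq '';    # skip blank lines
            push @output, $line unless $seen{ leading_field($line) }++;
        }
        close $in;
    }
    return @output;
}

# Usage (file names are placeholders):
#   perl dedupe.pl report1.csv report2.csv > unique.csv
print "$_\n" for unique_records(@ARGV);
```

Because the `%seen` hash is shared across all input files, a name that appears in the first file is dropped when it turns up again in a later one.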

