Remove Duplicate Lines

by dcb0127 (Novice)
on Sep 18, 2002 at 18:53 UTC

dcb0127 has asked for the wisdom of the Perl Monks concerning the following question:

I have this database (1MB) that I know has duplicate lines. The database layout is ID,NAME,ADDRESS,EMAIL. How do I sort it, remove the duplicates, and print the result to a new file?
#!/usr/bin/perl -w
open(INF, "db.txt") or die "Can't open db.txt: $!";
@data = <INF>;
close(INF);
@sd = sort(@data);
# ...what do I use to remove duplicates...
open(OUTF, ">outdb.txt") or die "Can't write outdb.txt: $!";
print OUTF @sd;
close(OUTF);

Replies are listed 'Best First'.
Re: Remove Duplicate Lines
by perrin (Chancellor) on Sep 18, 2002 at 18:57 UTC
    To remove dupes you put things into a hash: my %unique = map { $_ => 1 } @data; But honestly, there's no reason to use Perl for this. Just sort -u file > new_file
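
    A fuller sketch of that hash approach (using the db.txt/outdb.txt names from the question) might look like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp all lines; duplicate lines collapse into a single hash key.
    open my $in, '<', 'db.txt' or die "Can't open db.txt: $!";
    my %unique = map { $_ => 1 } <$in>;
    close $in;

    # Write the unique lines back out, sorted.
    open my $out, '>', 'outdb.txt' or die "Can't write outdb.txt: $!";
    print $out sort keys %unique;
    close $out;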
Re: Remove Duplicate Lines
by rbc (Curate) on Sep 18, 2002 at 19:01 UTC
    If you are on a Unix system whose sort command supports the -u
    switch (-u for unique), you could do this:
    $ sort -u db.txt > outdb.txt
    I know that Solaris's sort has -u; Red Hat's Cygwin does not.
    I hope that helps :)
Re: Remove Duplicate Lines
by BrowserUk (Patriarch) on Sep 18, 2002 at 21:45 UTC

    If perchance you need to keep the unique lines of your file in the same order, then this will remove all but the first occurrence of each line and leave the remaining ones in their original order.

    Just redirect the output to a new file on the command line (and uncomment the open line).

    #! perl -sw
    use strict;

    my %lines;
    #open DATA, $ARGV[0] or die "Couldn't open $ARGV[0]: $!\n";
    while (<DATA>) {
        print if not $lines{$_}++;
    }
    __DATA__
    this is a line
    this is another line
    yet another
    and yet another still
    this is a line
    more and more
    and even more
    this is a line
    and this and that
    but not the other cos its a family website:)

    Gives

    C:\test>uniq
    this is a line
    this is another line
    yet another
    and yet another still
    more and more
    and even more
    and this and that
    but not the other cos its a family website:)

    C:\test>

    The caveat, of course, is that with a large file that hash could get kind of big, but maybe that's OK if this is what you need to do.
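
    If the hash does get too big, one way to shrink it (an illustration, not part of the original reply) is to key the hash on a fixed-size digest of each line instead of the line itself:

    #! perl -w
    use strict;
    use Digest::MD5 qw(md5);   # core module; 16-byte binary digest per line

    my %seen;
    while (<>) {
        # Store only the digest as the key, not the whole line.
        print if not $seen{ md5($_) }++;
    }

    The trade-off is a vanishingly small chance of two different lines sharing a digest, in which case the later one would be dropped.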


    Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!
Re: Remove Duplicate Lines
by tstock (Curate) on Sep 18, 2002 at 19:01 UTC
    If this is a one time fix, you can use the unix commands 'sort' and 'uniq', like so:
    sort db.txt | uniq > newfile.db
    'sort' sorts the lines of db.txt; 'uniq' then cuts out the (now adjacent) duplicates, and the '>' redirects the output to a new file instead of the screen (STDOUT).

    tstock
Re: Remove Duplicate Lines
by fglock (Vicar) on Sep 18, 2002 at 19:04 UTC
    for(@sd){ push @out, $_ if (not @out) or ($out[-1] ne $_); };

    Update: added (not @out) to handle the initially empty array.

      @out = @sd[ grep $sd[$_] ne $sd[1+$_], 0 .. $#sd ];
      With inline sort and a perverse twist:
      my $prev;
      @out = grep "$prev" ne ($prev = $_), sort @sd;   # not for production code

      Makeshifts last the longest.

Re: Remove Duplicate Lines
by Anonymous Monk on Aug 01, 2019 at 14:36 UTC
    If you want to remove duplicates from a data file and the file has a header:
    #!/usr/bin/perl
    $ifile=$ARGV[0];
    $ofile=$ARGV[1];
    $header=`sed -n '1p' $ifile`;
    $data=`sed '1d' $ifile | sort -u`;
    open(my $fh, '>', $ofile) or die "Could not open file '$ofile' $!";
    print $fh $header;
    print $fh $data;
    close $fh;
    exit 0

      Let's see:

      • use strict missing
      • use warnings missing
      • Missing my for $ifile, $ofile, $header, $data.
      • no check that the program is called with the correct number of arguments
      • Forking a shell (1) via qx (``) begs for trouble - see Improve pipe open?
      • ... to run sed, just to read the first line of a file
      • ... while making sed read the entire file
      • ... and ignoring all quoting issues by simply not quoting at all - see The problem of "the" default shell
      • ... and ignoring the fact that sed is not available by default on Windows and other operating systems
      • Forking another shell via qx to pipe sed output to sort -u input
      • ... again without any quoting
      • ... again assuming sed is available everywhere
      • ... assuming a POSIX sort is available everywhere. DOS/Windows sort does not understand -u and can't sort and filter out dupes
      • ... reading the entire output of sort -u into memory
      • ... just to write it out again three lines later
      • And finally, exit 0 is redundant

      This is highly inefficient and has several issues with "interesting" filenames.

      In Re: Remove Duplicate Lines, BrowserUk explains how to use perl properly.
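
      A pure-Perl take on the same task (a sketch in the spirit of that post, not code from it; the script name is made up) that keeps the header and needs no external tools:

      #!/usr/bin/perl
      # dedup_header.pl - copy infile to outfile, keeping the header line
      # and only the first occurrence of every other line.
      use strict;
      use warnings;

      die "Usage: $0 infile outfile\n" unless @ARGV == 2;
      my ($ifile, $ofile) = @ARGV;

      open my $in,  '<', $ifile or die "Could not open '$ifile': $!";
      open my $out, '>', $ofile or die "Could not open '$ofile': $!";

      print $out scalar <$in>;   # pass the header through untouched

      my %seen;
      while (<$in>) {
          print $out $_ unless $seen{$_}++;
      }

      close $in;
      close $out or die "Could not close '$ofile': $!";

      Note that unlike sort -u, this keeps the remaining lines in their original order rather than sorting them.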

      Another option - if running on a POSIX-compatible system - is to use sort properly. Without headers, it is trivial:

      sort -u < inputfile > outputfile

      With headers, this will do:

      head -n 1 inputfile > outputfile
      sed '1d' inputfile | sort -u >> outputfile

      This way, head can stop processing the input file after the first line, unlike sed -n '1p'. Directly writing to the outputfile avoids all further overhead of your script.

      Alexander


      (1) Yes, given a sane filename, perl may start the first sed without the help of the default shell. Change the filename to something interesting and perl will start sed via the default shell.
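
      For completeness, the shell can be bypassed deterministically with the list form of pipe open (a sketch; it still assumes a POSIX system where sed exists):

      # List-form pipe open: no shell is involved, whatever the filename.
      open my $sed, '-|', 'sed', '-n', '1p', $ifile
          or die "Could not run sed: $!";
      my $header = <$sed>;
      close $sed;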

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
