Deleting duplicate lines from file

by Win (Novice)
on Feb 16, 2006 at 11:40 UTC ( [id://530636] )

Win has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality.

Replies are listed 'Best First'.
Re: Deleting duplicate lines from file
by marto (Cardinal) on Feb 16, 2006 at 11:48 UTC
    Win,

    I find myself starting most of my replies to your nodes with something like 'Did you Super Search this topic before posting?'.
    Nodes like duplicate lines look promising if you bother to read them, and there are many others you could find if you bothered to search.

    You have been a user here for quite some time, and you have been pointed towards reading the documentation and using this site's fantastic Super Search facility many times by many different monks. Please start taking this advice.

    Martin
      Indeed. He/she/it has been on my "don't reply" list for quite a while. Simply don't bother.


      holli, /regexed monk/

      Not only is this likely to have been discussed here before, but it is also a FAQ. He may easily get to the relevant entry by simply typing the reasonable guess perldoc -q duplicate.

Re: Deleting duplicate lines from file
by bart (Canon) on Feb 16, 2006 at 11:49 UTC
      Probably... but remember that uniq removes only consecutive dupes, not all dupes wherever they appear. (Not sure which the OP was after.)
      $ cat afile
      he
      she
      he
      he
      $ uniq afile
      he
      she
      he
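      (Editorial sketch, not from the original post: the same consecutive-only behaviour written in Perl, keeping the first line of each run:)

      perl -ne 'print unless defined $prev and $_ eq $prev; $prev = $_' afile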
    A reply falls below the community's threshold of quality.
Re: Deleting duplicate lines from file
by turo (Friar) on Feb 16, 2006 at 11:51 UTC

    #!/usr/bin/perl
    use strict;
    use Digest::MD5 qw(md5);

    my (@lines, %line_md5);
    while (<>) {
        unless (/^\s*$/) {
            my $digest = md5($_);
            unless ( exists $line_md5{$digest} ) {
                $line_md5{$digest} = 1;
                push @lines, $_;
            }
        }
        else {
            push @lines, $_;
        }
    }
    print @lines;
    If you need the lines sorted as well:
    cat file | sort | uniq

    Hope that helps :-)

    Update: oops, I forgot to say: Super Search is your friend...

    perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'

      While I often use (MD5) sums, I think that this is overkill for checking duplicate lines; as usual, it exposes you to the risk of false positives, while for reasonably sized lines, which are to be expected in this case, it is quite reasonable to assume that the MD5 sum will have a size comparable to that of the string itself or, depending on the actual data, even larger.

      Also, the code seems just a little bit too verbose for my tastes, without that verbosity adding to readability. However, these are just tastes, so I won't insist too much on this point.

      Last, if one needs to print non-duplicate lines, it's pointlessly resource-consuming to gather them into an array just to print them all together. Granted, this may be an illustration of a more general situation in which one may actually need to store all of them in one place. But the OP is clearly a newbie, and I fear that doing so in this case would risk being cargo-culted into the bad habit of unnecessarily assigning to unnecessary variables all the time. A minimal sketch follows.
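      (Editorial sketch, not from the original post: the direct approach discussed here, using the lines themselves as hash keys and printing as we go, with no digest and no accumulating array:)

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Print each line the first time it is seen; later duplicates are skipped.
      my %seen;
      while (<>) {
          print unless $seen{$_}++;
      }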

      Oh, and the very last thing about your suggestion:

      cat file | sort | uniq

      The following is just equivalent:

      sort -u file

        Okay, I'll take my armor and my shield, and no axes (today I'm friendly).

        1. Why do I use an MD5 digest hash instead of the lines? It would be easier to put the line itself into the hash and then see whether that line exists or not, but then the hash would grow too much in memory (I didn't know the size of the victim file, but I expected it to be long).
        2. False positives. I don't believe in that. An MD5 digest is 16 bytes long, so there are 2^(16*8) possible MD5 digests. The probability of a false positive is about 1/2^(16*8) (I'm not a mathematician, but I think so). It's difficult to find two lines in a file with the same hash... Okay, if we want to be purists, I should add the number of characters of the line to the comparison alongside the MD5 digest. (A rough numeric estimate appears after the code below.)
        3. "is quite reasonable to assume the md5sum will have a size comparable to that of the string itself" i didn't take this assumption, and maybe you have the reason at this point (Win didn't say anything about this particular)...
        4. About my code... I didn't want to make it obscure... I wanted it to be understandable... :'(
        5. If one needs to print non-duplicate lines... okay, I wrote the code in a minute, trying to reply to the question as quickly as I could... It solved the problem, and that was enough for me at that moment... Okay, I admit that I didn't need the array; here I was wrong...
          #!/usr/bin/perl
          use strict;
          use Digest::MD5 qw(md5);

          my %line;
          while (<>) {
              my $digest = md5($_);
              unless ( exists $line{$digest} and $line{$digest} == length ) {
                  $line{$digest} = length;
                  print;
              }
          }
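        (Editorial aside on point 2: for a whole file, the usual birthday-bound estimate is the relevant figure; a quick sketch with a hypothetical line count:)

        # Chance of *any* MD5 collision among n random, distinct lines
        # is roughly n*(n-1)/2 / 2^128 (birthday bound).
        my $n = 1e9;    # hypothetical: a billion distinct lines
        printf "%.3g\n", $n * ($n - 1) / 2 / 2 ** 128;    # ~1.47e-21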

        Last of all, it's nice to receive constructive criticism...

        turo

        PS: thanks for the abbreviation (sort -u file), I didn't know it :-)

        perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
    A reply falls below the community's threshold of quality.
Re: Deleting duplicate lines from file
by spiritway (Vicar) on Feb 16, 2006 at 23:51 UTC

    If the order of the lines is unimportant, sort them first, then examine them in order, testing whether the 'next' line matches the current line. Only print the lines where this is not true (there is no match).
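    For reference, a minimal sketch of that approach (editorial illustration, assuming the whole file fits in memory; it prints the last line of each run of duplicates):

    use strict;
    use warnings;

    my @sorted = sort <>;    # slurp all input lines and sort them
    for my $i (0 .. $#sorted) {
        # Print the current line only when the next line does not match it
        # (the very last line has no successor, so it always prints).
        print $sorted[$i]
            if $i == $#sorted or $sorted[$i] ne $sorted[$i + 1];
    }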

Re: Deleting duplicate lines from file
by thundergnat (Deacon) on Feb 17, 2006 at 01:12 UTC

    Uncomment the appropriate print statement to get the output order you need.

    use warnings;
    use strict;

    @ARGV or die "You need to supply a file name.\n";
    open my $fh, '<', shift or die "$!\n";
    my @lines = <$fh>;

    my %unique;
    @unique{@lines} = (1) x @lines;

    # unique lines
    #print keys %unique;

    # unique sorted lines
    #print sort keys %unique;

    # unique lines in order seen in original file
    do { print if delete $unique{$_} } for @lines;
      @unique{@lines} = (1) x @lines;

      Also

      @unique{@lines} = ();

      since you don't use the values anyway. Whatever, if he wants them in the original order, then slurping the whole file in at once is, as is commonly the case, overkill, and I would regard the usual print if !$seen{$_}++ technique as a superior solution. Of course, if one needs or may need sorting, then the slurping must take place in some form or another, and yours is just as good as any other. You probably already knew this; I'm just pinpointing some details for the benefit of the OP...

        Also

        @unique{@lines} = ();

        since you don't use the values anyway.

        Actually, it does need true values for each key, or delete will return false. So, while it's true that it doesn't use them, it does need them.
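        A quick demonstration of that point (editorial sketch; delete returns the value it removed, and undef is false):

        my %unique;

        @unique{ qw(a b) } = ();                 # keys exist, values are undef
        print "kept\n" if delete $unique{a};     # no output: delete returned undef

        @unique{ qw(a b) } = (1) x 2;            # now every value is 1 (true)
        print "kept\n" if delete $unique{a};     # prints "kept": delete returned 1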

Re: Deleting duplicate lines from file
by blazar (Canon) on Feb 17, 2006 at 09:28 UTC
    Sure:
    perl -ne 'print if !$saw{$_}++'
    perl -pe '$s{$_}++and$_=$,'
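    (Editorial note for readers decoding the golfed second one-liner: -p prints $_ after every input line, and $, (the output field separator) is undef by default, so repeated lines are replaced with nothing:)

    # Spelled out: on a line seen before, overwrite $_ with $, (undef),
    # so the implicit print of -p emits nothing for the duplicate.
    perl -pe '$_ = $, if $s{$_}++'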

      I like this minimalistic bit of code :-) ... You win!

      perl -pe'$_{$_}++and$_=$,'

      perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
