Deleting duplicate lines from file

by Win (Novice)
on Feb 16, 2006 at 11:40 UTC ( [id://530636] )

Win has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality.

Replies are listed 'Best First'.
Re: Deleting duplicate lines from file
by marto (Cardinal) on Feb 16, 2006 at 11:48 UTC
    Win,

    I find myself starting most of my replies to your nodes with something like 'Did you Super Search this topic before posting?'.
    Nodes like duplicate lines look promising if you bother to read them, and there are many others you could find if you bothered to search.

    You have been a user here for quite some time, and you have been pointed towards reading the documentation and using this site's fantastic Super Search facility many times by many different monks. Please start taking this advice.

    Martin
      Indeed. He/she/it has been on my "don't reply" list for quite a while. Simply don't bother.


      holli, /regexed monk/

      Not only is this likely to have been discussed here before, but it is also a FAQ. He may easily get to the relevant entry by simply typing the reasonable guess perldoc -q duplicate.

Re: Deleting duplicate lines from file
by bart (Canon) on Feb 16, 2006 at 11:49 UTC
      Probably... but remember that uniq removes only consecutive dupes, not all dupes wherever they appear. (Not sure which the OP was after.)
      $ cat afile
      he
      she
      he
      he
      $ uniq afile
      he
      she
      he
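      (Editorial sketch, not from the original post: the same consecutive-only behaviour written in Perl, keeping the first line of each run:)

      perl -ne 'print unless defined $prev and $_ eq $prev; $prev = $_' afile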
    A reply falls below the community's threshold of quality.
Re: Deleting duplicate lines from file
by turo (Friar) on Feb 16, 2006 at 11:51 UTC

    #!/usr/bin/perl
    use strict;
    use Digest::MD5 qw(md5);

    my (@lines, %line_md5);
    while (<>) {
        unless (/^\s*$/) {
            my $digest = md5($_);
            unless ( exists $line_md5{$digest} ) {
                $line_md5{$digest} = 1;
                push @lines, $_;
            }
        }
        else {
            push @lines, $_;
        }
    }
    print @lines;
    If you need the lines sorted as well:
    cat file | sort | uniq

    Hope that helps :-)

    Update: oops, I forgot to say: Super Search is your friend...

    perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'

      While I often use (MD5) sums, I think that this is overkill for checking duplicate lines; as usual, it exposes you to the risk of false positives, while for reasonably sized lines, which are to be expected in this case, it is quite reasonable to assume that the MD5 sum will have a size comparable to that of the string itself or, depending on the actual data, even larger.

      Also, the code seems just a little bit too verbose for my tastes, without that verbosity adding to readability. However, these are just tastes, so I won't insist too much on this point.

      Last, if one needs to print non-duplicate lines, it's pointlessly resource-consuming to gather them into an array just to print them all together. Granted, this may be an illustration of a more general situation in which one may actually need to store all of them in one place. But the OP is clearly a newbie, and I fear that doing so in this case would risk being cargo-culted into the bad habit of unnecessarily assigning to unnecessary variables all the time. A minimal sketch follows.
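      (Editorial sketch, not from the original post: the direct approach discussed here, using the lines themselves as hash keys and printing as we go, with no digest and no accumulating array:)

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Print each line the first time it is seen; later duplicates are skipped.
      my %seen;
      while (<>) {
          print unless $seen{$_}++;
      }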

      Oh, and the very last thing about your suggestion:

      cat file | sort | uniq

      The following is just equivalent:

      sort -u file

        Okay, I'll take my armor and my shield, and no axes (today I'm friendly).

        1. Why do I use an MD5 digest hash instead of the lines? It would be easier to put the line itself into the hash and then see whether that line exists or not, but then the hash would grow too much in memory (I didn't know the size of the victim file, but I expected it to be long).
        2. False positives. I don't believe in that. An MD5 digest is 16 bytes long, so there are 2^(16*8) possible MD5 digests. The probability of a false positive is about 1/2^(16*8) (I'm not a mathematician, but I think so). It's difficult to find two lines in a file with the same hash... Okay, if we want to be purists, I should add the number of characters of the line to the comparison alongside the MD5 digest. (A rough numeric estimate appears after the code below.)
        3. "is quite reasonable to assume the md5sum will have a size comparable to that of the string itself" i didn't take this assumption, and maybe you have the reason at this point (Win didn't say anything about this particular)...
        4. About my code... I didn't want to make it obscure... I wanted it to be understandable... :'(
        5. If one needs to print non-duplicate lines... okay, I wrote the code in a minute, trying to reply to the question as quickly as I could... It solved the problem, and that was enough for me at that moment... Okay, I admit that I didn't need the array; here I was wrong...
          #!/usr/bin/perl
          use strict;
          use Digest::MD5 qw(md5);

          my %line;
          while (<>) {
              my $digest = md5($_);
              unless ( exists $line{$digest} and $line{$digest} == length ) {
                  $line{$digest} = length;
                  print;
              }
          }
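        (Editorial aside on point 2: for a whole file, the usual birthday-bound estimate is the relevant figure; a quick sketch with a hypothetical line count:)

        # Chance of *any* MD5 collision among n random, distinct lines
        # is roughly n*(n-1)/2 / 2^128 (birthday bound).
        my $n = 1e9;    # hypothetical: a billion distinct lines
        printf "%.3g\n", $n * ($n - 1) / 2 / 2 ** 128;    # ~1.47e-21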

        Last of all, it's nice to receive constructive criticism...

        turo

        PS: thanks for the abbreviation (sort -u file), I didn't know it :-)

        perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
    A reply falls below the community's threshold of quality.
Re: Deleting duplicate lines from file
by spiritway (Vicar) on Feb 16, 2006 at 23:51 UTC

    If the order of the lines is unimportant, sort them first, then examine them in order, testing whether the 'next' line matches the current line. Only print the lines where this is not true (there is no match).
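    For reference, a minimal sketch of that approach (editorial illustration, assuming the whole file fits in memory; it prints the last line of each run of duplicates):

    use strict;
    use warnings;

    my @sorted = sort <>;    # slurp all input lines and sort them
    for my $i (0 .. $#sorted) {
        # Print the current line only when the next line does not match it
        # (the very last line has no successor, so it always prints).
        print $sorted[$i]
            if $i == $#sorted or $sorted[$i] ne $sorted[$i + 1];
    }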

Re: Deleting duplicate lines from file
by thundergnat (Deacon) on Feb 17, 2006 at 01:12 UTC

    Uncomment the appropriate print statement to get the output order you need.

    use warnings;
    use strict;

    @ARGV or die "You need to supply a file name.\n";
    open my $fh, '<', shift or die "$!\n";
    my @lines = <$fh>;

    my %unique;
    @unique{@lines} = (1) x @lines;

    # unique lines
    #print keys %unique;

    # unique sorted lines
    #print sort keys %unique;

    # unique lines in order seen in original file
    do { print if delete $unique{$_} } for @lines;
      @unique{@lines} = (1) x @lines;

      Also

      @unique{@lines} = ();

      since you don't use the values anyway. Whatever, if he wants them in the original order, then slurping the whole file in at once is, as is commonly the case, overkill, and I would regard the usual print if !$seen{$_}++ technique as a superior solution. Of course, if one needs or may need sorting, then the slurping must take place in some form or another, and yours is just as good as any other. You probably already knew this; I'm just pinpointing some details for the benefit of the OP...

        Also

        @unique{@lines} = ();

        since you don't use the values anyway.

        Actually, it does need true values for each key, or delete will return false. So, while it's true that it doesn't use them, it does need them.
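        A quick demonstration of that point (editorial sketch; delete returns the value it removed, and undef is false):

        my %unique;

        @unique{ qw(a b) } = ();                 # keys exist, values are undef
        print "kept\n" if delete $unique{a};     # no output: delete returned undef

        @unique{ qw(a b) } = (1) x 2;            # now every value is 1 (true)
        print "kept\n" if delete $unique{a};     # prints "kept": delete returned 1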

Re: Deleting duplicate lines from file
by blazar (Canon) on Feb 17, 2006 at 09:28 UTC
    Sure:
    perl -ne 'print if !$saw{$_}++'
    perl -pe '$s{$_}++and$_=$,'
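    (Editorial note for readers decoding the golfed second one-liner: -p prints $_ after every input line, and $, (the output field separator) is undef by default, so repeated lines are replaced with nothing:)

    # Spelled out: on a line seen before, overwrite $_ with $, (undef),
    # so the implicit print of -p emits nothing for the duplicate.
    perl -pe '$_ = $, if $s{$_}++'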

      I like this minimalistic bit of code :-) ... You win!

      perl -pe'$_{$_}++and$_=$,'

      perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
