
Re: Deleting duplicate lines from file

by turo (Friar)
on Feb 16, 2006 at 06:51 UTC (id://530640)



in reply to Deleting duplicate lines from file

#!/usr/bin/perl
use strict;
use Digest::MD5 qw(md5);

my (@lines, %line_md5);
while (<>) {
    unless (/^\s*$/) {
        # non-blank line: keep it only if its digest has not been seen yet
        my $digest = md5($_);
        unless ( exists $line_md5{$digest} ) {
            $line_md5{$digest} = 1;
            push @lines, $_;
        }
    }
    else {
        # blank (whitespace-only) lines are always kept
        push @lines, $_;
    }
}
print @lines;
If you also want the lines sorted (or don't mind losing the original order):
cat file | sort | uniq

Hope that helps :-)

Update: oops, I forgot to say: Super Search is your friend ...

perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'

Re^2: Deleting duplicate lines from file
by Win (Novice) on Feb 16, 2006 at 10:47 UTC
    I am in the process of trying to get this code to work. Please could someone offer me a detailed explanation of how this works so I can fix problems that I am having with it.

    For example, I don't understand this line:
    my (@lines, %line_md5);
    or these lines:
    my $digest = md5($_);
    unless ( exists $line_md5{$digest} ) {
        $line_md5{$digest} = 1;

      my

      unless

      It looks like perldoc is down....so try these links:

      my

      unless (BTW does anyone know where perldoc.perl.org keeps the info about unless and other control structures?)

      You hurt me ...
      program file
      Is that enough? ....

      I recommend you read a Perl guide.
      perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
      So how much are you getting paid for this one? How much do we get?

      It works nearly in the same exact way as the code you already asked about except that:

      1. the flow control is syntactically (but not logically!) different;
      2. it doesn't do a check on the actual strings, but on a checksum computed for them, which gives a sufficient condition for two of them to be different. That is, if the checksums are different, then the strings will be different too, while the converse does not hold: different strings may have the same checksum (but we rely on the confidence that such occurrences are rare enough), hence => my remarks
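Point 2 above can be illustrated with a minimal sketch. This is not code from the thread: the sub names `dedup_by_string` and `dedup_by_md5` are made up for illustration, and the original script's special handling of blank lines is left out to keep the comparison short.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper: keep the first occurrence of each line,
# using the line itself as the hash key (an exact comparison).
sub dedup_by_string {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical helper: same idea, but keyed on the 16-byte binary MD5
# digest. Different digests guarantee different lines; equal digests
# only make equal lines overwhelmingly likely.
sub dedup_by_md5 {
    my %seen;
    return grep { !$seen{ md5($_) }++ } @_;
}

my @input = ("a\n", "b\n", "a\n", "c\n", "b\n");
print dedup_by_md5(@input);
```

On ordinary input both subs should return the same list; they could only disagree if two different lines happened to share an MD5 digest.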

      Then you wrote:

      For example, I don't understand this line:
      my (@lines, %line_md5);

      Oh my! Please tell me you're joking!! I see that you have 204 writeups as of now. That's quite surprising... maybe you're not really programming in Perl, but in some vaguely similar language. What is it precisely that you do not understand?

        The question is.

        Why use
        my (@lines, %line_md5);
        when you can use
        my @lines; my %line_md5;
        I don't really see that there is any point in grouping them together like that. I am thinking of putting a motto on my messages: Keep it simple when possible. Go complex when required.
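For what it's worth, the two spellings declare the same two lexical variables, so the choice is stylistic. A small sketch of that, with illustrative sub names (`count_grouped`, `count_separate`, `first_and_count`) that do not come from the thread; the last sub shows the one place where the parentheses after `my` do change meaning, namely assignment:

```perl
use strict;
use warnings;

# Grouped declaration, as in the script under discussion.
sub count_grouped {
    my (@lines, %line_md5);          # both start out empty
    push @lines, $_ for @_;
    $line_md5{$_} = 1 for @_;
    return (scalar @lines, scalar keys %line_md5);
}

# Separate declarations: the same two lexicals, the same behaviour.
sub count_separate {
    my @lines;
    my %line_md5;
    push @lines, $_ for @_;
    $line_md5{$_} = 1 for @_;
    return (scalar @lines, scalar keys %line_md5);
}

# Parentheses matter once a value is assigned.
sub first_and_count {
    my ($first) = @_;   # list assignment: takes the first element
    my $count   = @_;   # scalar assignment: takes the number of elements
    return ($first, $count);
}

print join(' ', count_grouped(qw(a b a))), "\n";    # prints "3 2"
print join(' ', count_separate(qw(a b a))), "\n";   # prints "3 2"
print join(' ', first_and_count(qw(a b a))), "\n";  # prints "a 3"
```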
Re^2: Deleting duplicate lines from file
by blazar (Canon) on Feb 17, 2006 at 06:25 UTC

    While I often use (md5) sums, I think that this is overkill for checking duplicate lines. As usual, it exposes you to the risk of false positives, while for reasonably sized lines, which are to be expected in this case, it is quite reasonable to assume that the md5sum will have a size comparable to that of the string itself or, depending on the actual data, even larger.

    Also, the code seems just a little bit too verbose for my tastes. Without that verbosity adding to readability, that is. However they're just tastes, so I won't insist too much on this point.

    Last, if one needs to print non-duplicate lines, it's pointlessly resource-consuming to gather them into an array just to print them all together. Granted, this may be an illustration of a more general situation in which one may actually need to store all of them in one place. But the OP is clearly a newbie and I fear that doing so in this case would risk being cargo culted into the bad habit of unnecessarily assigning to unnecessary variables all the time.
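The streaming approach described above might look like the following. It is a sketch using the common `%seen` idiom rather than code posted in this thread; the sub name `print_unique` is hypothetical, and unlike the original script it also collapses repeated blank lines.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, printing each
# distinct line the first time it appears. Only the set of lines seen
# so far is kept in memory, not an ordered copy of the output.
sub print_unique {
    my ($in, $out) = @_;
    my %seen;
    while (my $line = <$in>) {
        print {$out} $line unless $seen{$line}++;
    }
    return;
}

# Demo on an in-memory filehandle so the sketch runs without a file.
my $data = "a\nb\na\n";
open my $in, '<', \$data or die "cannot open in-memory handle: $!";
print_unique($in, \*STDOUT);
```

For real input, `print_unique(\*STDIN, \*STDOUT)` would filter standard input, and the same idea fits in a one-liner: `perl -ne 'print unless $seen{$_}++' file`.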

    Oh, and the very last thing about your suggestion:

    cat file | sort | uniq

    The following is just equivalent:

    sort -u file

      Okay, i'll take my armor and my shield, and no axes (today i'm friendly)

      1. Why do I use the MD5 digest as the hash key instead of the line itself? It would be easier to put the line into the hash and then check whether that line exists or not. But then the hash would grow too large in memory (I didn't know the size of the victim file, but I expected it to be very long).
      2. False positives. I don't believe that. An MD5 digest is 16 bytes long, so there are 2^(16*8) possible MD5 digests. The probability of a false positive for a given pair of different lines is 1/2^(16*8) (I'm not a mathematician, but I think so). It's difficult to find two lines in a file with the same hash... Okay, if we want to be purists, I should add the number of characters in the line to the comparison, alongside the MD5 digest.
      3. "is quite reasonable to assume the md5sum will have a size comparable to that of the string itself": I didn't make that assumption, and maybe you are right on this point (Win didn't say anything about this in particular)...
      4. About my code ... I didn't want to make it obscure ... I wanted it to be understandable ... :'(
      5. "if one needs to print non-duplicate lines" ... okay, I wrote the code in a minute, trying to answer the question as quickly as I could ... I solved the problem, so that was enough for me at that moment... Okay, I accept that I didn't need to use the array; here I was wrong...
        #!/usr/bin/perl
        use strict;
        use Digest::MD5 qw(md5);

        my %line;
        while (<>) {
            my $digest = md5($_);
            unless ( exists $line{$digest} and $line{$digest} == length ) {
                $line{$digest} = length;
                print;
            }
        }
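As a rough check on point 2: the 1/2^128 figure is per pair of distinct lines, and a file of n lines contains n(n-1)/2 pairs, so the usual birthday-bound approximation multiplies the two. The sketch below assumes MD5 digests behave like uniformly random 128-bit values on the input, which does not hold for deliberately crafted collisions; `collision_estimate` is an illustrative name, not something from the thread.

```perl
use strict;
use warnings;

# Birthday-bound approximation for the chance that at least two of
# $n distinct lines share a digest of $bits bits:
#     p ~ (n * (n - 1) / 2) / 2**bits
# This is only a good estimate while the result is small.
sub collision_estimate {
    my ($n, $bits) = @_;
    return ( $n * ( $n - 1 ) / 2 ) / ( 2**$bits );
}

# One million and one billion distinct lines against a 128-bit digest.
printf "%.3g\n", collision_estimate( 1e6, 128 );
printf "%.3g\n", collision_estimate( 1e9, 128 );
```

Even for a billion distinct lines the estimate stays many orders of magnitude below anything likely to be observed by accident, which supports the point that accidental false positives are not a practical worry here.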

      Last of all, it's nice to receive constructive criticism ...

      turo

      PS: thanks for the abbreviation (sort -u file), I didn't know it :-)

      perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
