PerlMonks  

Deleting duplicate lines from file

by Win (Novice)
on Feb 16, 2006 at 06:40 UTC ( [id://530636] )


Win has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Is there a module or a nice bit of code that will go through a file and write out a new file with all the duplicate lines removed (reading from top to bottom)?

Replies are listed 'Best First'.
Re: Deleting duplicate lines from file
by marto (Cardinal) on Feb 16, 2006 at 06:48 UTC
    Win,

    I find myself starting most of my replies to your nodes with something like 'Did you Super Search this topic before posting?'.
    Nodes like duplicate lines look promising, if you bother to read them; there are many others you could find if you bothered to search.

    You have been a user here for quite some time, and you have been pointed towards reading the documentation and using this site's fantastic Super Search facility many times by many different monks. Please start taking this advice.

    Martin
      Indeed. He/she/it has been on my "don't reply" list for quite a while. Simply don't bother.


      holli, /regexed monk/

      Not only is this likely to have been discussed here before, it is also a FAQ. He could easily get at the relevant entry by simply typing the reasonable guess perldoc -q duplicate.

Re: Deleting duplicate lines from file
by bart (Canon) on Feb 16, 2006 at 06:49 UTC
      my @clean=do { my %dupe; grep { !$dupe{$_}++ } @list };
      Is this one-liner likely to be of any use? Because I couldn't explain it in English if I tried.

        Yes, it is likely to be of some use. That's precisely what you need. In English, it parses like this:

        • assign to @clean the return value of the last statement in the do block;
        • in the do block, apply grep to @list. Do you know what grep is for? It takes a block (or an expression, but this is the "block form") and evaluates it for each element of the list it is passed. Each element is aliased to $_. If the block returns a true value for a particular value of $_, that $_ is included in grep's return value; otherwise it is discarded;
        • in this case the block consists of a single statement, namely !$dupe{$_}++. Now, $dupe{$_} is the value of the hash %dupe for the key $_. Thus $dupe{$_}++ is a counter for the occurrences of $_: it will be 0 (false) on the first occurrence and a number greater than zero (true) on subsequent ones, so its negation !$dupe{$_}++ is true for the first occurrence of $_ and false for the others.
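        For instance, the idiom can be exercised on a small made-up list (a self-contained sketch, not from the thread):

        ```perl
        use strict;
        use warnings;

        my @list = qw(he she he he it);

        # Keep only the first occurrence of each element, preserving order:
        # $dupe{$_}++ is 0 (false) the first time a value is seen.
        my @clean = do { my %dupe; grep { !$dupe{$_}++ } @list };

        print "@clean\n";  # he she it
        ```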

        Anything else?

        So this seems to be exactly what you need. Except that your description suggests you don't really want to operate on lists and you don't need to assign to arrays. Thus all in all you could do something along the lines of:

        my %saw;
        while (<$in>) {
            print if !$saw{$_}++;
        }

        which can be compacted/golfed as in my other reply.

        UPDATE: speaking tongue-in-cheek - but previous experience tells me you won't listen anyway: the evidence is that your Perl knowledge is quite limited. So far so fine; nobody can require you to be an expert or to become one in a minute. But it is just as evident that you're routinely using perl to get some job done. In that case it would be advisable to get acquainted with Perl's basic syntax, semantics, and idioms. That is: asking here for help is fine and all the rest, but I guarantee that spending some time reading an introductory book or tutorial will enable you to solve problems like this and to understand code like the above, since I assure you that there's nothing particularly advanced or complex involved.

      Probably... but remember that uniq removes consecutive dupes, not all dupes at any point. (Not sure which the OP was after.)

      $ cat afile
      he
      she
      he
      he
      $ uniq afile
      he
      she
      he
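      The difference is easy to reproduce in Perl itself (sample data invented for illustration):

      ```perl
      use strict;
      use warnings;

      my @lines = qw(he she he he);

      # Global dedup: drop a line if it appeared anywhere before.
      my %seen;
      my @global = grep { !$seen{$_}++ } @lines;

      # Consecutive dedup, like uniq(1): drop a line only if it
      # repeats the immediately preceding one.
      my @uniq;
      for my $line (@lines) {
          push @uniq, $line unless @uniq && $uniq[-1] eq $line;
      }

      print "global: @global\n";  # global: he she
      print "uniq:   @uniq\n";    # uniq:   he she he
      ```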
Re: Deleting duplicate lines from file
by turo (Friar) on Feb 16, 2006 at 06:51 UTC

    #!/usr/bin/perl
    use strict;
    use Digest::MD5 qw(md5);

    my (@lines, %line_md5);
    while (<>) {
        unless (/^\s*$/) {
            my $digest = md5($_);
            unless ( exists $line_md5{$digest} ) {
                $line_md5{$digest} = 1;
                push @lines, $_;
            }
        } else {
            push @lines, $_;
        }
    }
    print @lines;
    If you need the output sorted:
    cat file | sort | uniq

    Hope that helps :-)

    Update: oops, I forgot to say: Super Search is your friend ...

    perl -Te 'print map { chr((ord)-((10,20,2,7)[$i++])) } split //,"turo"'
      I am in the process of trying to get this code to work. Please could someone offer me a detailed explanation of how this works so I can fix problems that I am having with it.

      For example, I don't understand this line:
      my (@lines, %line_md5);
      or these lines:
      my $digest = md5($_); unless ( exists $line_md5{$digest} ) { $line_md5{$digest} = 1;

        my

        unless

        It looks like perldoc is down... so try these links:

        my

        unless

        (BTW, does anyone know where perldoc.perl.org keeps the info about unless and other control structures?)

        You hurt me ...
        program file
        that's enough? ...

        I recommend you read some Perl guide.
        So how much are you getting paid for this one? How much do we get?

        It works nearly in the same exact way as the code you already asked about except that:

        1. the flow control is syntactically (but not logically!) different;
        2. it doesn't check the actual strings, but a checksum computed from them, which gives a sufficient condition for two of them to be different. That is, if the checksums are different, then the strings are different too, while the converse does not hold: different strings may have the same checksum (but we rely on the confidence that such occurrences are rare enough), hence my remarks
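        To make the point concrete, here is a small sketch (the sample lines are invented): deduplicating on MD5 digests yields the same result as deduplicating on the strings themselves, unless two distinct lines happen to collide.

        ```perl
        use strict;
        use warnings;
        use Digest::MD5 qw(md5);

        my @lines = ("foo\n", "bar\n", "foo\n", "baz\n");

        my (%seen_str, %seen_md5);
        my @by_string = grep { !$seen_str{$_}++ }      @lines;
        my @by_digest = grep { !$seen_md5{md5($_)}++ } @lines;

        # Identical output unless two distinct lines collide in MD5,
        # which is astronomically unlikely for ordinary data.
        print @by_string;  # foo bar baz
        print @by_digest;  # foo bar baz
        ```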

        Then you wrote:

        For example, I don't understand this line:
        my (@lines, %line_md5);

        Oh my! Please tell me you're joking!! I see that you have 204 writeups as of now. That's quite surprising... maybe you're not really programming in Perl, but in some vaguely similar language. What is it, precisely, that you do not understand?

      While I often use (md5) sums, I think this is overkill for checking duplicate lines; as usual, it exposes you to the risk of false positives, while for reasonably sized lines, which are to be expected in this case, it is quite reasonable to assume that the md5sum will have a size comparable to that of the string itself or, depending on the actual data, even larger.

      Also, the code seems just a little bit too verbose for my tastes, without that verbosity adding to readability. However, these are just tastes, so I won't insist too much on this point.

      Last, if one needs to print non-duplicate lines, it's pointlessly resources-consuming to gather them into an array to print them all together. Granted, this may be an illustration for a more general situation in which one may actually need to store all of them in one place. But the OP is clearly a newbie and I fear that doing so in this case would risk being cargo culted into the bad habit of unnecessarily assigning to unnecessary variables all the time.

      Oh, and the very last thing about your suggestion:

      cat file | sort | uniq

      The following is just equivalent:

      sort -u file

        Okay, I'll take my armor and my shield, and no axes (today I'm friendly).

        1. Why do I use an md5 digest hash instead of the lines themselves? It would be easier to put the line into the hash and then see whether that line exists or not, but the hash would grow too much in memory (I didn't know the size of the victim file, but I expected it to be long).
        2. False positives? I don't believe that. An MD5 digest is 16 bytes long, so there are 2^(16*8) possible digests. The probability of a false positive for a given pair is 1/2^(16*8) (I'm not a mathematician, but I think so). It's difficult to find two lines in a file with the same hash... Okay, if we want to be purists, I should add the number of characters of the line to the comparison along with the md5 digest.
        3. "is quite reasonable to assume the md5sum will have a size comparable to that of the string itself": I didn't make that assumption, and maybe you are right on this point (Win didn't say anything about this in particular)...
        4. About my code... I didn't want to make it obscure... I wanted it to be understandable... :'(
        5. "if one needs to print non-duplicate lines"... okay, I wrote the code in a minute, trying to answer the question as quickly as I could... I solved the problem, so that was enough for me at the moment... Okay, I admit that I didn't need to use the array; here I was wrong...
          #!/usr/bin/perl
          use strict;
          use Digest::MD5 qw(md5);

          my %line;
          while (<>) {
              my $digest = md5($_);
              unless ( exists $line{$digest} and $line{$digest} == length ) {
                  $line{$digest} = length;
                  print;
              }
          }

        Last of all, it's nice to receive constructive criticism ...

        turo

        PS: thanks for the abbreviation (sort -u file), I didn't know it :-)

Re: Deleting duplicate lines from file
by spiritway (Vicar) on Feb 16, 2006 at 18:51 UTC

    If the order of the lines is unimportant, sort them first, then examine them in order, testing whether the 'next' line matches the current line. Only print the lines where this is not true (there is no match).
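    A sketch of that approach (sample data invented; it sorts an in-memory array rather than a file handle, so the original order is not preserved):

    ```perl
    use strict;
    use warnings;

    my @lines = ("she\n", "he\n", "he\n", "it\n");

    # Sort first, so duplicates become adjacent; then print a line
    # only when it differs from the previous one.
    my @sorted = sort @lines;
    my ($prev, @out);
    for my $line (@sorted) {
        push @out, $line unless defined $prev && $line eq $prev;
        $prev = $line;
    }

    print @out;  # he it she
    ```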

Re: Deleting duplicate lines from file
by thundergnat (Deacon) on Feb 16, 2006 at 20:12 UTC

    Uncomment the appropriate print statement to get the output order you need.

    use warnings;
    use strict;

    @ARGV or die "You need to supply a file name.\n";
    open my $fh, '<', shift or die "$!\n";

    my @lines = <$fh>;
    my %unique;
    @unique{@lines} = (1) x @lines;

    # unique lines
    #print keys %unique;

    # unique sorted lines
    #print sort keys %unique;

    # unique lines in order seen in original file
    do { print if delete $unique{$_} } for @lines;
      @unique{@lines} = (1) x @lines;

      Also

      @unique{@lines} = ();

      since you don't use the values anyway. Whatever: if he wants them in the original order, then slurping the whole file in at once is, as is commonly the case, overkill, and I would regard the usual print if !$seen{$_}++ technique as a superior solution. Of course, if one needs (or may need) sorting, then the slurping must take place in some form or another, and yours is as good as any other. Probably you already knew; I'm just pinpointing some details for the benefit of the OP...

        Also

        @unique{@lines} = ();

        since you don't use the values anyway.

        Actually, it does need true values for each key or delete will return false. So, while it's true it doesn't use them, it does need them.
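        A quick demonstration of why the true values matter (keys made up for illustration):

        ```perl
        use strict;
        use warnings;

        my (%with, %without);
        my @keys = qw(a b a);

        @with{@keys}    = (1) x @keys;  # values are 1 (true)
        @without{@keys} = ();           # values are undef (false)

        # delete returns the stored value, so it only works as a
        # boolean filter when the values are true.
        my @kept    = grep { delete $with{$_} }    @keys;  # ('a', 'b')
        my @dropped = grep { delete $without{$_} } @keys;  # ()

        print scalar(@kept), " ", scalar(@dropped), "\n";  # 2 0
        ```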

Re: Deleting duplicate lines from file
by blazar (Canon) on Feb 17, 2006 at 04:28 UTC
    Sure:
    perl -ne 'print if !$saw{$_}++'
    perl -pe '$s{$_}++and$_=$,'
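    The second one-liner relies on $, (the output field separator) defaulting to undef: once a line has been seen, $_ is overwritten with that undefined value, so -p prints nothing for it, newline included. Spelled out as a script (a sketch; it uses '' instead of $, to avoid an undef warning, and a sample list instead of file input):

    ```perl
    use strict;
    use warnings;

    my %s;
    my @in = ("he\n", "she\n", "he\n");
    my @out;
    for my $line (@in) {
        # Seen before? Blank the whole line, as the one-liner
        # does with $_ = $,  ($, defaults to undef).
        $line = '' if $s{$line}++;
        push @out, $line;
    }

    print @out;  # prints "he" and "she"; the duplicate vanishes
    ```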

      I like this minimalistic bit of code :-) ... You win!

      perl -pe'$_{$_}++and$_=$,'


Node Type: perlquestion [id://530636]
Approved by Corion