Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Removing repeated lines from file

by matth (Monk)
on Jun 24, 2003 at 11:26 UTC ( #268452=perlquestion: print w/replies, xml ) Need Help??

matth has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

What is the best way to go through a text file an remove lines that are repeated? I wish to otherwise maintain the order of the lines and there is too much data to go into an array.

Replies are listed 'Best First'.
Re: Removing repeated lines from file
by broquaint (Abbot) on Jun 24, 2003 at 11:44 UTC
    Time for one of them nifty one-liners
    perl -ni -e 'print if $last ne $_; $last = $_' your_file
    See. perlrun for more info.
    HTH

    _________
    broquaint

      Ive just created a file called repetitive.txt on my Cygwin system and used perl, v5.6.1 to try your one liner but it gave an error:

      Can't do inplace edit on repetitive.txt: Permission denied.

      And to my surprise the file was deleted. What can be the reason of that behaviour? Did I miss something?

        Perl on Windows can't use -i without an extension. I guess the same applies to Cygwin.

        Try -ni.bak instead.

        --
        John.

Re: Removing repeated lines from file
by ant9000 (Monk) on Jun 24, 2003 at 11:47 UTC
    If you can keep track of lines already read, it's trivial:
    %read=(); while(defined($_=<FILE>)){ if(!defined($read_lines{$_})){ print OUTFILE $_; $read{$_}=1; } }
    If that's too big for your memory to hold (is it, really?), you could try to get a unique signature for each line and save that. What about an MD5 hash of it? It's 32 bites per input line, so it could be a good starting point. Beware, MD5 is not that fast if you have billions of lines!
      This is the only solution, as far, that solves the general problem of repeating lines not only consecutive repeating lines ++. Just fix the name of the hash, now you use two: %read_lines and %read for what should be one I think.

      A compression (broquaint based):

      perl -ni -e 'print if $seen{$_}; $seen{$_} = 1' your_file

        Here is another variation:
        perl -i.bak -ne 'print unless $h{$_}++' file

        However, since these solutions store unique lines in memory they are potentially slow for very large files.

        --
        John.

        Ooops... I should never write untested code ;-)
      Well, it was already given that storing it all in an array was taking too much memory. Given that a hash uses even more overhead than an array, your solution only wins if there are many duplicates. And while an MD5 hash may only be 32 bits (4 bytes), a single Perl scalar already takes at least 24 bytes. Plus two handfuls of bytes for being a hash entry. And the overhead of the hash itself.

      Abigail

      Although it's unlikely, it is possible that two distinct lines could generate the same MD5 hash. Then you'd erroneously toss one of the lines, thinking it was a duplicate of the first.
Re: Removing repeated lines from file
by BazB (Priest) on Jun 24, 2003 at 11:49 UTC

    Have a look at the Perl Power Tool (PPT) project's implementation of the UNIX uniq command.

    You'll need to sort your data before passing it through Tomte's solution or uniq.


    If the information in this post is inaccurate, or just plain wrong, don't just downvote - please post explaining what's wrong.
    That way everyone learns.

      Considering that one of the requirements was to keep the order of the line, you'd also need to undo the sort.

      Abigail

      Each line is made up of 1 or two tab delimited lines. Could I just stick this into a MySQL database and remove duplicated rows using SQL?
Re: Removing repeated lines from file
by Tomte (Priest) on Jun 24, 2003 at 11:39 UTC

    You only have to remember the last unique-line, a shot from the hip (untested!):

    open(FILE, "<", name.ext") or die("couldn't open file: $!"); open(OUT, ">", new.ext") or die("couldn't open file for writing: $!") +; my $cmp = <FILE>; # the first line print OUT $cmp; # is always unique while(<FILE>) { if ($_ ne $cmp) { # current line unlike last one? print OUT $_; # print it out $cmp = $_; # current line new reference } # otherwise ignore }

    regards,
    tomte


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.

Re: Removing repeated lines from file
by diotalevi (Canon) on Jun 24, 2003 at 11:45 UTC

    This is a one-liner. If you add in the '-ibck' flag then you can modify the file directly.

    perl -ne 'next if $prev eq $_; print; $prev = $_'
Re: Removing repeated lines from file
by Abigail-II (Bishop) on Jun 24, 2003 at 12:50 UTC
    IMO, the best way is not to use Perl. There's an excellent utility doing exactly this, it's called uniq, which comes standard with any flavour of Unix I've encountered, and is also present in Unix toolkits for Windows.

    uniq in_file out_file

    Abigail

      uniq works on sorted files only. The man page suggests that it does remowe consecutive duplicates, but the first sentence states that it removes repeating lines from sorted files.

      Update: As Abigail-II pointed out according to POSIX the sugggestion is right. But anyway it works only for consecutive duplicates.

        uniq works on sorted files only. The man page suggests that it does remowe consecutive duplicates, but the first sentence states that it removes repeating lines from sorted files.

        Then either your man page lies, or your vendor uses a uniq implementation that's not confirming to the POSIX standard.

        From the POSIX 1003.1 standard:

        DESCRIPTION
        The uniq utility shall read an input file comparing adjacent lines, and write one copy of each input line on the output. The second and succeeding copies of repeated adjacent input lines shall not be written.

        Repeated lines in the input shall not be detected if they are not adjacent.

        Abigail

        My testing show that uniq does only remove duplicate lines from sorted rows.
Re: Removing repeated lines from file
by Anonymous Monk on Jun 24, 2003 at 20:13 UTC
    here's a ghetto recipe for this situation where memory is at a premium though, it seems, CPU / time is less so...

    1. step thru the document and append a line number to the end of each line. zero pad it to known length for later removal.

    2. sort the file.

    3. remove the line endings to another file, say, numbers.txt ,maintaining order (they'll be out of order of course, having been shuffled with the sort.

    4. step through the sorted file with a uniq-ish algorithm that'll remove consecutive dupes, *but* track the line numbers (as in offset from the begining) being deleted. Delete the corresponding line from the file from 3. i.e. if the 5'th line of text is a dupe and is removed, delete the 5th line in your numbers.txt.

    5. when you've gone through all the text, take the numbers.txt file and this time prepend it to the lines in the text file, line for line.

    6. resort the file. it should now be in it's original order.

    7. remove the line numbers.

    it ain't pretty and it's slow and i/o bound, but it won't use much memory.

      sigh, ++ to Anonymous

      if you have cool text tools (some sort/uniq/cut are really lame) it's even easier to accomplish

      $ cat -n config.log | head 1 This file contains any messages produced by compilers while 2 running configure, to aid debugging if configure makes a mista +ke. 3 4 It was created by configure, which was 5 generated by GNU Autoconf 2.57. Invocation command line was 6 7 $ ./configure 8 9 ## --------- ## 10 ## Platform. ## # lines 3,6,8 are the duplicates in this little test. # fields are tab seperated. $ cat -n config.log | head | # number the lines sort -k 2 | # sort on second field uniq -f 1 | # uniq skipping first field sort -k 1,1n | # numeric sort on first field cut -f2 # extract second field This file contains any messages produced by compilers while running configure, to aid debugging if configure makes a mistake. It was created by configure, which was generated by GNU Autoconf 2.57. Invocation command line was $ ./configure ## --------- ## ## Platform. ##
        Spoken like a true unix fan. :) Simple tools and pipes are cool.
      Of course, in reality the problem is un-solvable. A decision needs to be made in order to amend the problem statement and render a solvable problem. Which one of the N copies of any particular duplicate line should be preserved?
        Hm yes I was thinking the same thing: if the order is important, so is the decision about which one to throw away. Also, in the sample code that matth provides, he's pulling his input lines apart and categorizing them in some way that makes sense only to him, and might be useful for others to know about.

        Can we see some sample input?

Re: Removing repeated lines from file
by husker (Chaplain) on Jun 24, 2003 at 14:30 UTC
    The MD5 hash suggestion got me thinking.

    It seems like the two big obstacles are 1) the duplicate lines are not necessarily adjacent, and you cannot sort it to make them so, and 2) there's too much data to be held "in place".

    What if we could get around obstacle 2? Perhaps if we used some lossless compression on your input, we could reduce it's storage requirement. If the compression is lossless (i.e., the original can be reconstructed with perfect fidelity from it's compressed image), then if we compress two unique lines, their compressed results should also be unique.

    Depending on how much compression you are able to get, you may very well be able to process your input "in memory".

    OK I guess it really doesn't solve the storage problem per se, just kind of avoids it. It's possible that even with compression, your input stream is just too big.

      As my duplicate lines tend to be clustered I have tried this code:
      while(<INFILE>){ #print "here\n"; if ($_ =~ /^(.{1,50})\t{0,50}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0, +50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\t{0,1}(.{0,50})\ +t{0,1}(.{0,50})/){ #print "here_B\n"; $bumb = ""; $gene_id = $1; if ($2 =~ /.{1,50}/){ push (@alternative_ids, $2)} else{ push (@alternative_ids, $bumb); } if ($3 =~ /.{1,50}/){ push (@alternative_ids, $3)} else{ push (@alternative_ids, $bumb); } if ($4 =~ /.{1,50}/){ push (@alternative_ids, $4)} else{ push (@alternative_ids, $bumb); } if ($5 =~ /.{1,50}/){ push (@alternative_ids, $5)} else{ push (@alternative_ids, $bumb); } if ($6 =~ /.{1,50}/){ push (@alternative_ids, $6)} else{ push (@alternative_ids, $bumb); } if ($7 =~ /.{1,50}/){ push (@alternative_ids, $7)} else{ push (@alternative_ids, $bumb); } if ($8 =~ /.{1,50}/){ push (@alternative_ids, $8)} else{ push (@alternative_ids, $bumb); } if ($9 =~ /.{1,50}/){ push (@alternative_ids, $9)} else{ push (@alternative_ids, $bumb); } print @alternative_id; foreach (@alternative_ids){ $old_line = $new_line; $alternative_id = $_; $new_line = "$gene_id\t$alternative_id\n"; @record_previous_lines = qw/blank/; if ($new_line ne $old_line){ #discount consecutive repeats of +identical lines $switch = 1; foreach(@record_previous_lines){ if ($_ =~ /$new_line.$old_line/){ $switch = 2; } } if ($switch = 1){ # don't print lines that have previously bee +n repeated print OUT_B "$new_line$old_line"; } if (@record_previous_lines > 200){ # don't let this array get +bigger than 200 shift @record_previous_lines; # shift deletes the first va +lue of an array } push (@record_previous_lines, $new_line.$old_lines); } } undef @alternative_ids; } }
      The last bit doesn't seem to be working as I want. Can anyone thow light on this? Sorry I am not being more specific.

        In

        if ($switch = 1){ # don't print lines that have previously been +repeated print OUT_B "$new_line$old_line"; }
        should that not be
              if ($switch== 1)

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        OK well if you had said "I expect to find duplicate lines within 200 lines of the original" then we could have helped you sooner!

        :)

        So ... how do you want that "last bit" to work, and how is it really working?

        how clustered are your lines? here are a couple tweakable previous line recognizers.

        # # remember everything, probably uses too much memory. # { my %seen; sub seen_complete { return 1 if exists $seen{$_[0]}; $seen{$_[0]} = (); return 0; } } # # remember last N lines only. # { my %seen; my $remember = 200; my @memory; sub seen_fixed { return 1 if exists $seen{$_[0]}; delete $seen{shift(@memory)} if @memory > $remember; push @memory, $_[0]; $seen{$_[0]} = (); return 0; } } # # remember N buckets of lines with X lines per bucket. # { my @bucket = ( {} ); my $numbuckets = 2; my $bucketsize = 200; sub seen_bucket { foreach (@bucket) { return 1 if exists $_->{$_[0]}; } if (keys %{$bucket[-1]} >= $bucketsize) { shift @bucket if @bucket >= $numbuckets; push @bucket, {}; } $bucket[-1]->{$_[0]} = (); return 0; } }

        i only tested the last one, and only sorta tested it at that.

        while (<>) { print unless seen_bucket($_); } __END__ Ten sets of 1..400 should get uniq'd to 1..400 $ perl -le 'for(1..10){for(1..400){print}}' | perl dup.pl | wc -l 400 Ten sets of 1..401 should get uniq'd to (1..401) x 10 because 2 buckets of 200 lines hold upto 400 $ perl -le 'for(1..10){for(1..401){print}}' | perl dup.pl | wc -l 4010
Re: Removing repeated lines from file
by husker (Chaplain) on Jun 25, 2003 at 20:27 UTC
    After thinking about this WAY too long, two answers came to me: one kind of obscure, the other much more simple.

    The first one used set theory and recursion. It went like this:

    Until dataset is 1 line Split dataset into two halves Take intersection of sets Store intersection in duplicate list Split each dataset into two datasets, and repeat end Open original dataset file Until EOD read line compare to list of known duplicates if in that list if duplicate flag not marked emit line to output mark duplicate as emitted endif else emit line on output endif end
    I thought this was a pretty cool way to generate a list of duplicates. I believe there are modules on CPAN which can do this kind of set operation.

    Then I realized it should be much easier:

    Sort a copy of the datafile Open sorted copy Until EOD Read line Compare to previous line If line == previous line if line not in duplicate table put line in duplicate table endif else previous line = line endif end Open original data file Until EOD read line if line in duplicate table if duplicate not marked emit line on output mark duplicate line end else emit line on output endif end

    Both of these have the advantage of only needing to store the duplicate lines. Both have the disadvantage of having to read through the input set multiple times.

    Although the first solution seems more "cool" to me, the second is certainly more practical and likely faster (unless the dataset is so large you can't sort it either).

      Your second solution is exactly what the propositions using a hash all do.

      Makeshifts last the longest.

        It looked to me like they put the entire file contents in a hash. Mine only puts duplicate lines in the hash.
Re: Removing repeated lines from file
by Anonymous Monk on Jun 25, 2003 at 19:29 UTC
    PUT EACH LINE IN A HASH, IT'LL ELIM DUPES FOR YOU AUTOMATICALLY.
    open(INFH, "<filename.txt"); my @data=<INFH>; close(INFH); open(OUTFH, ">outfile.txt"); my %hashdata = (); foreach my $thisline (@data) { $hashdata{$thisline} = 1; } foreach my $thisline (sort(keys(%hashdata))) { print(OUTFH, $thisline) } close(OUTFH);

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://268452]
Approved by benn
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (3)
As of 2022-09-25 09:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    I prefer my indexes to start at:




    Results (116 votes). Check out past polls.

    Notices?