Re: finding and deleting repeated lines in a file

by bronto (Priest)
on Jun 19, 2002 at 17:04 UTC

in reply to finding and deleting repeated lines in a file

Hmm, let me think. The file is huge, so it is not advisable to keep every distinct sequence in a hash, or you will run out of memory.

I think it would be a good idea to store a message digest of each value instead, which keeps memory use low (at the expense of some CPU time, of course). This does expose you to the risk of two different sequences having the same digest; the probability should be low, but it is not zero.
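To put a rough number on "low, but not zero", here is a minimal sketch of the usual birthday-bound approximation. It assumes digests behave like uniformly random 128-bit values (MD5 produces 128 bits) and covers only accidental collisions, not input crafted to collide. The helper name collision_estimate is made up for illustration.

```perl
use strict;
use warnings;

# Birthday-bound approximation: chance that any two of $n distinct lines
# share a digest of $bits bits, assuming uniformly distributed digests.
# Only meaningful while the result is much smaller than 1.
sub collision_estimate {
    my ($n, $bits) = @_;
    return $n * ($n - 1) / 2 / 2**$bits;
}

# Print the estimate for a few file sizes (number of distinct lines).
printf "%.0e lines: about %.2e\n", $_, collision_estimate($_, 128)
    for 1e6, 1e9, 1e12;
```

Even for a billion distinct lines the estimate stays around 1e-21, so an accidental collision is far less likely than most other ways of losing data.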

I have no data to test with and I have never used Digest::MD5 directly, so take the following code as a suggestion: it may suit your needs or be completely wrong. I'm looking at the documentation on

use strict;
use warnings;   # ...if you have Perl 5.6

# read from stdin, spit data to stdout
# (just to keep it simple)
use Digest::MD5 qw(md5_hex);   # or one of md5*'s

my %digests;    # digest => number of repetitions seen so far

while (my $line = <STDIN>) {
    my $dig = md5_hex($line);
    if (exists $digests{$dig}) {
        # $line still carries its own trailing newline
        print STDERR "WARNING: duplicated checksum $dig for line $line\nWARNING: skipping $line\n";
        $digests{$dig}++;   # you can use this to count repetitions
    }
    else {
        $digests{$dig} = 0;
        print $line;        # first occurrence: pass it through
    }
}
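Since the whole point is to save memory, a variant worth considering keys the hash on the raw 16-byte digest from md5() rather than the 32-character hex string from md5_hex(). The sketch below assumes that approach. The sub name dedupe_by_digest and the in-memory demo data are illustrative only, not part of the code above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);    # md5() returns the raw 16-byte digest

# Copy lines from $in to $out, skipping any line whose digest was seen before.
# Returns the number of skipped lines.
sub dedupe_by_digest {
    my ($in, $out) = @_;
    my %seen;
    my $skipped = 0;
    while (my $line = <$in>) {
        if ($seen{ md5($line) }++) {
            $skipped++;             # digest already recorded: drop the line
        }
        else {
            print {$out} $line;     # first occurrence: pass it through
        }
    }
    return $skipped;
}

# Small demonstration on in-memory filehandles.
my $input  = "foo\nbar\nfoo\nbaz\nbar\n";
my $output = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";
my $skipped = dedupe_by_digest($in, $out);
close $out;

print $output;    # first occurrences only, in their original order
print STDERR "skipped $skipped duplicate line(s)\n";
```

To filter a real file, the same sub can be called as dedupe_by_digest(\*STDIN, \*STDOUT).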

If this is not what you need, I hope it at least helps you reach a better solution.


Replies are listed 'Best First'.
Re^2: finding and deleting repeated lines in a file
by Aristotle (Chancellor) on Jun 19, 2002 at 19:49 UTC
    A suggestion to make collisions even less likely: if CPU time is not a concern, computing several fingerprints with different algorithms pushes the odds against a collision to astronomically high figures, because two lines would have to collide under every algorithm at once. Even applying the same algorithm to both the original string and a variant created by some transliteration rules gives you multiple fingerprints and greatly decreases the probability of collisions.
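A minimal sketch of the multiple-fingerprint idea, assuming Digest::SHA is installed (it has been bundled with Perl since 5.10) and picking MD5 plus SHA-1 as the two algorithms. The choice of algorithms and the names fingerprint and unique_lines are illustrative assumptions, not something prescribed above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use Digest::SHA qw(sha1);

# Concatenate two different raw digests (16 + 20 bytes) into one key.
# Two lines are treated as equal only if both digests match.
sub fingerprint {
    my ($line) = @_;
    return md5($line) . sha1($line);
}

# Return the input lines with later repeats removed, order preserved.
sub unique_lines {
    my %seen;
    return grep { !$seen{ fingerprint($_) }++ } @_;
}

my @unique = unique_lines("foo\n", "bar\n", "foo\n", "baz\n");
print @unique;
```

The price is a 36-byte key per distinct line instead of 16, plus the CPU time for the second digest.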

    Makeshifts last the longest.
