PerlMonks  

Re: finding and deleting repeated lines in a file

by bronto (Priest)
on Jun 19, 2002 at 17:04 UTC ( #175780=note )


in reply to finding and deleting repeated lines in a file

Uhm... let me think... the file is huge, so it is not advisable to keep all the non-repeated sequences in a hash, or your memory will blow up.

I think it would be a good idea to use a message digest of the values to keep memory consumption low (at the expense of some CPU, of course); this also exposes you to the risk of two different sequences having the same digest - the probability is low, but not zero...

I have no data to test against and I have never used Digest::MD5 directly, so take the code below as a suggestion - it may suit your needs or be completely wrong. I'm looking at the documentation at http://search.cpan.org/doc/GAAS/Digest-MD5-2.20/MD5.pm

    use strict;
    use warnings;   # ...if you have Perl 5.6

    # read from STDIN, spit data to STDOUT (just to keep it simple)
    use Digest::MD5 qw(md5_hex);   # or one of the other md5* functions

    my %digests;
    while (my $line = <STDIN>) {
        my $dig = md5_hex($line);
        if (exists $digests{$dig}) {
            print STDERR "WARNING: duplicated checksum $dig for line $line";
            print STDERR "WARNING: skipping $line";
            $digests{$dig}++;   # you can use this to count repetitions
        }
        else {
            $digests{$dig} = 0;
            print $line;
        }
    }

If this is not what you need, I hope it at least helps you reach a better solution.

--bronto


Re^2: finding and deleting repeated lines in a file
by Aristotle (Chancellor) on Jun 19, 2002 at 19:49 UTC
    A suggestion to avoid possible collisions: if CPU time is not a concern, using several different algorithms to create multiple fingerprints drives the probability of a collision down to astronomically small figures. Even using the same algorithm on the original string and on a variant produced by some transliteration rules to obtain multiple fingerprints will exponentially decrease the probability of a collision.

    Makeshifts last the longest.
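    Aristotle's double-fingerprint suggestion could be sketched like this (untested; it assumes Digest::SHA, which is bundled with recent Perls - on older installations the Digest::SHA1 CPAN module provides the same sha1_hex):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Digest::SHA qw(sha1_hex);   # assumption: core in recent Perls

# Keep only the first occurrence of each line, keyed on two
# independent digests: a false positive now requires the same
# pair of lines to collide under both MD5 and SHA-1 at once.
sub uniq_lines {
    my %seen;
    return grep { !$seen{ md5_hex($_) . sha1_hex($_) }++ } @_;
}
```

    The concatenated key doubles the per-line memory compared to a single digest, but stays fixed-size regardless of how long the sequences are.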
