Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Re: finding and deleting repeated lines in a file

by bronto (Priest)
on Jun 19, 2002 at 17:04 UTC ( #175780=note: print w/ replies, xml ) Need Help??


in reply to finding and deleting repeated lines in a file

Uhm... let me think... file is huge, so it is not advisable to keep all the non-repeated sequences inside a hash, or your memory will blow.

I think it would be a good idea to use a message digest on the values to keep memory occupation low (at the expense of some CPU, of course); this also exposes you to the risk of two different sequences having the same digest -the probability should be low, but not null...

I have no data to test and I never used Digest::MD5 directly, so take the subsequent code as a suggestion -it may suit your needs or be completely wrong. I'm looking at the documentation on http://search.cpan.org/doc/GAAS/Digest-MD5-2.20/MD5.pm

use strict ; use warnings ; # ...if you have Perl 5.6 # read from stdin, spit data to stdout # (just to keep it simple) use Digest::MD5 qw(md5_hex) ; # or one of md5*'s my %digests ; while (my $line = <STDIN>) { my $dig = md5_hex($line) ; if (exists $digests{$dig}) { print STDERR "WARNING: duplicated checksum $dig for line $line +\nWARNING: skipping $line\n" ; $digest{$dig}++ ; # you can use this to count repetitions } else { $digest{$dig} = 0 ; print $line ; } }

If this not what you need, I hope that at least this can help you to reach the better solution.

--bronto


Comment on Re: finding and deleting repeated lines in a file
Download Code
Replies are listed 'Best First'.
Re^2: finding and deleting repeated lines in a file
by Aristotle (Chancellor) on Jun 19, 2002 at 19:49 UTC
    A suggestion to avoid possible collisions: if CPU time is not a concern, using several different algorithms to create multiple fingerprints increases the improbability of a collision to astronomically high figures. Even using the same algorithm on the original string and a variant created by some transliteration rules to obtain multiple fingerprints will exponentially decrease the probability of collisions.

    Makeshifts last the longest.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://175780]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others imbibing at the Monastery: (10)
As of 2015-07-31 05:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (274 votes), past polls