PerlMonks  

Re: Comparing each line of a file to itself

by kschwab (Vicar)
on Jan 13, 2019 at 13:59 UTC ( #1228466=note )


in reply to Comparing each line of a file to itself

If I remember right, DNA sequence files are often very large. You could trade CPU for memory by keying the hash on a digest of each line instead of the line data itself. If a typical sequence file is 60 characters per line, that's 480 bits, so a 128-bit MD5 digest would use significantly less memory per key, at the cost of more CPU.
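use warnings;
use strict;
use Digest::MD5 qw(md5);

my %SEEN;    # 16-byte binary MD5 digest => number of times that line was seen
while (<>) {
    chomp;
    my $digest = md5($_);
    # Post-increment returns the old count, so this is true from the second occurrence on.
    if ($SEEN{$digest}++) {
        printf STDOUT "Dup: [%s] seen %d times\n", $_, $SEEN{$digest};
    }
}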
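
To make the 480-bit versus 128-bit figure concrete, here is a minimal sketch that measures both key sizes. The 60-character line is made up for illustration and is not from the OP's data.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical 60-character sequence line, invented for this example.
my $line = 'ACGT' x 15;

my $raw_bits    = 8 * length($line);         # the line itself used as the hash key
my $digest_bits = 8 * length(md5($line));    # md5() returns 16 raw bytes

printf "raw key: %d bits, MD5 key: %d bits\n", $raw_bits, $digest_bits;
```

This only compares the key payload. Perl's per-entry hash overhead comes on top in both cases, so the real saving is probably smaller than the raw ratio suggests. Two different lines could also, in principle, share a digest, which would be reported as a false duplicate.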

Re^2: Comparing each line of a file to itself
by bliako (Monsignor) on Jan 13, 2019 at 20:27 UTC
     60 characters per line, that's 480 bits

    Why 60 x 8 = 480 bits, when one character from [ATGC] needs only 2 bits?

      Well, yes, a content-aware solution could squeeze that down to 2 bits per character. I was proposing, though, something more memory-efficient than $SEEN{$_}++.

      I don't know much about DNA, but from googling around a bit, "one DNA sequence per line" could mean 237, 373, etc. characters per line. Even at 2 bits per character, 373 * 2 = 746 bits, so a 128-bit MD5 digest could still be significantly smaller.

      Also, I don't know if OP's file format has comments or other things besides A/T/G/C.
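
For comparison, the content-aware idea of 2 bits per character could be sketched roughly as follows. This is only an illustration under assumptions: `pack_bases` and the code table are hypothetical names, and lines containing anything other than A/C/G/T are simply rejected rather than handled.

```perl
use strict;
use warnings;

# Assumed 2-bit codes; any consistent assignment would do.
my %CODE = (A => 0, C => 1, G => 2, T => 3);

# pack_bases() is a hypothetical helper: it returns a compact key for a
# pure A/C/G/T string, or nothing (undef in scalar context) otherwise.
sub pack_bases {
    my ($seq) = @_;
    return if $seq =~ /[^ACGT]/;

    my $bits = '';
    my $i    = 0;
    for my $base (split //, $seq) {
        vec($bits, $i++, 2) = $CODE{$base};    # 2 bits per base
    }
    # Prefix the length so that e.g. "A" and "AA" (both all-zero bits)
    # do not end up with the same key.
    return pack('N', length($seq)) . $bits;
}

my $key = pack_bases('ACGT' x 15);    # made-up 60-base line
printf "packed key: %d bytes\n", length($key);
```

Unlike a digest, these keys are exact, but they grow with the line: a 373-base line would need 94 bytes of packed data plus the 4-byte length prefix, against a fixed 16 bytes for MD5.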
