Unique uniq implementation

by feloniousMonk (Pilgrim)
on Mar 06, 2002 at 20:56 UTC

feloniousMonk has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm trying to write something to perform a uniq
on a large input file. Not too big, less than 10 meg
I would imagine.

Here's the kicker - the lines look like this:

NUMBER FIELD(s)

I need to uniq based on FIELD(s). Sorting by NUMBER
is secondary, not absolutely necessary.

NUMBER and FIELD(s) are separated by a single space
in all cases so a split is easy.

My first reaction was to use a hash, but its memory
consumption would be way too much.

Thanks,
felonious
--
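
For reference, a minimal sketch of the hash-keyed-on-FIELD(s) approach that the replies below recommend. Reading from STDIN, keeping the first NUMBER seen per key, and the numeric output sort are assumptions for illustration, not part of the original post:

#!/usr/bin/perl
use strict;
use warnings;

# Deduplicate on FIELD(s): NUMBER and FIELD(s) are separated by a single
# space, so a 3-argument split with a limit of 2 keeps any spaces inside
# FIELD(s) intact. Keeping the first NUMBER seen per key is an assumption.
my %seen;
while (my $line = <STDIN>) {
    chomp $line;
    my ($number, $fields) = split / /, $line, 2;
    next unless defined $fields;
    $seen{$fields} = $number unless exists $seen{$fields};
}

# Secondary sort by NUMBER (assumed numeric), as mentioned above.
for my $fields (sort { $seen{$a} <=> $seen{$b} } keys %seen) {
    print "$seen{$fields} $fields\n";
}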

Replies are listed 'Best First'.
•Re: Unique uniq implementation
by merlyn (Sage) on Mar 06, 2002 at 21:03 UTC
    My first reaction was to use a hash, but its memory consumption would be way too much.
    Wait! I thought you said your entire data is only 10 meg! You could have a hash of all the FIELDs and still probably use up less memory than your STDIO buffers! Or did you mean something like "10 gig" instead of "10 meg"?

    -- Randal L. Schwartz, Perl hacker

Re: Unique uniq implementation
by zengargoyle (Deacon) on Mar 06, 2002 at 21:28 UTC
    $ perl -e'while(1){printf"%d ",$i++;printf("field%d ",rand(20))for(0..rand(20));print"\n";}' >foo.dat
    <CTRL-C>
    $ head -4 foo.dat
    0 field11 field6 field9 field14 field19 field11 field0 field2 field18 field10 field4 field17 field17
    1 field18 field12 field19 field9 field15
    2 field3 field16 field12 field17 field2 field10 field10 field1 field1 field10 field5 field1 field5 field11 field2
    3 field11 field2 field19 field16 field19 field15 field6 field10 field2 field7 field17 field8 field4
    $ ls -l foo.dat
    -rw-------   1 notroot      14753792 Mar  6 13:13 foo.dat
    $ time perl -lane '$n=shift@F;$f="@F";push@{$d{$f}},$n;END{for(sort keys %d){print"$_ -> @{$d{$_}}" if (@{$d{$_}}>1);}}' <foo.dat >foo.log
    real       21.1
    user       18.9
    sys         0.9
    $ tail -4 foo.log
    field9 field9 field3 -> 50797 123541
    field9 field9 field5 -> 25185 66389 134175 138790
    field9 field9 field6 -> 8571 93213
    field9 field9 field6 field2 -> 64192 151266

    topped out at SIZE 60M RES 60M on my piddly 500MHz SunBlade.

    Are you sure about your memory consumption? I grok bigdata daily with perl.

Re: Unique uniq implementation
by vladb (Vicar) on Mar 06, 2002 at 21:06 UTC
    I'd say just use a hash. 10MB is not much. I have a few scripts that process even larger files with ease using perl hashes.

    I need to uniq based on FIELD(s)...

    However, if you are intent on staying away from having to deal with 'large' hashes, could you please elaborate more on what is involved in 'uniq'? Do you simply want to weed out similar records (e.g. collapse large data files)? Or sort the file based on certain fields?


    "There is no system but GNU, and Linux is one of its kernels." -- Confession of Faith
Re: Unique uniq implementation
by feloniousMonk (Pilgrim) on Mar 06, 2002 at 21:45 UTC
    Would've been nice of me to elaborate, sorry :-)

    OK - Now I have about 10 meg, maybe more. I don't know
    what I will have down the road. Maybe 100 meg next time
    around?

    And the data - it's a single number (a count) plus a
    text descriptor. The descriptor may have whitespace, but
    I can count on only one space between the number and the
    descriptor.

    Now, what I need to do is take all the lines with the same
    descriptor, add their counts, and print them.

    i.e.,

    my ($freq, $word) = split / /, $_, 2;  # limit of 2 keeps whitespace in the descriptor
    $freq_hash{$word} += $freq;

    does the job in a rather unscalable way.

    Make sense?

    Thanks again,
    felonious
    --
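
    For reference, a complete, runnable version of the snippet above; reading from STDIN and the descending sort of the output are assumptions:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sum the counts for every distinct descriptor. The limit of 2 on
    # split keeps whitespace inside the descriptor intact; only one
    # space separates the count from the descriptor.
    my %freq_hash;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($freq, $word) = split / /, $line, 2;
        next unless defined $word;
        $freq_hash{$word} += $freq;
    }

    # Print each descriptor with its summed count, largest count first.
    for my $word (sort { $freq_hash{$b} <=> $freq_hash{$a} } keys %freq_hash) {
        print "$freq_hash{$word} $word\n";
    }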
      If you are concerned about size and memory consumption, then I'd suggest looking at DB_File, MLDBM and Storable. You can then tie your hash to the disk. However, this shifts your memory problem to a disk problem; generally speaking, though, HDD space is cheaper than memory space.

      However, as has been mentioned above, Perl can handle very large sets of data quite easily.
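
      A minimal sketch of the tied-hash approach just described, using DB_File; the file name freq.db and the count-summing shape of the loop are assumptions for illustration:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Fcntl;
      use DB_File;

      # Tie %freq_hash to a Berkeley DB file on disk so that memory use
      # stays small; lookups and stores now cost disk I/O instead.
      tie my %freq_hash, 'DB_File', 'freq.db', O_CREAT | O_RDWR, 0644, $DB_HASH
          or die "Cannot tie freq.db: $!";

      while (my $line = <STDIN>) {
          chomp $line;
          my ($freq, $word) = split / /, $line, 2;
          next unless defined $word;
          $freq_hash{$word} += $freq;
      }

      # Print each descriptor with its summed count.
      while (my ($word, $freq) = each %freq_hash) {
          print "$freq $word\n";
      }

      untie %freq_hash;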
        I do that for another implementation but it seems to be
        very slow. I want the world! But actually, I've decided
        to stick with the hash until the data gets
        too big, then I'll probably do a DB_Hash.

        Thanks for the help everyone!
        -felonious
        --
