PerlMonks  

Re: Huge data file and looping best practices

by przemo (Scribe)
on Apr 26, 2009 at 16:32 UTC ( [id://760149]=note )


in reply to Huge data file and looping best practices

Looks like we have O(N^2*K) complexity (N=number of patents, K=number of characteristics) here... not good...

If you really have to count the differences for every pair of patents, try using Bit::Vector for every line (i.e. one characteristic = one bit): XOR the two vectors and count the set bits in the result, which gives the number of positions in which they differ. This should take less memory and, I think, should also be faster.
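A minimal sketch of the XOR-and-count idea, using only core Perl (`vec`, string XOR and the `unpack` checksum idiom) rather than Bit::Vector itself, so it runs without extra modules. The helper names `to_bits` and `diff_count`, the value of `$K` and the sample characteristic indices are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed for illustration: 16 characteristics, numbered 0 .. 15.
my $K = 16;

# Pack a list of set characteristic indices into a bit string.
sub to_bits {
    my ($k, @set) = @_;
    # Pre-size the string so every patent has the same length.
    my $bits = "\0" x int(($k + 7) / 8);
    vec($bits, $_, 1) = 1 for @set;
    return $bits;
}

# Number of positions in which two equal-length bit strings differ.
sub diff_count {
    my ($x, $y) = @_;
    # String XOR leaves a 1 wherever the inputs differ;
    # the %32b* checksum counts those set bits.
    return unpack('%32b*', $x ^ $y);
}

my $p1 = to_bits($K, 0, 3, 5, 9);
my $p2 = to_bits($K, 0, 4, 5, 15);
print diff_count($p1, $p2), "\n";    # differ at 3, 4, 9, 15 => prints 4
```

With Bit::Vector the same two steps apply (XOR two vectors, then count the set bits); check its documentation for the exact method names.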

I do have some doubts about the time needed, though: you would have to write about (8*10^6)^2/2 = 32 "teralines" of output.
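As a quick check of that estimate, the exact number of unordered pairs among N items is N*(N-1)/2, which for N = 8 million is just under 32 * 10^12. The helper name `pair_count` is an assumption for illustration.

```perl
use strict;
use warnings;

# Number of unordered pairs among $n items: n * (n - 1) / 2.
sub pair_count {
    my ($n) = @_;
    return $n * ($n - 1) / 2;
}

# One output line per pair of patents.
printf "%.3e pairs for 8 million patents\n", pair_count(8_000_000);
```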

Re^2: Huge data file and looping best practices
by carillonator (Novice) on Apr 26, 2009 at 16:55 UTC
    thanks @przemo. I'll check out Bit::Vector. We think there are only about 400,000 unique characteristic sets among the 8 million patents, so we'll probably end up dealing with that data set instead, on account of the absurd amount of data this program would return.

      If you have 400,000 unique characteristic sets among the 8 million patents, now you're getting somewhere. If you could find a way to consistently stringify a given set the same way each time it comes up, you could turn that into a hash key, and as its value create a data structure of patent names. Now you have a workable structure that could be split into manageable files based on the groupings by unique characteristics.
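One way the hash-key idea could look, as a rough sketch: sort each patent's characteristics so the same set always produces the same string, then use that string as the key of a hash of arrays. The sample data, the helper name `set_key` and the choice of "\x1F" as a separator (assumed never to occur in a characteristic name) are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: patent id => list of characteristic names.
my %patents = (
    P1 => [qw(red round small)],
    P2 => [qw(small red round)],    # same set as P1, different order
    P3 => [qw(blue square)],
);

# Canonical key: drop duplicates, sort, then join with a separator
# that is assumed not to appear in any characteristic name.
sub set_key {
    my (@chars) = @_;
    my %seen;
    my @uniq = grep { !$seen{$_}++ } @chars;
    return join "\x1F", sort @uniq;
}

# Canonical key => array ref of patent ids sharing that set.
my %group;
for my $id (sort keys %patents) {
    push @{ $group{ set_key(@{ $patents{$id} }) } }, $id;
}

for my $key (sort keys %group) {
    (my $label = $key) =~ tr/\x1F/,/;    # readable form for printing
    print "$label: @{ $group{$key} }\n";
}
```

Pairwise comparisons then only need to run over the keys of `%group`, and each result applies to every patent listed under the two keys.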

      ...just a thought, though I'm not sure how helpful it is.


      Dave
