|Problems? Is your data what you think it is?|
Comparing two arraysby baxy77bax (Chaplain)
|on Dec 15, 2013 at 10:20 UTC||Need Help??|
baxy77bax has asked for the
wisdom of the Perl Monks concerning the following question:
I realize this is a long shoot but maybe , just maybe someone willing to share can help me. The problem that I am trying to solve involves two integer arrays. One is static and one changes. Let x be the array whoes content does not change and y the one that changes. both arrays are 15000 int's long and they only contain values of 0 and 1. Example:
there are about 5000000 y-arrays that i need to compare with 100000 x arrays. What I could do is to start from 0 and pairwise compare each entry(x1<=>y1, x1<=>y2 ... x2<=>y1, ...xn<=>yn). Though fast this is just not fast enough (by my calculations that could last several days on a cluster machine i have at my disposal and i need to repeat this app. 1000 times each month).
the other thing i could do is to preprocess y arrays so that i only store positions where 1's are. The for each y just check if x[y[i]] == 1 since there are about 500 times less 1's then 0's this is going to run much faster (and will save a lot of memory). And this is where i stopped. However, with this approach it takes me 1 sec to compare one x with all y arrays. When extrapolated to 100000 x arrays this turns out to be a lot of time, especially when considered that i need to repeat this 1000 times. Aditional information. My end result is not to know, for each (x,y) array how many 1's they share just to know what are the top 10 y arrays that share the most 1' with each x array. Distribution od 1's in both y and x arrays is random (by the looks of it they maybe uniformly distributed) My goal is to speed this comparison up at least 3 times because then i can do something with it. Obviously to speed this up i need to do as less comparisons (checkups) as possible. I was thinking about computing a hash key (a checksum of come sort) but then i would be able to identify only identical (x,y) pairs. I was also thinking of converting x array to a bit array but am not sure how will this help me speed-wise. I did rewrite this in c and i gained a lot of speed but still not enough. As you can see I am stuck. So if anyone has any input for me please do not hesitate to post, no matter how crazy or slow the idea is. And i apologize if this is not strictly speaking a Perl (syntax, code -wise) related issue.
Maybe there should be a subforum for this type of questions/discussions as Perl is being used more and more for theoretical/conceptual problem solving tasks , because you don't have to worry about technicalities and since a lot of people that do this type of work are coming from c there is no syntax leap.
Anyway, thank you