Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re: Processing data with lot of math...

by BrowserUk (Patriarch)
on May 12, 2004 at 17:07 UTC ( [id://352838]=note: print w/replies, xml ) Need Help??


in reply to Processing data with lot of math...

I think that it is possible to do this with one loop, and almost no math!

However, I could be missing something, so read, digest, cogitate and draw your own conclusions.

For example, if you have an atom located at point ( 10, 10, 10 ), and the cut-off distance is 1 unit, then any atom that will be paired with this atom will have to be somewhere within an box defined by the points

p( 09, 09, 09 ) p( 09, 11, 09 ) p( 09, 11, 11 ) p( 09, 11, 11 ) p( 09, 11, 11 ) p( 11, 09, 09 ) p( 11, 09, 11 ) p( 11, 11, 11 )

So, instead of storing the x:y:z in separate arrays, or some other complex data structure, create a single array, indexed by the concatentations of x:y:z, (with leading zeros).

$atoms[ 090909 ] = 'name1'; $atoms[ 091109 ] = 'name2'; $atoms[ 091111 ] = 'name3'; $atoms[ 091111 ] = 'name4'; $atoms[ 091111 ] = 'name5'; $atoms[ 110909 ] = 'name6'; $atoms[ 110911 ] = 'name7'; $atoms[ 111111 ] = 'name8';

Once you have created this (sparse) array, it becomes very easy to pick out the limited set of possible pairings for any given point. It requires a final math check to exclude those point lying in the corners of the box, but the number of calculations should be significantly less than the brute force method.

There are some problems with this.

  1. I'm using integer coordinates, yours may well not be.

    If so, I would use a scaling factor upon each of the three coordinates to bring them to some manageable range of integers, and store the real coordinates along with the name. I would do this as a single string rather than as an array ref, as it costs very little extra memory to store a longer scalar in an array element, but creating thousands of small arrays costs lots.

  2. Depending upon how easily your coordinates map to integer ranges, and the size of those ranges, the size of the array will likely be enormously sparse, and too big for memory.

    In which case it makes sense to use a hash to implement a sparse array. The downsides of this is that the hash takes substantially more memory, and you then need to sort the hash keys prior to processing them.

    Depending upon how meany thousands of atoms we are talking about, this could well lead you to requiring more memory than is available, or even possible (assuming a 32-bit processor).

At this point the solution becomes to process the files and create a single output file where each record is of the form

xxyyzz name x.xx y.yy z.zz

A single pass to create the file. Then sort the file (using your systems sort utility) and the a single pass to read the sorted output and do the final comparisons and write out the pairs in whatever format you need them.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail

Replies are listed 'Best First'.
Re: Re: Processing data with lot of math...
by itub (Priest) on May 12, 2004 at 18:44 UTC

    Hey, this approach works really well! I'll include it in my next version of Chemistry::Bond::Find. It scales almost linearly, maybe even better than the recursive partitioning algorithm as I implemented it, and it avoids some of the overhead of the recursive calls and the creation and passing of lots of data structures.

    For my test file with 1717 atoms, my recursive algorithm took about 9.7 seconds, while the new method based on your post took 3.2 seconds.

    I should remember that you can throw hashes at almost any problem with Perl...

      That's good. Unfortunately, anything involving recursion in perl tends to hit performance quite badly once the volumes of data rise.

      If the mapping to integer coords is working okay, then you could (potentially) save a bit more time on large volumes of data by concatenating the pack'd integers rather than the ascii representations.

      $hash{ join'', pack 'N3', $x, $y, $z } = 'name x.xx y.yy z.zz';

      This would allow you to do an alpha sort rather than a numeric one, and avoids the need to add leading zeros to the values. It might also conserve a little space, though you are unlikely to see this unless your resultant concatenated ascii values are greater than 12 characters.

      You'd need to benchmark some typical datasets to see what if any benefit this would have, and where an transitions occured.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

        Regarding the concatenations, my implementation has several differences with respect to your original proposal:

        • The real coordinates are normalized by dividing by the cutoff (which is relatively small), and rounded to the next lower integer (i.e., floored). The "concatenation" is numerical; $n = 1000000*$x + 1000*$y + $z + 500500500. Problem here: I'm assuming that the entire system is within +/- 500*$cutoff. I'll have to fix that somehow.
        • I use a hash to store the atoms. Each "bucket" ($hash{$n}) may contain more than one atom (usually 1-4), so it's actually an array reference. Atoms are Chemistry::Atom objects, which know their own symbol, name, real coordinates, etc.
        • The buckets don't need to be sorted; I just loop for each of them, and then check for bonds with the N^2 algorithm within the bucket (remember N is small), and then check for bonds with the atoms in the neighboring buckets (this requires N*M checks if there are M atoms in the neighbor). "Neighbor" is defined in a way that avoids counting the same pair twice.

        Note: the neighbors of $n are $n+1, $n+1000, $n+1000000, etc., for a total of 13 neighbors (a cube is surrounded by 26 cubes, but you only need half of them to avoid counting twice)

Re: Re: Processing data with lot of math...
by qhayaal (Beadle) on May 14, 2004 at 06:16 UTC
    Yes, but can this be simplified?

    For example, I can store all the boxes that exist in an array as: $box(i)="L:M:N". Now I can store cooridinates as:

    $x[l][m][n][i]=Xvalue of i'th atom in the box $y[l][m][n][i]=Yvalue of i'th atom in the box $z[l][m][n][i]=Zvalue of i'th atom in the box
    Then all I would need to do is to loop over all boxes (using @box) and checking existence of it's neighbouring boxes (I would need a hash here, I guess... Emmm..., box should be a hash and I can use its values in the above), and for all those exists I can loop over their atoms.

    Does this make sense or sound better?

      Okay. From this post I think that my original post did not explain the method I was trying to outline very well. Also, looking back at itubs reply, he also seems not to have fully understood what I was getting at.

      I've put together a crude simulation using some randomly generated data, and it seems that using my method will process: 10,000 atoms, in around 17 seconds using < 6 MB memory. 20,000 atoms, in around 112 seconds using < 11 MB 30,000 atoms, in around 350 seconds using < 14 MB.

      From these crude tests, I think it would handle 100,000 atoms in around 2 1/2 hours using 40 MB ram.

      Update Ignore the figures above. I discovered a stupidity in my testcase that meant I was doing way more work than I needed to. I can now processing 100,000 atoms using 50MB in under 4 minutes.

      This is without any attempt to really optimise things. I have no real feel for how this compares with other methods as I don't know what other methods are used.

      I don't know for sure that I am processing the data correctly.

      If you have a smallish (1000 or 2000 atom) dataset (with the desired results for a specified cutoff), that you could make available to me somehow?

      Posting this here is probably not a good idea:), but I could let you have my email id.

      Then I could see if what I am doing even approximates to giving correct results before I go posting triumphantly.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://352838]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (8)
As of 2024-04-23 17:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found