BrowserUk has asked for the
wisdom of the Perl Monks concerning the following question:
I'm testing an algorithm involving 2D coordinate pairs. I've been generating random datasets like this: my @points = map[ int( rand 500 ), int( rand 500 ) ], 1 .. $N;
Which is okay as far as it goes, but being pseudorandom, tends to pretty equally distribute the points throughout the 2D plane.
But the worse case scenarios for the algorithm are when you get a bunch of points grouped in one place and then a few outliers far away. Either in another bunch, or widespread.
I could generate a point; generate a bunch around that point (using a subrange of the coordinate space, and then fully randomly add a few more, b u u t... that doesn't seem kosher.
So, any thoughts on a 'better way'?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Randomly biased, random numbers. by Lennotoecom (Monk) on Dec 05, 2013 at 22:09 UTC 
the whitenoise is the only true way mate  [reply] 
Re: Randomly biased, random numbers. by RichardK (Priest) on Dec 05, 2013 at 23:05 UTC 
I think it might be worth trying polar coordinates, so that each new point is a random distance and direction form the last. Then you can tweak your distance distributions to suit your purpose. If you need to restrict the total area, you could wrap the 2d space or just ignore any points that fall outside.
Math::Trig has handy polar conversion functions, so it would be easy to give it a try.
 [reply] 

my $start = [ int( rand 500 ), int( rand 500 ) ];
my @points = map{
my $a = rand( 2*PI );
my $d = int( rand 500 );
my $x = $d * cos( $a );
my $y = $d * sin( $a );
$x = 500 while $x > 500;
$x += 500 while $x < 0;
$y = 500 while $y > 500;
$y += 500 while $y < 0;
[ $x, $y ];
} 1 .. $N;
Whilst it certainly tends to create clusters, the clusters tend to be (very) evenly distributed, which doesn't trigger the worse case.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] [d/l] 

my $d = max_distance * ( 1  log( rand(1000) ) / log( 1000 ) );
 [reply] [d/l] 

Re: Randomly biased, random numbers. by roboticus (Canon) on Dec 05, 2013 at 23:11 UTC 
BrowserUk:
I used to use a "density map" generator back in the day. Basically each time you want a random coordinate, you generate the coordinate plus one extra random number. You then check the number against a density function, and if the extra random number is less than the density at that location, you return the coordinates. Otherwise you try again:
sub generate_coordinate {
my $fDensity = shift;
my @coordinate;
{
@coordinate = (rand, rand);
redo if $fDensity>(rand, @coordinate);
}
return @coordinate;
}
sub density_uniform { 1 }
sub density_h_gradient { $_[0] < $_[2] }
sub density_v_gradient { $_[0] < $_[1] }
sub density_broken [ 0 }
sub density_lookup {
my ($chk, $r, $c) = @_;
return $lookup_table[$r][$c]<$chk;
}
So by defining an appropriate density map function, you can create many types of distributions. The disadvantage is that your density map function may reject too many candidates, slowing things down. (The density_broken function, for example, is quite useless in this regard.)
The reason that I thought you might like it is this: For testing a project, I would create distinct distributions by simply drawing an image with MSPAINT or similar and loading that into a lookup table and using a function like density_lookup.
I don't know your objections to your own suggestion, so I don't know if this one is interesting or not, though.
...roboticus
When your only tool is a hammer, all problems look like your thumb.  [reply] [d/l] 

I don't know your objections to your own suggestion,
Basically, it is too directed. That is, it requires parameters to be chosen  the ratio between clustered picks and nonclustered; the size of clustering subrange; etc.  which means I would essentially be choosing what to test and thus excluding anything I haven't thought of.
so I don't know if this one is interesting or not
This is very interesting. I particularly like the idea of using images  whether handdrawn, or grabbed at random from an image search  to bias the picking process. It has so many possibilities ...
Eg. grab a random image, process the image with a filter to reduce it to a just points of a particular color or hue; or maybe use a Conway's Life type process to manipulate the pixels until groups of similar hues the reduce to single points; or a dozen other ideas; and then use those points as my dataset.
The only problem with the idea is that it has triggered so many possibilities for investigation, I might never get back to testing the algorithm :) Thanks for kicking down the doors on the box of my thought train!
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] 

Math::Vector::Real::MultiNormalMixture generates density functions which may suit the OP case and that can also be randomly parametrized.
Anyway, one problem with this approach is that the ratio of a discarded points could be too high.
In that case, a more efficient way may be to divide the plane in regions (i.e. triangles), calculate the probability of every region and then generate the random points first picking a region and then a point inside the region with your proposed algorithm using the conditioned density function.
 [reply] 

salva:
Yes, the discard rate can be a problem. For the table lookup version in 2D, I had a speedup that worked decently: Generate the first coordinate, then use the same technique to generate the remaining coordinate. It can still be a problem, though, in the event you choose a "mostly black" line, but I never needed made anything better than that.
I also tried to make a "transformational" technique that wouldn't reject any coordinates, but never got it working well enough to use. (Rather: the technique worked fine, but coming up with the warping functions for the project was more difficult (for me, at any rate), so I simply tended to run the densitymapping version before going to lunch, going home for the day, etc.
The intention was: Generate a random coordinate, and "remap" it based on a displacement function. I was hoping to be able to turn a density function into a space warping function. The difficulties I had were primarily coming up with functions to warp space appropriately, and ensuring that I could hit any point in the desired output space without too much overhang. (If the function moved the point outside the desired range, you had to reject the point and try again anyway.)
Update: Made edit above.
...roboticus
When your only tool is a hammer, all problems look like your thumb.
 [reply] 
Re: Randomly biased, random numbers. by Laurent_R (Priest) on Dec 05, 2013 at 23:18 UTC 
Hmm, I had a similar problem a couple of years ago, trying to test worst case scenarios, although in a totally different context. I don't claim at all to be an expert on this kind of things, but my feeling is that if you want to test worst case scenarios, then you have to bite the bullet and accept that your input data is not going to be random, and there is nothing wrong with that since you are specifically looking for not random situations. Then, of course, the problem is that you don't necessarily know for sure what is the true worst case scenario, sometimes it is actually very difficult to find out. For example, Shell sort and comb sort are known to be fairly efficient sorting algorithms, but they are very rarely used because nobody seems to really know what their worst case scenario could be and it is therefore difficult to assess their worst case complexity.
 [reply] 
Re: Randomly biased, random numbers. by educated_foo (Vicar) on Dec 06, 2013 at 01:55 UTC 
Interesting question. You might want to be more explicit about exactly what characteristics you want in this distribution, i.e. what phenomenon generates your data. For example, I'm not a physicist, but for beams of particles with varying intensities hitting a detector plate subject to random noise:
 Cluster means distributed uniformly over a rectangular area.
 Number of points per cluster following an exponential distribution (or whatever the beam intensities follow).
 Points in each cluster distributed as a 2D Gaussian. (With variance fixed or chosen according to some distribution?)
 Number of clusters either fixed or chosen according to some distribution.
 [reply] 

You might want to be more explicit about exactly what characteristics you want in this distribution,
That's hard. Mostly because I definitely do not want any formally defined distribution. Almost exactly the opposite in fact.
The problem with "random" is that on average, the points will be evenly distributed over the range (2D plane in this case). That's almost the exact definition of PRNG.
Whilst with enough iterations, all possible distributions  including those where a majority of the points in the sample size tend to be grouped or clustered on one side or in one corner of the plane  with even a relatively small plane, (500,500) and sample size 100  there are so many 'roughly even' distributions and so few 'lopsided' distributions, that I'd need to run billions of sets to ensure I'd tested a few of the lopsided ones. That's not practical.
So, I'm looking for a way to induce lopsided  which pretty much means not 'roughly evenly distributed'  distributions, without prescribing where or why the concentrated and sparse bits will be.
I can't think of any better description than: I want lopsided distributions that are completely randomly generated.
Not good I know, but its the best I've got.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] 

I definitely do not want any formally defined distribution.
Do you have sample data you can perturb, mix, or otherwise use to generate test data?
I want lopsided distributions that are completely randomly generated.
Hm... Maybe transform your PRNG through a random, monotonic, nonlinear mapping? e.g. generate a piecewiselinear function (or spline) in each dimension with steps taken from the PRNG, then generate uniform random points and apply the function to them. I suspect a Real Statistician would scoff, but I am not such a person.
 [reply] 


 
That's hard. Mostly because I definitely do not want any formally defined distribution
Download random pictures from the Internet and use them as the base to generate density functions.
You may apply some simple transformations (for instance, dynamic range decompression) to obtain more disparate distributions.
 [reply] 


 Re: Randomly biased, random numbers. by CountZero (Chancellor) on Dec 06, 2013 at 07:26 UTC 
Did you already have a look at Math::Random::OO?
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity."  The Tao of Programming, 4.1  Geoffrey James My blog: Imperial Deltronics
 [reply] 

 [reply] 

It can "plug in" (and actually also provides) different types of distribution function which may better serve your needs.
CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity."  The Tao of Programming, 4.1  Geoffrey James My blog: Imperial Deltronics
 [reply] 
Re: Randomly biased, random numbers. by Corion (Pope) on Dec 06, 2013 at 08:10 UTC 
I wanted an "aesthetic" distribution of points over a plane, so I wrote Random::PoissonDisc. This module tries to distribute points "equally" across a plane with a minimum distance between each other.
At the release time of the module, I wrote a blog post on Random::PoissonDisc which has graphics showing the distribution differences of "white noise" random and the "blue noise" random that Random::PoissonDisc produces.
If the distribution properties of Random::PoissonDisc match your requirements, you will likely want to reimplement the module using PDL if you want better performance. The paper which outlines the algorithm is also linked from the module documentation.
 [reply] 

Actually, that seems to be almost the antithesis of what I'm after. It seeks to restrain the randomness to a regular(ish) grid pattern. My problem is that randomly picking points in the plane is too uniform.
I want clumps, but I want them at a larger scale than they form naturally from purely random distribution.
However, the article you linked, linked an article that linked an article on (something possible or possibly not) called "Perlin Noise" which, whatever it is called, looks like it could be just what I need ... Thanks for the lead.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] 
Re: Randomly biased, random numbers. by hdb (Parson) on Dec 06, 2013 at 08:46 UTC 
Try to find pictures of copulas, google "copula plots". Those should give you ideas how you want your data to look like. Once you decide for one, there usually is an algorithm to generate the data corresponding to that copula.
 [reply] 

 [reply] 
Re: Randomly biased, random numbers. by QM (Vicar) on Dec 06, 2013 at 09:46 UTC 
Sounds like you want to a number of distributions, from "one big clump" to "almost uniform", at random.
What about generating several "poles" as clump attractors, each with a random weight for attractiveness. Then let each point "roll" down hill according to the attractor weights and distances. So generate 3D coordinates, and use the extra value as the weight. Something like this:
for my $trial (1..$max_trials) {
my @attractors;
for my $attractors (1..int(rand($max_attractor))+1) {
my @attractor = get_random_tuple(3,\@attractor_limits);
push @attractors, \@attractor;
}
my @grid_points;
for my $points ($min_points..$max_points) {
my @point = get_random_tuple(2,@grid_point_limits);
# $alpha limits the distance moved downhill, and could be rand
+om
@point = roll_down_hill(\@point, \@attractors, $alpha);
push @grid_points, \@point;
}
my $result = analyze_data(\@grid_points);
}
sub get_random_tuple {
my $size_of_tuple = shift;
my $limits_ref = shift;
my @tuple;
for my $limit (@$limits_ref) {
push @tuple, rand($limit);
}
return @tuple;
}
sub roll_down_hill {
# left as an exercise for OMAR!
}
You can also play with variations of roll_down_hill to try different distributions, going from sublinear to exponential (or maybe some of each), trying a host of different functions for transformations. And don't forget you can play with the distributions wherever you take a random number, including the number of attractors and their weights. For instance, you might find that fewer attractors are more likely to cause worst case behavior, and bias the number of attractors in the small direction.
QM

Quantum Mechanics: The dreams stuff is made of
 [reply] [d/l] [select] 
Re: Randomly biased, random numbers. by salva (Monsignor) on Dec 06, 2013 at 09:50 UTC 
 [reply] 

Yes. I remember playing with them for days.
Whilst it does eventually produce really nice clustering, it does take an inordinate amount of time. And then there is the problem of deciding when is enough.
I'd prefer to find something a little more direct. (BTW:Love the vids.)
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] 
Re: Randomly biased, random numbers. by kcott (Abbot) on Dec 06, 2013 at 10:09 UTC 
G'day BrowserUk,
"But the worse case scenarios for the algorithm are when you get a bunch of points grouped in one place and then a few outliers far away. Either in another bunch, or widespread."
Perhaps the following might be suitable.
The "bunch of points grouped in one place" tend towards the centre of the plane;
the outliers are typically widespread rather than forming outlying clumps.
#!/usr/bin/env perl
use strict;
use warnings;
my $rand_bias; # Boolean (for testing)
my ($x_high, $y_high) = (80, 80);
my $N = 10_000;
for my $rand_bias_bool (0, 1) {
$rand_bias = $rand_bias_bool;
print '*** ', $rand_bias ? 'WITH' : 'NO', ' RANDOM BIAS', " ***\n"
+;
my @points = map [ gen_rand($x_high), gen_rand($y_high) ], 1 .. $N
+;
print_matrix(\@points);
}
sub gen_rand {
my $rand_high = shift;
return int rand $rand_high unless $rand_bias; # For testing only!
my $high_part = $rand_high;
my $rand_sum = 0;
while ($high_part) {
my $rand_arg = 1 + int rand $high_part;
$high_part = $rand_arg;
$rand_sum += int rand $rand_arg;
}
return $rand_sum;
}
sub print_matrix {
my $coords = shift;
my %matrix;
for (@$coords) {
my ($x, $y) = @$_;
++$matrix{$x}{$y};
}
for my $y (reverse 0 .. $y_high  1) {
for my $x (0 .. $x_high  1) {
if (exists $matrix{$x}{$y}) {
$matrix{$x}{$y} = '@' if $matrix{$x}{$y} > 9;
}
else {
$matrix{$x}{$y} = ' ';
}
print $matrix{$x}{$y};
}
print "\n";
}
}
Here's a sample run showing an (ASCII) graphical representation of the generated data.
Note that I've only used an 80x80 grid (rather than your posted 500x500) to get a reasonable display.
Update: I removed two lines with $loops which I'd used for testing but aren't part of the solution.
 [reply] [d/l] [select] 

That does induce clumping, but it is always just one clump, always in the middle, and very uniform from one run to the next, which doesn't make for good test data.
I want multiple clumps  1, 2, 3, or 4  and unevenly distributed  just 2 big clumps in one corner; or 4 small clumps all down one side or ...  and different on each run.
Everything mathematical I've tried, tends to produce uniformity, which is kind of it's thing.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] 
Re: Randomly biased, random numbers. by SuicideJunkie (Priest) on Dec 06, 2013 at 19:05 UTC 
After skimming through the thread, my idea is:
 Generate a density gradient image with the clouds render in gimp/photoshop.
 Apply distortions for flavour.
 Save above as a macro so you can generate lots of distributions with one button press.
 Save as greyscale BMP
 For each pixel, convert it to a sample point if brightness > rand(1000)
Note: I presume you don't need an exact number of samples, but if you do, just shuffle the list and then pick the first N.
EG:
http://i.imgur.com/OvPWjWf.png  GIMPed Difference Clouds + more contrast + gamma correction for darker darks
http://i.imgur.com/Di5wzG4.png  Resulting random samples. (about 3000)
 [reply] 
Re: Randomly biased, random numbers. by roboticus (Canon) on Dec 07, 2013 at 15:56 UTC 
BrowserUk:
Here's another idea that occurred to me, so I thought I'd toss it outbuild a function that generates a coordinate and then randomly maps it into a "clump" for you. In order to prevent any of your area from being ignored, there's also a chance that it won't be mapped to a clump:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
sub generate_clumpy_rand {
my $num = shift // int(10*rand()+3);
my @clumps;
print "Generator has $num clumps\n";
for my $i (1 .. $num) {
my @center_size = (rand, rand, rand, rand);
print "clump 1 centered at ($center_size[0],$center_size[1]) "
. "size is $center_size[2]x$center_size[3]\n";
push @clumps, \@center_size;
}
return sub {
my @coord = (rand, rand);
# Ensure we have a background of random dots over entire regio
+n
return @coord if (.1 > rand);
# Choose a clump, and map the coordinate into that clump
my @clump = @{$clumps[rand @clumps]};
return (
($coord[0]*$clump[2])+$clump[0],
($coord[1]*$clump[3])+$clump[1]
);
}
}
my $clump_rand = generate_clumpy_rand(1);
my @coord = $clumpy_rand>();
print Dumper(\@coord);
The good point is that it's simple and fast. It's a proofofconcept, though, and I didn't do any work to ensure that the clumps don't extend beyond the desired range. If you like the idea, then that task is left as an exercise for the reader ;^D.
Update: I forgot to mention: The clumps in this version are rectangular. Changing their shape is possible, but I didn't bother doing that, either.
...roboticus
When your only tool is a hammer, all problems look like your thumb.  [reply] [d/l] 

That's essentially a variation on my "I could generate a point; generate a bunch around that point (using a subrange of the coordinate space, and then fully randomly add a few more, ".
With a few tweaks to constrain the output to the range 01, that does produce definite clumps and some odd ones. the main problem with it is the hard edges to the clumps with no graceful transition.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] 
Re: Randomly biased, random numbers. by GrandFather (Cardinal) on Dec 08, 2013 at 10:04 UTC 
A little late to the party, but maybe interesting. The following code generates a random set of "attractors" which tend to suck near by randomly generated points closer to the attractor. Attractors have a radius which limits their effect. Nearby attractors fight with each other which results in oddly shaped clumping, which is most likely a desirable outcome.
The script above plots results using Tk and includes a little commented out code that was used while tuning the code. At present both the original points and the biased points are plotted.
True laziness is hard work
 [reply] [d/l] 

A little late to the party, but maybe interesting
Thanks for the reply and code.
I played with this a little and it definitely produces clumps. The problem with it (for my purposes) is that it always leaves a background of relatively uniformly distributed points, and I couldn't see any way of preventing that.
For my purpose, I really want clear space around and between the clumps, which the weight map method both achieves and gives fine control over.
However, your "attractors" idea sparked another idea in me  completely unrelated to OP purpose  but intriguing enough to sidetrack me for a day or so.
What happens if you treat a set of clumpy random points as equally sized particles of some finite mass, and have them all exert "gravity" upon each other according to the inverse square of their distance.
Tada. The red dots are the starting position, the cyan their position after a single round of attraction, and the blue, after the second.
Ignore the ones that 'slingshot' off into outer space; the effect of the inverse square law is that in order to get detectable influence over any distance, you have to set the mass of the points quite high, with the consequence that once they get very close to each other, they exert huge forces that results in sling shots because there are no collisions.
Over many generations, the gravity will cause all the points to come together (and then be scattered in all directions), but the first few generations have the effect of concentrating whatever clusters already exist. I wonder if this wouldn't be a fruitful technique for tackling the NPhard clustering problem without resorting making assumptions per the Kmeans type of algorithm?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks  Silence betokens consent  Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
 [reply] 

