abdullah.yildiz has asked for the
wisdom of the Perl Monks concerning the following question:
Hi,
I'm a beginner-level Perl user.
I want to generate some test data for comparing a couple of sorting algorithms.
To start, I want to generate a test set of N integers. I should then be able to read them back from the file I wrote them to.
To generate test data, I do the following:
use strict;
use warnings;

my $array_size = <STDIN>;
chomp($array_size);

#DEFINE NUMBER RANGE
my $range = $array_size;

open(DATASET, '>', 'dataset.dat') or die "Cannot write dataset.dat: $!";
for (my $count = 1; $count <= $array_size; $count++) {
    my $random_number = int(rand($range));
    print DATASET $random_number;
    if ($count != $array_size) {
        print DATASET "\n";
    }
}
close(DATASET);

#READ NUMBERS INTO AN ARRAY
open(DATASET, '<', 'dataset.dat') or die "Cannot read dataset.dat: $!";
my @numbers = <DATASET>;
close(DATASET);
chomp @numbers;    # strip the trailing newlines

my @sorted_numbers = sort { $a <=> $b } @numbers;
My question is: is this a good way to generate the test data and then apply the sort function, or is there anything wrong here?
Thank you for your help.
Re: How to generate test data? by roboticus (Chancellor) on Nov 24, 2012 at 17:21 UTC 
abdullah.yildiz:
I don't see anything immediately wrong. But I'll make a couple of suggestions:
...roboticus
When your only tool is a hammer, all problems look like your thumb.

Thank you for your suggestions.
Yes, I should write the data into a file.
Re: How to generate test data? by karlgoethebier (Vicar) on Nov 24, 2012 at 18:22 UTC 
Try this:
Update: Perhaps my answer wasn't as helpful as I intended, sorry.
You wrote: "At first, I want to begin by generating a test set for N integer numbers..."
So I just wanted to show a way to generate such a randomized test set. I hope very much that I didn't confuse you.
#!/usr/bin/perl
use strict;
use warnings;
#...get a seed
open(RANDOM, "<", "/dev/random") or die $!;
read(RANDOM, $_, 4);
close RANDOM;
srand(unpack("L", $_));
#...do the shuffle
my @k = ( 1..10 );
for ( my $i = @k ; --$i ; ) {
my $j = int( rand( $i + 1 ) );
next if $i == $j;
@k[ $i, $j ] = @k[ $j, $i ];
}
print join( " ", @k) . qq(\n);
#...and give your custom sort a chance
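For comparison, the core List::Util module ships a shuffle that produces the same kind of randomized test set in one call (a minimal sketch, assuming a reasonably modern Perl where List::Util is in core):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shuffle the integers 1..10 into a random order in one call.
my @k = shuffle( 1 .. 10 );
print join( " ", @k ) . qq(\n);
```

No seeding gymnastics needed either; Perl seeds rand itself unless you want reproducible runs.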
Regards, Karl
«The Crux of the Biscuit is the Apostrophe»

Thank you for your answer. I have another problem.
I'm trying to find the data size which causes the algorithm to take at least 15 minutes.
I do the following:
use Benchmark qw(timediff timestr);

open (DATASET_RANDOM_INTEGER, '<', 'DATASET_RANDOM_INTEGER.dat') or die $!;
@numbers = <DATASET_RANDOM_INTEGER>;
close (DATASET_RANDOM_INTEGER);

#MEASURE THE TIME DURING WHICH THE ALGORITHM IS PERFORMED
#START
$start = Benchmark->new;
#RUN THE ALGORITHM
@sorted_numbers = sort { $a <=> $b } @numbers;
#FINISH
$end = Benchmark->new;
$diff = timediff( $end, $start );
print "Sort took: ", timestr($diff), "\n";
However, this is very time consuming: for example, when I increased the input size to 100 million, I couldn't foresee how long it would take to finish (as I write this message, it has been running for two hours).
What is the way to accelerate the execution of my code so that it uses more CPU per unit of time?

abdullah.yildiz:
Regarding how to choose the size of a dataset to make it take 15 minutes: if I wanted to do that, I'd start by using progressively larger datasets to see how the time changes with dataset size. For example, look at these sample timings (in seconds) for three subroutines:
Dataset size | Subroutine A | Subroutine B | Subroutine C
        1000 |            6 |            1 |           30
        2000 |           11 |            4 |           40
        3000 |           17 |            8 |           48
        4000 |           22 |           16 |           55
Once I have a few samples, I'd try to predict the next dataset size. If you look at the values for subroutine A, it's a simple linear progression: it handles roughly 160-ish items per second across all four dataset sizes. So if I wanted to make it run for 15 minutes, I'd expect that to take about 15*60*160 data items. Subroutine B, however, isn't linear: it gets slower and slower as the dataset increases. In this case it takes roughly T = (X/1000)^2 seconds for a dataset of X items; solve for X when T = 15*60 seconds and that would be a reasonable prediction. The third subroutine starts out pretty slow, but you can see that the time it consumes changes less and less as you add data samples. (I was shooting for a logarithmic progression, but I don't feel like doing the math, so that one's left as an exercise for the reader!)
*HOWEVER*, these predictions assume that everything else stays the same as the dataset grows. You may find that past a certain dataset size an algorithm's runtime increases suddenly and drastically (for example, you might exhaust your main memory and the OS may start swapping). So rather than immediately going for 15 minutes, you might first predict a dataset size that should take less time, like one or two minutes, and see how far off you are. I frequently approach a final value by doubling each time (unless I'm using something like subroutine B).
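The "approach the target by doubling" idea can be sketched like this (a sketch only: Perl's built-in numeric sort stands in for your own algorithm, Time::HiRes is core, and the target time is set deliberately tiny here so the demo finishes quickly; raise it to 15 * 60 for the real 15-minute goal):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $target = 0.01;    # seconds; set to 15 * 60 for the 15-minute goal
my $n      = 1000;    # starting dataset size

while (1) {
    # Generate a fresh random dataset of size $n ...
    my @numbers = map { int rand $n } 1 .. $n;

    # ... and time just the sort.
    my $t0      = [gettimeofday];
    my @sorted  = sort { $a <=> $b } @numbers;
    my $elapsed = tv_interval($t0);

    printf "N = %10d  =>  %.4f s\n", $n, $elapsed;
    last if $elapsed >= $target;
    $n *= 2;    # double the size until we overshoot the target time
}
```

Each printed line gives you one sample for the table above, so you can also curve-fit instead of just doubling blindly.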
I hope this is somewhat helpful...
Modern computers are so fast, though, that I expect it'll take a pretty large dataset to consume 15 minutes. (That, or a sufficiently horrible sort algorithm.)
...roboticus
When your only tool is a hammer, all problems look like your thumb.