Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Random data generation.

by BrowserUk (Patriarch)
on Jun 26, 2010 at 02:13 UTC ( [id://846630]=perlquestion: print w/replies, xml ) Need Help??

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I want to generate strings of length N, draw from a set of M chars, but with no more than two of any given char consecutively.

Examples:

set ABCDEF; length 12; Valid: ABCDEFABCDEF AABBCCDDEEFF Invalid: AAABCDEFABCD

I don't know whether this is tougher than it first appears, or I'm just having a bad head day. but my attempts so far involve cumbersome loops with the setting and resetting of myriad flags.

Any thoughts on a clean way to do this?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: Random data generation.
by ikegami (Patriarch) on Jun 26, 2010 at 02:37 UTC

    This can be solved simply by viewing the set as a circular list. Just pick one of the next N-1 characters.

    # No two repeated characters my @set = qw( a b c d e f ); my $len = 12; my $s = ''; my $i = int(rand(@set)); for (1..$len) { $s .= $set[$i]; $i = ( $i + int(rand($#set)) + 1 ) % @set; } print("$s\n");
    dbadfbabfbcd deabfadaeafb edbaceabacad

    Update: Fixed bugs, then I realized I misread the question. On the plus side, the solution can easily be adapted to solve the actual problem.

    # No more than two repeated characters my @set = qw( a b c d e f ); my $len = 12; my $s = ''; my $i; for (1..$len) { if ($s =~ /(.)\1\z/) { $i = ( $i + int(rand($#set)) + 1 ) % @set; } else { $i = int(rand(@set)); } $s .= $set[$i]; } print("$s\n");
    aadbadfafbfb dbfaefababea efeabbcebbee
Re: Random data generation.
by jwkrahn (Abbot) on Jun 26, 2010 at 12:41 UTC
    $ perl -le' use List::Util qw/ shuffle /; my @chars = qw/ A B C D E F /; my $length = 12; my $repeats = 2; my @set = shuffle( ( @chars ) x $repeats ); die "Error!\n" if @set < $length; print join "", @set[ 0 .. $length - 1 ]; ' FCBECFADABED

      That's the one. For the life of me I don't see how it works yet, but it does. Many thanks.

        As there are only two of each character in the set which is shuffled ((@chars) x 2), there can of course also only occur two in a row at maximum.

        OTOH, this technique doesn't allow more than two of any character in the result (like AACBAFFDCAEC), so it might introduce an unwanted bias... (you didn't explicitly say whether this is desired or not, and it cannot be inferred from your valid samples).

        It's not random. Not even close. For example, you'll never have three of the same character anywhere in the string.

        I don't know if it matters, but it only works if $len <= @set*2

Re: Random data generation.
by AnomalousMonk (Archbishop) on Jun 26, 2010 at 05:57 UTC

    ikegami's is a general solution for all cases, but would it be sufficient in non-extreme cases (i.e., not trying to generate 100-character strings from just two different characters and with no more than two sequential repeats) to just generate bunches of strings and throw away all that are non-compliant?

    >perl -wMstrict -le "use List::Util qw(shuffle); my @set = qw(a b c d e f); my $len = 12; my $needed = shift; my @out; while (@out < $needed) { my $str = join '', shuffle((@set) x $len); push @out, grep !m{ (.)\1\1 }xms, $str =~ m{ .{$len} }xmsg; } @out = @out[-$needed .. -1]; m{ (.)\1\1 }xms and die qq{3+ same: '$_'} for @out; printf qq{%3d: '$out[$_]' \n}, 1+$_ for 0 .. $#out; " 5 1: 'becaecdbbedc' 2: 'afccbecefdca' 3: 'fbffecacfeaa' 4: 'ddbacddfafca' 5: 'fbebdabadedc'

      You have it backwards. The smaller the set and the longer the string, the higher the chance of producing something that needs to be thrown away. What you call the non-extreme case is where your algorithm has the most problems.

      The real problem is that your solution isn't random. The chance of picking a characters at a certain position is affected by the characters picked in previous positions.

        The smaller the set and the longer the string, the higher the chance of producing something that needs to be thrown away.

        I thought that that, in essence, was implied by what I wrote.

        What you call the non-extreme case is where your algorithm has the most problems.

        I take 'problems' to mean the generation of strings that don't meet the max-repeated-characters requirement and so must be thrown away.

        So, if my algorithm was trying to generate strings of length 3 from a character set of 100 characters (my idea of a very 'non-extreme' case), which in the totally random case (OK, OK, my algorithm is only approximately random – but see below) would result about 1 in a million times in a string of three identical characters that needed to be discarded, this would be more problematic than trying to generate strings of, say, length 100 from a set of three characters? I don't understand.

        The real problem is that your solution isn't random.

        I agree it isn't completely random, but, as others have noted, it wasn't clear from the OP and subsequent discussion that complete randomness was required. I was aiming for something quick and dirty that would satisfy the max-repeated-characters requirement while still being sort of random-ish.

Re: Random data generation.
by bluescreen (Friar) on Jun 26, 2010 at 02:59 UTC

    Is this what you looking for?

    #!/usr/bin/perl use strict; my @set = ('A', 'B', 'C', 'D', 'E', 'F'); my $length = 12; my $result; my $last_char; while(length($result) < $length) { my $char = $set[int(rand()*(scalar(@set)-1))]; next if ( $last_char eq $char); $result .= $char; $last_char = $char; } print $result;

      Your approach could be made to work by simply checking the last two characters:

      my @set = qw( A B C D E F ); my $len = 12; my $result = ''; while (length($result) < $len) { my $char = $set[ rand @set ]; next if substr($result, -2) eq $char x 2; $result .= $char; } print $result;

      Also, there's no need to use an extra variable (like $last_char), as $result already stores all the previous characters.

        Absolutely it was a quick-&-dirty script, probably to much :D, I knew that instead of $last_char I could have used substr(), but premature optimization made me do that.

        Anyway this sort of social coding brought an interesting result

      You made the same error I did originally, you attempted to solve "no repeating characters" rather than "no more than two repeating characters".

      Furthermore, your code never uses the last character of the set.

      Not quite. I have to allow two consecutive chars, but not tripled or more...

Re: Random data generation.
by salva (Canon) on Jun 26, 2010 at 15:40 UTC
    my @set = split //, "ABCDEF"; my $len = 12; my $max_reps = 2; say gen(\@set, $len, $max_reps) for (1..20); sub gen { my ($set, $len, $max_reps) = @_; my $out = ''; my $last_ix = -1; my $reps = 0; for (1..$len) { my $ix; if ($reps >= $max_reps) { $ix = int rand(@$set - 1); $ix++ if $ix >= $last_ix; } else { $ix = int rand(@$set); } if ($ix == $last_ix) { $reps++; } else { $last_ix = $ix; $reps = 1; } $out .= $set->[$ix]; } $out; }

      Yep. That's pretty much exactly what I came up with for my first attempt. Though I think I needed one more flag.

Re: Random data generation.
by johngg (Canon) on Jun 26, 2010 at 23:12 UTC

    This seems to work, if I've understood correctly.

    use strict; use warnings; use 5.010; say genString( 12, ( q{a} .. q{f} ) ); say genString( 20, ( q{a} .. q{c} ) ); say genString( 20, ( q{a} .. q{b} ) ); sub genString { my( $len, @set ) = @_; my $string = q{}; for ( 1 .. $len ) { if ( $string =~ m{(.)\1$} ) { my $let; do { $let = $set[ int rand @set ] } until $let ne $1; $string .= $let; } else { $string .= $set[ int rand @set ]; } } return $string; }

    The output.

    bceffcedefda bccbbcaabbccacabbcaa abaabbaabaabababbabb

    You would need to add a test to prevent an endless loop if given a set with only one member.

    I hope this is of interest.

    Cheers,

    JohnGG

      A working solution, but like others that invoke the regex engine, it proves slower, especially for longer strings, than a flags & conditions solution. (Eg. salva's 846681).

      [16:16:52.15] c:\test>junk -I=3 ABCDEF 12 Rate x0 x3 x8 x4 x6 x5 x7 x0 9284/s -- -7% -16% -33% -38% -38% -47% x3 9946/s 7% -- -10% -29% -33% -34% -44% ikegami x8 11091/s 19% 12% -- -20% -26% -26% -37% johngg x4 13951/s 50% 40% 26% -- -7% -7% -21% ikegami variant x6 14952/s 61% 50% 35% 7% -- -0% -15% x5 14964/s 61% 50% 35% 7% 0% -- -15% x7 17647/s 90% 77% 59% 26% 18% 18% -- salva [16:17:13.48] c:\test>junk -I=3 ABCDEF 120 Rate x0 x3 x4 x8 x6 x5 x7 x0 279/s -- -8% -10% -78% -83% -83% -91% x3 302/s 8% -- -3% -76% -81% -81% -90% x4 310/s 11% 3% -- -75% -81% -81% -89% x8 1250/s 348% 314% 303% -- -22% -22% -58% x6 1595/s 472% 428% 414% 28% -- -0% -46% x5 1597/s 472% 429% 415% 28% 0% -- -46% x7 2951/s 958% 877% 851% 136% 85% 85% -- [16:17:30.86] c:\test>junk -I=3 ABCDEF 1200 Rate x0 x4 x3 x8 x6 x5 x7 x0 3.38/s -- -1% -3% -97% -98% -98% -99% x4 3.42/s 1% -- -1% -97% -98% -98% -99% x3 3.46/s 3% 1% -- -97% -98% -98% -99% x8 121/s 3478% 3433% 3388% -- -24% -24% -63% x6 159/s 4608% 4548% 4489% 32% -- -1% -51% x5 160/s 4633% 4573% 4513% 32% 1% -- -51% x7 324/s 9499% 9377% 9256% 168% 104% 103% --

      As you can see. salva's solution wins easily on performance.

        I wonder if you've overlooked my solution, or if there's anything wrong with it.

        Not only is it simpler (both less lines of code and conceptually easier to understand, IMHO), it's also consistently faster than what you found to be the fastest:

        $ ./846784.pl Rate salva almut salva 881/s -- -15% almut 1037/s 18% --

        (tested with v5.10.1, x86_64-linux-thread-multi)

Re: Random data generation.
by Khen1950fx (Canon) on Jun 26, 2010 at 04:01 UTC
    I tried it with Test::LectroTest::Generator.
    #!/usr/bin/perl use strict; use warnings; use Test::LectroTest::Generator qw(:common :combinators); my $str_gen = String( charset => "a-f", length => [12,] ); for my $string_guidance (1) { my $i = $str_gen->generate($string_guidance); print "$i "; } print "\n";
    Update: Trying to get a work-around for the constraint, I kept coming across warnings that the length cannot exceed the number of chars in the set. If I do length => [6,], then the constraint is automatic, and it works.

      Is that in any way constrained to having no more than two consecutive repeats?

Re: Random data generation.
by davies (Prior) on Jun 26, 2010 at 16:45 UTC
    This is actually a classic algorithm used for things like dealing a pack of cards. If you are dealing bridge hands, for example, you want 52 different cards with no repeats. Some games use multiple packs and/or cut down packs (Bézique uses a pack of 2 decks, A down to 7). You therefore set up a pack with all the possible cards, deal one at random, move the last card to the position from which the card was taken & reduce the pack by one. So, for your problem:
    use strict; use warnings; my $nRepeats = 2; my @sSet = qw (a b c d e f); my $nLength = 12; my @sAll; for (1..$nRepeats) { push (@sAll, @sSet); } if ($nLength - 1 > $#sAll) {die "Don't be silly"} my $sString; for (1..$nLength) { my $i = int(rand(@sAll)); $sString .= $sAll[$i]; $sAll[$i] = $sAll[-1]; pop(@sAll); } print "$sString \n";
    Now, I'm a beginner. So if anyone sees anything that could be improved (apart from Hungarian notation - it helps me conceptualise strong typing), I'd love to know.

    Regards,

    John Davies

    Update: Re-reading your question, if AABAABAABAAB is valid, my code won't do what you want, but I'm not clear on whether that's what you mean.
Re: Random data generation.
by Limbic~Region (Chancellor) on Jun 27, 2010 at 16:30 UTC
    BrowserUk,
    I haven't read any of the other solutions but this is the first idea that popped into my head. I haven't really tested it nor is it without at least 1 assumption.
    #!/usr/bin/perl use strict; use warnings; my @M = 'a' .. 'z'; my $N = 42; # Assumes N is > 2 my ($minus_1, $minus_2) = ($M[rand @M], $M[rand @M]); my $str = $minus_1 . $minus_2; for (1 .. $N - 2) { my $idx = rand @M; $idx = ($idx + (int(rand $#M) + 1)) % @M if $minus_1 eq $minus_2 & +& $M[$idx] eq $minus_2; $str .= $M[$idx]; ($minus_1, $minus_2) = ($minus_2, $M[$idx]); } print "$str\n";
    Update: Now after reading the solutions, I realize mine is almost identical to ikegami's. I chose to use scalars to hold the value of the last two chars instead of the regex, but very similar indeed.

    Update 2: I cleaned up the code a bit though the algorithm stayed the same. I also fixed the bugs as shown below.

    Cheers - L~R

      Unless I've broken something by subroutinising it, there is a flaw in your algorithm:

      #!perl -slw use strict; use List::Util qw/ shuffle /; sub x { my( $N, @M ) = @_; my ($minus_1, $minus_2, $str) = (('') x 3); $minus_1 = $M[rand @M]; $minus_2 = $M[rand @M]; $str = $minus_1 . $minus_2; for (1 .. $N - 2) { my $next_char; my $idx = int(rand @M); if ($minus_1 eq $minus_2) { $idx = ($idx + int(rand @M) + 1) % @M; } $str .= $next_char = $M[$idx]; ($minus_1, $minus_2) = ($minus_2, $next_char); } return $str; } for( 1 .. 10 ) { my $x = x( 12, qw[ A B C D E F ] ); print $x, $x =~ m[(.)\1\1] ? ' Fail' : ' Ok'; } __END__ [17:39:11.61] c:\test>junk3 BAADDBBCADEF Ok ADAEFADBCEEA Ok CECBAABFCEFE Ok CDDCFDAABEBF Ok ADFBFCEADFFF Fail DAABAAFDCCCE Fail ABAADCBDFBBB Fail FDDDBEEADBCC Fail CEDCDDFCEEFA Ok FCBECCCEAFEB Fail
        BrowserUk,
        Yes, there were at least 2 bugs. Try this instead.
        #!perl -slw use strict; use List::Util qw/ shuffle /; sub x { my( $N, @M ) = @_; my ($minus_1, $minus_2) = ($M[rand @M], $M[rand @M]); my $str = $minus_1 . $minus_2; for (1 .. $N - 2) { my $idx = rand @M; $idx = ($idx + (int(rand $#M) + 1)) % @M if $minus_1 eq $minus +_2 && $M[$idx] eq $minus_2; $str .= $M[$idx]; ($minus_1, $minus_2) = ($minus_2, $M[$idx]); } return $str; } for( 1 .. 100 ) { my $x = x( 12, qw[ A B C D E F ] ); print $x, $x =~ m[(.)\1\1] ? ' Fail' : ' Ok'; }

        Cheers - L~R

Re: Random data generation.
by Khen1950fx (Canon) on Jun 27, 2010 at 12:03 UTC
    I tried a variation on a card shuffle.
    #!/usr/bin/perl use strict; use warnings; my @random = (qw/a b c d e f/) x 2; foreach my $i (reverse 0..$#random) { my $r = int rand ($i+1); @random[$i, $r] = @random[$r, $i] unless ($i == $r); } print @random, "\n";

      This suffers the same limitation as jwkrahn (that initially had me fooled), in that:

      1. it can only deal with strings up to twice the length of the input set;

        Whilst I gave 2 & 12 as an examples of M & N, I thought it was clear that other values were possible.

      2. it will only generate a small fraction of the possible legal values;

        Even at length 12, it won't generate abbabbabbabb.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Random data generation.
by Sewi (Friar) on Jun 28, 2010 at 08:10 UTC
    What's about
    my @chars = (A..F); my @strings; while ($#strings < 5) { my $random = join('',map { $chars[int(rand(scalar(@chars)))]; } (1..1 +2)); $random =~ /(.)\1{2}/ and next; push @strings, $random; }

      Generating the entire string before you check becomes highly inefficient for long strings, because the likelihood to have a violation approaches 1.0 as strings get longer, so you'll throw away almost every sample you generated...

Re: Random data generation.
by davies (Prior) on Jun 29, 2010 at 12:18 UTC
    Another go, seeing that my previous attempt didn't do what you wanted. I don't see anyone else trying this algorithm, but it may have been done by someone who can write better Perl than I can.
    use strict; use warnings; my $nRepeats = 2; my @sSet = qw (a b c d e f); my $nLength = 12; my %sSets; for my $nKey (0..((@sSet) ** $nRepeats - 1)) { my $sKey = ""; for (1..$nRepeats) { $sKey .= $sSet[($nKey / (($#sSet + 1) ** ($_ - 1))) % ($#sSet ++ 1)]; } $sSets{$sKey} = [@sSet]; } for (0..$#sSet) { splice(@{$sSets{$sSet[$_] x $nRepeats}}, $_, 1); } my $sString; my @sChr; for (1..$nRepeats) { my $sRnd = $sSet[int(rand(@sSet))]; $sString .= $sRnd; push(@sChr, $sRnd); } for ($nRepeats + 1 .. $nLength) { my $sKey; for (@sChr) { $sKey .= $_; } my $sRnd = @{$sSets{$sKey}}[int(rand(@{$sSets{$sKey}}))]; $sString .= $sRnd; push(@sChr, $sRnd); shift(@sChr); } print "$sString \n";
    The algorithm was developed on one of the hot nights when you couldn't sleep, either. The Perl has taken me 2 days to write with all the other distractions life affords. I learned some useful stuff writing it (thanks, Tye, for helping me in CB), so I'm putting it up even though you have your solution!

    Regards,

    John Davies

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://846630]
Approved by ikegami
Front-paged by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (3)
As of 2024-04-19 21:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found