<?xml version="1.0" encoding="windows-1252"?>
<node id="1009816" title="Re: How to generate random sequence of UTF-8 characters" created="2012-12-20 18:47:43" updated="2012-12-20 18:47:43">
<type id="11">
note</type>
<author id="925765">
ted.byers</author>
<data>
<field name="doctext">
&lt;p&gt;OK.  Pulling together the ideas common to the responses so far, and combining them with the construction of a random string, I tried the following experiment.&lt;/p&gt;
&lt;code&gt;
use strict;
use warnings;
use Math::Random::MT::Auto::Range;
binmode(STDOUT, ":utf8");

$| = 1;
#the following exhausts all memory
#my $all = join q[], grep /^\w$/, map chr, shuffle 0 .. 0x10FFFF;
#print $all,"\n";
#
# so I tried:
my @chars = map(chr, 0 .. 0x10FFFF);

my $rng = Math::Random::MT::Auto::Range-&gt;new(LO =&gt; 0, HI =&gt; 0x10FFFF,
                                                     TYPE =&gt; 'INTEGER');
# three ways of doing basically the same thing
for my $i (0 .. 9) {
  my $code = $rng-&gt;rrand;
  print "$i &lt;=&gt; $code &lt;=&gt;",chr($code),"\n";
  print "\t$i &lt;=&gt; $code &lt;=&gt;",$chars[$code],"\n";
  print "\t$i &lt;=&gt; $code &lt;=&gt;",pack("U",$code),"\n";
}
&lt;/code&gt;&lt;p&gt;There are a couple of problems with this.  The first is that chr generates errors and warnings.  There are a couple of thousand instances of "UTF-16 surrogate 0xd81b at c:/Work/test.utf8.latin1.pl line 16.", and over a hundred instances of "Unicode non-character 0xfdd0 is illegal for interchange at c:/Work/test.utf8.latin1.pl line 16.", each for a different integer.  Line 16 is the line where @chars is initialized.  I suppose a couple of thousand problem characters in @chars is not a huge issue, but I want to use it to test functions (from the Perl package 'Encode') for converting from UTF-8 to latin1 and back, and I expect that if any of those characters happen to occur in my sample, they would produce false indications of errors in these functions.&lt;/p&gt;&lt;p&gt;The second problem is that although I put the statement "binmode(STDOUT, ":utf8");" at the start of the script, the printout contains only rectangles and squares where the UTF-8 characters ought to be, at least when I execute it within Emacs.  When I execute the script in the Windows command-line terminal, I invariably get gibberish (different for each character) four characters wide.  How, then, do I actually see the characters?  I thought, probably mistakenly, that I'd see a few Greek or Sanskrit characters, or characters from other alphabets.&lt;/p&gt;&lt;p&gt;What I was thinking of doing is amending the loop shown above to construct a single string from the ten characters produced, and then using the functions from 'Encode' to convert it to latin1 and back to UTF-8, to see whether the input string and the output string are the same (and if not, then this whole idea is doomed to fail).  
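Roughly, and only as an untested sketch of that round trip (round_trips is a helper name I made up for illustration, and the three code points are arbitrary stand-ins for the random ones):
&lt;code&gt;
use strict;
use warnings;
use Encode qw(encode decode);

# round_trips() is a hypothetical helper, not part of Encode.
# It returns true if $str comes back unchanged after being
# encoded to latin1 bytes and decoded to characters again.
sub round_trips {
    my ($str) = @_;
    # With no CHECK argument, encode() substitutes '?' for any
    # character that latin1 cannot represent.
    my $bytes = encode('iso-8859-1', $str);
    my $back  = decode('iso-8859-1', $bytes);
    return $back eq $str;
}

# Stand-in sample: 'A', e-acute and Greek alpha.
my $sample = join q[], map(chr, 0x41, 0xE9, 0x3B1);
print round_trips($sample) ? "same\n" : "different\n";
&lt;/code&gt;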
I'd repeat this test for a few million random UTF-8 strings, and if there are no failures, I could use this idea as a temporary measure until I can test our systems to see how best to adapt them to use UTF-8 throughout.&lt;/p&gt;&lt;p&gt;What, then, do I do to exclude the integers that result in either a UTF-16 surrogate or an illegal character?  And is it possible for me to actually see the characters produced?&lt;/p&gt;&lt;p&gt;Thanks&lt;/p&gt;&lt;p&gt;Ted&lt;/p&gt;</field>
<field name="root_node">
1009778</field>
<field name="parent_node">
1009778</field>
</data>
</node>
