PerlMonks  

Re^3: Rosetta Code: Long List is Long (Test File Generators)

by eyepopslikeamosquito (Archbishop)
on Dec 09, 2022 at 09:09 UTC


in reply to Re^2: Rosetta Code: Long List is Long
in thread Rosetta Code: Long List is Long

Thanks anonymonk. Excellent work!

Though I've never used them, I've heard good things about Judy Arrays and maintain a list of references on them at PM. Might get around to actually using them one day. :)

What if "words" are significantly longer? With approx. 10e6 unique words in this test, if they were each hundreds of bytes, then several GB of RAM would be used just to keep them. Perhaps impractical.

Good question! Apologies, my initial test file generator was very primitive. To try to help answer your question I've quickly whipped up a test file generator that generates much longer keys (up to around 200 characters in length) and longer counts too. I was conservative with the counts because I didn't want to disqualify folks using 32-bit ints.

# gen-long-llil.pl
# Crude program to generate a LLiL test file with long names and counts
#   perl gen-long-llil.pl long1.txt 600
use strict;
use warnings;
use autodie;

{
   my $ordmin = ord('a');
   my $ordmax = ord('z') + 1;
   # Generate a random word
   sub gen_random_word {
      my $word  = shift;   # word prefix
      my $nchar = shift;   # the number of random chars to append
      for my $i (1 .. $nchar) {
         $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
      }
      return $word;
   }
}

my $longworda = join '', 'a' .. 'z';
my $longwordz = join '', reverse('a' .. 'z');
my $longcount = 1_000_000;

sub create_long_test_file {
   my $fname   = shift;
   my $howmany = shift;
   open( my $fh_out, '>', $fname );

   # Some with no randomness
   for my $h ( 1 .. $howmany ) {
      for my $i ( 1 .. 8 ) {
         my $cnt   = $longcount + $i - 1;
         my $worda = $longworda x $i;
         my $wordz = $longwordz x $i;
         print {$fh_out} "$worda\t$cnt\n$wordz\t$cnt\n";
      }
   }

   # Some with randomness
   my $wordlen = 1;
   for my $h ( 1 .. $howmany ) {
      for my $i ( 1 .. 8 ) {
         my $cnt   = $longcount + $i - 1;
         my $worda = $longworda x $i;
         my $wordz = $longwordz x $i;
         for my $c ( 'a' .. 'z' ) {
            for my $z ( 1 .. 2 ) {
               print {$fh_out} $worda . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
               print {$fh_out} $wordz . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
            }
         }
      }
   }
}

my $outfile = shift;
my $count   = shift;
$outfile or die "usage: $0 outfile count\n";
$count   or die "usage: $0 outfile count\n";
$count =~ /^\d+$/ or die "error: count '$count' is not a number\n";
print "generating short long test file '$outfile' with count '$count'\n";
create_long_test_file( $outfile, $count );
print "file size=", -s $outfile, "\n";

I ran it like this:

> perl gen-long-llil.pl long1.txt 600
generating short long test file 'long1.txt' with count '600'
file size=65616000

> perl gen-long-llil.pl long2.txt 600
generating short long test file 'long2.txt' with count '600'
file size=65616000

> perl gen-long-llil.pl long3.txt 600
generating short long test file 'long3.txt' with count '600'
file size=65616000

Then I reran my two biggish benchmarks with a mixture of files:

> perl llil2d.pl big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >perl2.tmp
llil2d start
get_properties : 11 secs
sort + output  : 23 secs
total          : 34 secs

> llil2a big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >cpp2.tmp
llil2 start
get_properties : 6 secs
sort + output  : 5 secs
total          : 11 secs

> diff cpp2.tmp perl2.tmp

Improved test file generators welcome.

Updated Test File Generators

These were updated to allow a "\n" line ending (rather than "\r\n") on Windows after this was pointed out here. Curiously, \n seems to be slower than \r\n on Windows if you don't set binmode! I am guessing that chomp is slower with \n than with \r\n on a Windows text stream.
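As a minimal sketch (not part of the original generators) of the binmode mechanism the updated scripts rely on: on Windows, a filehandle opened in text mode translates each "\n" to "\r\n" on output, and binmode strips that translation layer so the output bytes match Unix. The file name here is just an illustrative choice.

```perl
use strict;
use warnings;

# Hypothetical demo file name; any writable path works.
my $fname = 'out.txt';
open( my $fh, '>', $fname ) or die "open '$fname': $!";
binmode($fh);              # raw bytes: "\n" is written as a single LF, even on Windows
print {$fh} "word\t1\n";   # 7 bytes with binmode; would be 8 ("word\t1\r\n") in Windows text mode
close($fh);
print "file size=", -s $fname, "\n";
```

Without the binmode call, the same script reports a larger file size on Windows and an identical one on Unix, which is exactly the discrepancy the $fbin flag in the generators below papers over.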

gen-llil.pl

# gen-llil.pl
# Crude program to generate a big LLiL test file to use in benchmarks
# On Windows running:
#   perl gen-llil.pl big2.txt 200 3    - produces a test file with size = 35,152,000 bytes
#                                        (lines terminated with "\r\n")
#   perl gen-llil.pl big2.txt 200 3 1  - produces a test file with size = 31,636,800 bytes
#                                        (lines terminated with "\n")
# On Unix, lines are terminated with "\n" and the file size is always 31,636,800 bytes
use strict;
use warnings;
use autodie;

{
   my $ordmin = ord('a');
   my $ordmax = ord('z') + 1;
   # Generate a random word
   sub gen_random_word {
      my $word  = shift;   # word prefix
      my $nchar = shift;   # the number of random chars to append
      for my $i (1 .. $nchar) {
         $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
      }
      return $word;
   }
}

sub create_test_file {
   my $fname   = shift;
   my $count   = shift;
   my $wordlen = shift;
   my $fbin    = shift;
   open( my $fh_out, '>', $fname );
   $fbin and binmode($fh_out);
   for my $c ( 'aaa' .. 'zzz' ) {
      for my $i (1 .. $count) {
         print {$fh_out} gen_random_word( $c, $wordlen ) . "\t" . 1 . "\n";
      }
   }
}

my $outfile = shift;
my $count   = shift;
my $wordlen = shift;
my $fbin    = shift;   # default is to use text stream (not a binary stream)
defined($fbin) or $fbin = 0;
$outfile or die "usage: $0 outfile count wordlen\n";
$count   or die "usage: $0 outfile count wordlen\n";
print "generating test file '$outfile' with count '$count' (binmode=$fbin)\n";
create_test_file($outfile, $count, $wordlen, $fbin);
print "file size=", -s $outfile, "\n";

gen-long-llil.pl

# gen-long-llil.pl
# Crude program to generate a LLiL test file with long names and counts
#   perl gen-long-llil.pl long1.txt 600
# On Windows running:
#   perl gen-long-llil.pl long1.txt 600    - produces a test file with size = 65,616,000 bytes
#                                            (lines terminated with "\r\n")
#   perl gen-long-llil.pl long1.txt 600 1  - produces a test file with size = 65,107,200 bytes
#                                            (lines terminated with "\n")
# On Unix, lines are terminated with "\n" and the file size is always 65,107,200 bytes
use strict;
use warnings;
use autodie;

{
   my $ordmin = ord('a');
   my $ordmax = ord('z') + 1;
   # Generate a random word
   sub gen_random_word {
      my $word  = shift;   # word prefix
      my $nchar = shift;   # the number of random chars to append
      for my $i (1 .. $nchar) {
         $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
      }
      return $word;
   }
}

my $longworda = join '', 'a' .. 'z';
my $longwordz = join '', reverse('a' .. 'z');
my $longcount = 1_000_000;

sub create_long_test_file {
   my $fname   = shift;
   my $howmany = shift;
   my $fbin    = shift;
   open( my $fh_out, '>', $fname );
   $fbin and binmode($fh_out);

   # Some with no randomness
   for my $h ( 1 .. $howmany ) {
      for my $i ( 1 .. 8 ) {
         my $cnt   = $longcount + $i - 1;
         my $worda = $longworda x $i;
         my $wordz = $longwordz x $i;
         print {$fh_out} "$worda\t$cnt\n$wordz\t$cnt\n";
      }
   }

   # Some with randomness
   my $wordlen = 1;
   for my $h ( 1 .. $howmany ) {
      for my $i ( 1 .. 8 ) {
         my $cnt   = $longcount + $i - 1;
         my $worda = $longworda x $i;
         my $wordz = $longwordz x $i;
         for my $c ( 'a' .. 'z' ) {
            for my $z ( 1 .. 2 ) {
               print {$fh_out} $worda . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
               print {$fh_out} $wordz . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
            }
         }
      }
   }
}

my $outfile = shift;
my $count   = shift;
my $fbin    = shift;   # default is to use text stream (not a binary stream)
defined($fbin) or $fbin = 0;
$outfile or die "usage: $0 outfile count\n";
$count   or die "usage: $0 outfile count\n";
$count =~ /^\d+$/ or die "error: count '$count' is not a number\n";
print "generating short long test file '$outfile' with count '$count' (binmode=$fbin)\n";
create_long_test_file( $outfile, $count, $fbin );
print "file size=", -s $outfile, "\n";

Updated this node with new test file generators so you can generate test files that are the same size on Unix and Windows. That is, by setting $fbin you can make the line ending "\n" on Windows, instead of "\r\n". See Re^2: Rosetta Code: Long List is Long (faster) for more background.

Replies are listed 'Best First'.
Re^4: Rosetta Code: Long List is Long
by Anonymous Monk on Dec 09, 2022 at 10:14 UTC

    Thanks for paying attention to my doubts; perhaps I wasn't very clear. What I meant was the total length of the unique words, i.e. the hash keys. That would be roughly equal to the size of the output file, which is almost the same for both the original test and the parent node. I don't think it's worth the effort to create a simulation with a few-GB output file.
