Re: improving the efficiency of a script
by Zaxo (Archbishop) on Jun 18, 2006 at 16:48 UTC
If this is going to be used a lot, I'd either stuff the dictionary file into a database, or else construct an index to the offsets and sizes of initial letter sections of the dictionary file.
Assuming the dictionary file is alphabetically sorted, you don't need to slurp the whole file into an array. That is a large chunk of memory for a million words. Allocations that size will slow you painfully if you are driven into swap.
Try just building an array with the a's, shuffling, and taking the first hundred elements. Then discard the a's and replace with the b's, all in a while loop that only reads one line at a time.
You don't need a loop to pick the first hundred elements of an array. A slice will do,
@array[0..99]
and is much faster.
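Zaxo's one-letter-at-a-time idea might look something like this rough sketch (untested; the file name `words.txt` and the use of List::Util's shuffle are my assumptions, not part of the original suggestion):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Take up to 100 elements with a slice, not a loop.
sub pick_hundred {
    my ($words) = @_;
    my @shuffled = shuffle @$words;
    my $last = $#shuffled < 99 ? $#shuffled : 99;
    return @shuffled[0 .. $last];
}

open my $dict, '<', 'words.txt' or die "words.txt: $!";

# Assumes the dictionary is sorted, so each letter's words are
# contiguous; only one letter's worth is ever held in memory.
my (@bucket, $current);
$current = '';
while (my $word = <$dict>) {
    chomp $word;
    my $letter = lc substr $word, 0, 1;
    if ($letter ne $current) {
        print "$_\n" for pick_hundred(\@bucket) if @bucket;
        @bucket  = ();
        $current = $letter;
    }
    push @bucket, $word;
}
print "$_\n" for pick_hundred(\@bucket) if @bucket;
```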
Zaxo,
This is very similar to the idea I had. I found that building the offsets with DBM::Deep was extremely slow on the first run, but subsequent runs were fast. This approach also has the advantage of not requiring the dictionary file to be sorted.
#!/usr/bin/perl
use strict;
use warnings;

use DBM::Deep;

open(my $dict, '<', 'words.raw')
    or die "Unable to open 'words.raw' for reading: $!";
my $db = DBM::Deep->new("offsets.db");
build_db($db, $dict) if ! scalar keys %$db;

for my $char ('a' .. 'z') {
    for (1 .. 100) {
        print get_rand_word($db, $char, $dict);
    }
}

sub build_db {
    my ($db, $dict) = @_;
    my $pos = tell $dict;
    while ( <$dict> ) {
        my $char = substr($_, 0, 1);
        push @{$db->{$char}}, $pos;
        $pos = tell $dict;
    }
}

sub get_rand_word {
    my ($db, $char, $dict) = @_;
    my $offset = $db->{$char}[rand @{$db->{$char}}];
    seek $dict, $offset, 0;
    my $word = <$dict>;
    return $word;
}
Other options include Storable and DBD::SQLite if a real RDBMS isn't available.
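For comparison, here is a minimal sketch of the Storable variant: the same per-letter offset index, but cached in a plain Storable file rather than DBM::Deep. The file names and sub names are my assumptions, not tested production code:

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Build (or reload) a hash mapping first letter => arrayref of byte
# offsets into the dictionary file. The cache file makes the first
# run slow and every later run fast, like the DBM::Deep version.
sub load_offsets {
    my ($dict_path, $cache_path) = @_;
    return retrieve($cache_path) if -e $cache_path;   # fast path
    open my $dict, '<', $dict_path or die "$dict_path: $!";
    my %db;
    my $pos = tell $dict;
    while (<$dict>) {
        push @{ $db{ lc substr $_, 0, 1 } }, $pos;
        $pos = tell $dict;
    }
    store(\%db, $cache_path);                         # one slow first run
    return \%db;
}

# Seek to a random stored offset and read the word there.
sub random_word {
    my ($db, $char, $dict) = @_;
    my $offsets = $db->{$char} or return;
    seek $dict, $offsets->[ rand @$offsets ], 0;
    return scalar <$dict>;
}
```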
Re: improving the efficiency of a script
by lima1 (Curate) on Jun 18, 2006 at 16:32 UTC
|
#!/usr/bin/perl -w
# Strict
use strict;
use warnings;
# User-defined
my $wordfile = "words.txt";
# Libraries
use FileHandle;
use File::Basename;
use Data::Dumper;
# Main program
my $iam = basename $0;
# Read words, saving file offsets
my $fh = new FileHandle;
open($fh, '<', $wordfile) or die "$iam: can't read $wordfile ($!)\n";
my ($word, %word_offsets_by_letter);
while (1) {
    my $offset = tell($fh);
    defined($word = <$fh>) or last;
    chomp $word;
    if ($word =~ /^(.)/) {
        my $first_letter = lc $1;
        $word_offsets_by_letter{$first_letter} ||= [ ];
        push @{$word_offsets_by_letter{$first_letter}}, $offset;
    }
}
# Test (this will give 100 random words beginning with 'a')
foreach (1..100) {
    my $next = random_word_starting_with("a", $fh);
    print "Next word => $next\n";
}
# Subroutines
#
# random_word_starting_with
#
# In: $1 ... the first letter of the word (eg. 'a', 'b', 'c', etc.)
#     $2 ... the open filehandle of the word file
#
sub random_word_starting_with {
    my ($first_letter, $fh) = @_;
    my $p = $word_offsets_by_letter{lc $first_letter};
    my $offset = $p->[int rand @$p];
    seek($fh, $offset, 0) or die "$iam: failed to seek ($!)\n";
    my $word = <$fh>;
    chomp $word;
    return $word;
}
And if this is something you need to do a lot of, I like Zaxo's suggestion of using a database.
s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
As for binary searching for first letters in a sorted dictionary, I've done that in my submission to the word ladder problem, which was the problem in perl quiz of the week expert edition, week 22.
However, I don't quite see how that would help in this word selection problem. Unless you somehow preprocess the dictionary, you cannot select a random word in constant time even if you know the offsets of the first letters: just choosing a random offset favors long words. So, IMO, you have to read all or most of the dictionary file for this problem unless it's preprocessed in some way. Random access to the file may still help slightly after reading it completely, but not very much unless there are very long words in the "dictionary".
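For what it's worth, the length bias of choosing a random byte offset can be cancelled by rejection sampling. This is my own sketch, not something from the quiz code: a random byte lands in a line with probability proportional to the line's length, so accepting the hit with probability min_length/length makes every line equally likely. It still costs an unbounded (though geometrically distributed) number of seeks per word, which supports the point that preprocessing is the better route:

```perl
use strict;
use warnings;

# Return the full line containing byte position $pos, by walking
# backwards to the newline that precedes it.
sub line_at {
    my ($fh, $pos) = @_;
    while ($pos > 0) {
        seek $fh, $pos - 1, 0;
        read $fh, my $ch, 1;
        last if $ch eq "\n";
        $pos--;
    }
    seek $fh, $pos, 0;
    return scalar <$fh>;
}

# Uniform random line: a random byte picks a line with probability
# length/size; accepting with probability $min_len/length flattens
# that to $min_len/size for every line. $min_len must not exceed the
# shortest line length (including its newline).
sub random_line_unbiased {
    my ($fh, $size, $min_len) = @_;
    while (1) {
        my $line = line_at($fh, int rand $size);
        return $line if rand() < $min_len / length $line;
    }
}
```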
Re: improving the efficiency of a script (random sample)
by ambrus (Abbot) on Jun 18, 2006 at 20:46 UTC
I show a one-pass solution to this problem using a combinatorial algorithm. Here one-pass means that you need only O(gm) memory if you want to print g words and the maximal word length is m; you have to read the file only once, and you don't even need to know the number of words in the dictionary in advance. Apart from this, I put no emphasis on keeping the computation time low. That could also easily be done (while still keeping the previous efficiency conditions true). For that, see algorithm R in chapter 3.4.2 of Knuth, but I leave the implementation as an exercise to the reader.
You didn't say if there's any requirement on the order of the words printed, so I assume it can be anything (whatever is simplest to implement). I'll also assume that if there's fewer than 100 words starting with a certain letter, we have to print all of them. And naturally assume the usual disclaimer for the code: I put this together fast and it may have errors.
As a simpler example, I first show how to just select 100 words uniformly randomly from a dictionary, independently of first letters.
use warnings;
use strict;
my $g = 100;
my @c;
my $n = 0;
while (<>) {
    if (rand() < $g / ++$n) {
        splice @c, int(rand(@c)), $g <= @c, $_;
    }
}
print for @c;
Now doing this for every letter we get this:
use warnings;
use strict;
my $g = 100;
my %c;
my %n;
while (<>) {
    my $l = /(.)/ && lc($1);
    my $c = \@{$c{$l}};
    if (rand() < $g / ++$n{$l}) {
        splice @$c, int(rand(@$c)), $g <= @$c, $_;
    }
}
print @$_ for values(%c);
Update. Another one-pass solution would be to use heaps. You create a heap for each letter, add words as you read them to the corresponding heap using a random number as priority, and pop an element whenever a heap grows larger than 100. I guess this would be less CPU-efficient than the above-mentioned algorithm in Knuth, if that is well implemented.
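The heap idea can be sketched like this. This is my own rough sketch, not tested against a real dictionary, and a plain array with a linear max-scan stands in for a real heap, so eviction is O(k) rather than O(log k); the words kept are the k entries with the smallest random priorities, which is a uniform sample:

```perl
use strict;
use warnings;

# Read words from $fh and keep at most $k per first letter,
# selected uniformly at random via random priorities.
sub sample_per_letter {
    my ($fh, $k) = @_;
    my %pool;                        # letter => [ [priority, word], ... ]
    while (my $word = <$fh>) {
        my ($l) = $word =~ /^(.)/ or next;
        my $entries = $pool{ lc $l } ||= [];
        push @$entries, [ rand(), $word ];
        if (@$entries > $k) {
            # evict the entry with the largest priority
            my $max = 0;
            $entries->[$_][0] > $entries->[$max][0] and $max = $_
                for 1 .. $#$entries;
            splice @$entries, $max, 1;
        }
    }
    return { map { $_ => [ map $_->[1], @{ $pool{$_} } ] } keys %pool };
}
```

A real binary heap (or the splice-based reservoir above) would avoid the linear scan, but the sampling logic is the same.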
Update 2008 oct 9: see also Randomly select N lines from a file, on the fly.
Update 2009-12-26: see also Random sampling a variable record-length file. which – by the time you look there – should have some good solutions as well.
Re: improving the efficiency of a script
by McDarren (Abbot) on Jun 18, 2006 at 16:48 UTC
My dictionary (/usr/share/dict/words) only has 96274 words :)
But anyway, here is my go at this - I'm not sure how efficient (or otherwise) it is - but it seems to work :)
Update 1: blah - I just realised that it doesn't work at all - the shuffle isn't doing anything. I'll update it again later when I get it working.
Update 2: - okay, working now :) (I think)
#!/usr/bin/perl -w
use strict;
use List::Util qw(shuffle);

my $dict = '/usr/share/dict/words';
open DICT, "<", $dict or die "Cannot open $dict:$!\n";

my %words;
WORD:
while (my $word = <DICT>) {
    chomp $word;
    my ($letter) = $word =~ /^(\w)/;
    next WORD if !$letter;
    push @{$words{$letter}}, $word;
}
close DICT;

foreach my $letter (sort keys %words) {
    @{$words{$letter}} = shuffle(@{$words{$letter}});
    # Indices must be 0 .. 99, not 1 .. 100, or the first shuffled
    # word is skipped and an undef can be printed at the end.
    for my $number (0 .. 99) {
        print "$words{$letter}[$number]\n";
    }
}
Cheers,
Darren :)
I'm not sure how efficient (or otherwise) it is
Much better than the OP's solution ;)
But the time and space requirements for the data preparation are still O(n), much more than the O(m * log(n)) time and O(m) space, respectively, of a binary-search approach (once again, for sorted dictionaries only).
So if the dictionary is unsorted, the memory requirements are not a problem, and this task is not repeated often, yours is the best solution IMHO.
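The binary search alluded to here, over a sorted dictionary file, can be sketched as follows. This is my own sketch of the standard bisect-a-sorted-file technique, not lima1's code: it finds the byte offset of the first line starting with a given letter in O(log n) seeks, without reading the whole file.

```perl
use strict;
use warnings;

# Binary search on byte positions in a sorted file. At each probe we
# seek to the midpoint, discard the (possibly partial) line we landed
# in, and compare the next full line. Returns the offset of the first
# line whose first character is >= $letter ($size = file size in bytes).
sub first_offset_for {
    my ($fh, $size, $letter) = @_;
    my ($lo, $hi) = (0, $size);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid, 0;
        <$fh> if $mid > 0;            # skip the partial line
        my $line = <$fh>;
        if (defined $line and lc(substr $line, 0, 1) lt $letter) {
            $lo = $mid + 1;
        }
        else {
            $hi = $mid;
        }
    }
    seek $fh, $lo, 0;
    <$fh> if $lo;                     # realign to a line start
    return tell $fh;
}
```

Two such searches (for the letter and its successor) bound one letter's section; a random offset from a precomputed index is then O(1) per word, which is where the preprocessing pays off.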
Re: improving the efficiency of a script
by sh1tn (Priest) on Jun 18, 2006 at 16:48 UTC
my @words = <DICT>;
my %data;
/^(\w)/ and push @{$data{$1}}, $_ for @words;
for my $letter (sort keys %data){
    for my $number (1..100){
        my $rand = int rand @{$data{$letter}};
        print "Letter: $letter\tnumber: $number\trand word: ",
              $data{$letter}[$rand];
    }
}
Re: improving the efficiency of a script
by TedPride (Priest) on Jun 19, 2006 at 04:42 UTC
This really takes two passes unless you want to load the whole thing into memory: the first pass counts how many words there are for each letter, and the second pass retrieves the chosen words from the file.
use strict;
use warnings;

my ($words, $file, $handle, %range, %c, %n, $c, @words);
$words = 100;
$file = 'dictionary.txt';
$range{$_} = () for 'a'..'z';

open ($handle, '<', $file) or die "Cannot open $file: $!";
while (<$handle>) {
    $c = substr($_, 0, 1);
    next if !exists $range{$c};
    $c{$c}++;
}
close ($handle);

# choose() aliases $_[0], replacing each per-letter count with a
# descending list of the chosen line indices for that letter.
for (values %c) {
    choose($_);
}

open ($handle, '<', $file) or die "Cannot open $file: $!";
while (<$handle>) {
    $c = substr($_, 0, 1);
    next if !exists $c{$c} || $n{$c}++ < $c{$c}[-1];
    chomp; push @words, $_;
    pop @{$c{$c}};
    delete $c{$c} if $#{$c{$c}} == -1;
}
close ($handle);

print join "\n", sort @words;

### Pick random numbers in range (partial Fisher-Yates shuffle)
sub choose {
    my @c = 0..($_[0]-1);
    for (0..($words-1)) {
        swap(\@c, $_, rand ($_[0] - $_) + $_);
    }
    $_[0] = [sort {$b <=> $a} @c[0..($words-1)]];
}

### Swap two array items
sub swap {
    my ($r, $x, $y, $t) = @_;
    $t = $r->[$x];
    $r->[$x] = $r->[$y];
    $r->[$y] = $t;
}
It doesn't necessarily require two passes. It seems reasonable to assume that a dictionary file is sorted case-insensitively (and it is simple to make it so if it isn't already). Then you can read the file, building up a list of words beginning with the letter 'a'. As soon as you come across the first word beginning with 'b', make your random selection of 100 words beginning with 'a', then discard the word list and re-initialise it with the 'b' word just read. Repeat until you've gone through the alphabet.
That way the job is done with a single pass but you don't have to keep the entire dictionary in memory, just one letter's worth.
use strict;
use warnings;

my $howMany       = shift || 100;
my $done          = 0;
my $rxValidWord   = qr{^([A-Za-z])[-a-z]+};
my $totalWordCt   = 0;
my $validWordCt   = 0;
my %letterCts     = ();
my @letters       = (q{a} .. q{z});
my $currentLetter = q{};
my $rxCurrent     = getNextLetter();
my @wordCache     = ();
my $dictFile      = q{../Web2};

open my $dictFH, q{<}, $dictFile
    or die qq{open: $dictFile: $!\n};

while(<$dictFH>)
{
    $totalWordCt ++;
    next if $done;
    next unless /$rxValidWord/;
    my $initLetter = $1;
    $validWordCt ++;
    chomp;
    if($initLetter =~ $rxCurrent)
    {
        push @wordCache, $_;
    }
    else
    {
        reportForLetter();
    }
}
reportForLetter() if $currentLetter;

close $dictFH
    or die qq{close: $dictFile: $!\n};

print
    qq{\n},
    qq{Total words in dictionary - $totalWordCt\n},
    qq{Valid words processed - $validWordCt\n};

sub generateSlice
{
    my ($total, $howMany) = @_;
    my $rlPool = [0 .. ($total - 1)];
    return $rlPool if $total <= $howMany;
    my $rlSlice = [];
    for (1 .. $howMany)
    {
        push @$rlSlice, splice @$rlPool, int rand scalar @$rlPool, 1;
    }
    return [sort {$a <=> $b} @$rlSlice];
}

sub getNextLetter
{
    if(@letters)
    {
        $currentLetter = shift @letters;
        my $charClass =
              q{[}
            . $currentLetter
            . uc $currentLetter
            . q{]};
        return qr{$charClass};
    }
    else
    {
        $currentLetter = q{};
        $done ++;
        return 0;
    }
}

sub reportForLetter
{
    my $savedWord = $_;
    print
        qq{\n},
        qq{Found @{[scalar @wordCache]} words },
        qq{for letter $currentLetter\n};
    my $rlSlice = generateSlice(scalar @wordCache, $howMany);
    my @randomWords = @wordCache[@$rlSlice];
    print qq{$howMany words selected at random\n};
    print qq{  $_\n} for @randomWords;
    @wordCache = ($savedWord);
    $rxCurrent = getNextLetter();
}
When run to look for 5 random words it produces this
Cheers, JohnGG
Re: improving the efficiency of a script
by sulfericacid (Deacon) on Jun 21, 2006 at 03:36 UTC
I appreciate the help and insight, everyone. I have a lot of information to read through to get a better idea of how BETTER to do things of this nature.
I don't know why it didn't dawn on me that slurping a huge file (my @blah = <DICT>) would be suicide compared to reading it line by line and pulling out only the things I need.
I probably should have figured that one out, but I do appreciate all the valuable advice everyone has put here.
"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"
sulfericacid