IB2017 has asked for the wisdom of the Perl Monks concerning the following question:
Given string with text, I need to create n-grams of predefined lengths. I came up with the following. Any suggestions on how to improve it (being speed an important factor in my process?). The sentence, i.e. the array will contain typically 5-15 elements.
use strict;
use warnings;
my $sentence = "this is the text to play with";
my @string = split / /, $sentence;
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 3;
for ($ngramWindow_MIN .. $ngramWindow_MAX){
my $ngramWindow=$_;
my $sizeString = (@string) - $ngramWindow;
foreach (0 .. $sizeString){
print "START INDEX: $_ :";
print "@string[$_..($_+$ngramWindow-1)]\n";
}
}
Re: improving speed in ngrams algorithm
by holli (Abbot) on Jun 11, 2019 at 10:52 UTC
|
| [reply] [d/l] [select] |
|
| [reply] [d/l] |
|
| [reply] |
Re: improving speed in ngrams algorithm (updated)
by vr (Curate) on Jun 11, 2019 at 12:34 UTC
|
Worth a try, since regexp compiled just once...
use strict;
use warnings;
use feature 'say';
my $sentence = "this is the text to play with";
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 3;
my $word = qr/(\b\S+)(?:\s|$)/; # "word" is $1, rather
my $ngram = join '',
$word x $ngramWindow_MIN,
qr/(?:$word)?/ x ( $ngramWindow_MAX - $ngramWindow_MIN );
my $re = qr/$ngram(?{
say "@{^CAPTURE}"
# or do anything with @words i.e. @{^CAPTURE}
})(*F)/;
$sentence =~ /$re/g;
Update. I noticed that my solution uses variable introduced in 5.25.7. Then I looked up the %- hash, which, by description of it, should provide similar access and was there since 5.10. While it does indeed work, if combined in obvious way with grep to filter undefined values, it looks to me that length of array associated with named capture being not reset is a bug. The %{^CAPTURE_ALL} should be identical to %-, but looks to suffer from another bug even more. Tested in 5.28.0.
use strict;
use warnings;
use Data::Dump 'dd';
my $sentence = "this is the text to play with";
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 3;
my $word = qr/(?<word>\b\S+)(?:\s|$)/;
my $ngram = join '',
$word x $ngramWindow_MIN,
qr/(?:$word)?/ x ( $ngramWindow_MAX - $ngramWindow_MIN );
my $re = qr/$ngram(?{
dd \@{^CAPTURE};
dd $-{word};
dd ${^CAPTURE_ALL}{word};
print "\n";
})(*F)/;
$sentence =~ /$re/g;
__END__
["this", "is", "the"]
["this", "is", "the"]
"this"
["this", "is"]
["this", "is", undef]
"this"
["is", "the", "text"]
["is", "the", "text"]
"is"
["is", "the"]
["is", "the", undef]
"is"
["the", "text", "to"]
["the", "text", "to"]
"the"
["the", "text"]
["the", "text", undef]
"the"
["text", "to", "play"]
["text", "to", "play"]
"text"
["text", "to"]
["text", "to", undef]
"text"
["to", "play", "with"]
["to", "play", "with"]
"to"
["to", "play"]
["to", "play", undef]
"to"
["play", "with"]
["play", "with", undef]
"play"
Update 2. See %{^CAPTURE}, %{^CAPTURE_ALL} and %- don't produce expected output.
| [reply] [d/l] [select] |
|
Can you describe the bug using a simpler example, maybe as a new SoPW node? I'm not sure I understand - please include the expected output, too.
map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
| [reply] [d/l] |
Re: improving speed in ngrams algorithm
by Eily (Monsignor) on Jun 11, 2019 at 12:35 UTC
|
(being speed an important factor in my process?) Are you sure? Did you actually notice that the program was too slow, or do you just think being fast might be a good thing? 5-15 elements by array is very little. So unless you process several thousands of strings, micro optimizations won't have a noticeable impact. And if you process that many strings and see that the program is slow, there might be better places to improve it than this quite basic code.
Like I said, it won't change much but one thing you can do is change
for ($ngramWindow_MIN .. $ngramWindow_MAX){
my $ngramWindow=$_;
into
for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) {
| [reply] [d/l] [select] |
Re: improving speed in ngrams algorithm
by tybalt89 (Monsignor) on Jun 12, 2019 at 09:02 UTC
|
#!/usr/bin/perl
# https://perlmonks.org/?node_id=11101225
use strict;
use warnings;
my $sentence = "this is the text to play with";
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 3;
my ($low, $high) = ($ngramWindow_MIN - 1, $ngramWindow_MAX - 1);
$sentence =~ /(?<!\S)\S+(?: \S+){$low,$high}?(?!\S)(?{
print "START INDEX: @{[$` =~ tr| || ]} : $&\n"
})(*FAIL)/;
Outputs (same lines, slightly different order) :
START INDEX: 0 : this is
START INDEX: 0 : this is the
START INDEX: 1 : is the
START INDEX: 1 : is the text
START INDEX: 2 : the text
START INDEX: 2 : the text to
START INDEX: 3 : text to
START INDEX: 3 : text to play
START INDEX: 4 : to play
START INDEX: 4 : to play with
START INDEX: 5 : play with
| [reply] [d/l] [select] |
Re: improving speed in ngrams algorithm
by AnomalousMonk (Archbishop) on Jun 11, 2019 at 16:03 UTC
|
The (?= (pattern)) overlapping capture trick may be useful here.
If you need to capture info like offsets | character offsets within the sentence, maybe something like:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $sentence = 'this is the text to play with';
;;
my $ngramWindow_MIN = 1;
my $ngramWindow_MAX = 3;
;;
my @word_ngrams;
;;
for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) {
my $m = $ngramWindow - 1;
my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms;
;;
while ($sentence =~ m{ (?= ($ngram)) }xmsg) {
push @word_ngrams, [ $1, $-[1] ];
}
}
;;
for my $ar_wng (@word_ngrams) {
my ($word_ngram, $sentence_offset) = @$ar_wng;
print qq{'$word_ngram' at sentence offset $sentence_offset};
}
"
'this' at sentence offset 0
'is' at sentence offset 5
'the' at sentence offset 8
'text' at sentence offset 12
'to' at sentence offset 17
'play' at sentence offset 20
'with' at sentence offset 25
'this is' at sentence offset 0
'is the' at sentence offset 5
'the text' at sentence offset 8
'text to' at sentence offset 12
'to play' at sentence offset 17
'play with' at sentence offset 20
'this is the' at sentence offset 0
'is the text' at sentence offset 5
'the text to' at sentence offset 8
'text to play' at sentence offset 12
'to play with' at sentence offset 17
If it's all you need, it would be faster to capture "naked" n-grams:
c:\@Work\Perl\monks>perl -wMstrict -le
"my $sentence = 'this is the text to play with';
;;
my $ngramWindow_MIN = 1;
my $ngramWindow_MAX = 3;
;;
for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) {
print qq{$ngramWindow-word ngrams of '$sentence'};
my $m = $ngramWindow - 1;
my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms;
;;
my @word_ngrams = $sentence =~ m{ (?= ($ngram)) }xmsg;
;;
for my $word_ngram (@word_ngrams) {
print qq{ '$word_ngram'};
}
}
"
1-word ngrams of 'this is the text to play with'
'this'
'is'
'the'
'text'
'to'
'play'
'with'
2-word ngrams of 'this is the text to play with'
'this is'
'is the'
'the text'
'text to'
'to play'
'play with'
3-word ngrams of 'this is the text to play with'
'this is the'
'is the text'
'the text to'
'text to play'
'to play with'
Tested under Perl version 5.8.9. (I haven't done any Benchmark-ing on any of this :)
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: improving speed in ngrams algorithm
by johngg (Canon) on Jun 12, 2019 at 10:21 UTC
|
A solution using split, array slices and shift. No idea if it is fast or slow as I haven't run any benchmarks.
use 5.026;
use warnings;
my $text = q{this is the text to play with};
for ( 1 .. 8 )
{
say qq{$_-word ngrams of '$text'};
say for nGramWords( $_, $text );
say q{-} x 20;
}
sub nGramWords
{
my( $nWords, $string ) = @_;
my @words = split m{\s+}, $string;
my $start = 0;
my @nGrams;
while ( scalar @words >= $nWords )
{
push @nGrams, join q{ },
qq{START INDEX: @{ [ $start ++ ] } : },
@words[ 0 .. $nWords - 1 ];
shift @words;
}
return @nGrams;
}
The output.
1-word ngrams of 'this is the text to play with'
START INDEX: 0 : this
START INDEX: 1 : is
START INDEX: 2 : the
START INDEX: 3 : text
START INDEX: 4 : to
START INDEX: 5 : play
START INDEX: 6 : with
--------------------
2-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is
START INDEX: 1 : is the
START INDEX: 2 : the text
START INDEX: 3 : text to
START INDEX: 4 : to play
START INDEX: 5 : play with
--------------------
3-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the
START INDEX: 1 : is the text
START INDEX: 2 : the text to
START INDEX: 3 : text to play
START INDEX: 4 : to play with
--------------------
4-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the text
START INDEX: 1 : is the text to
START INDEX: 2 : the text to play
START INDEX: 3 : text to play with
--------------------
5-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the text to
START INDEX: 1 : is the text to play
START INDEX: 2 : the text to play with
--------------------
6-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the text to play
START INDEX: 1 : is the text to play with
--------------------
7-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the text to play with
--------------------
8-word ngrams of 'this is the text to play with'
--------------------
I hope this is of interest.
| [reply] [d/l] [select] |
Re: improving speed in ngrams algorithm
by karlgoethebier (Abbot) on Jun 11, 2019 at 13:49 UTC
|
See also Text::Ngrams. Regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help
| [reply] [d/l] |
Re: improving speed in ngrams algorithm (benchmark time! kindof)
by vr (Curate) on Jun 20, 2019 at 15:35 UTC
|
What if, while working environment is familiar Perl program, we use something completely different (benchmarks below, vocabulary here, or revised one there)?
]s =. 'the text to play with' NB. String
the text to play with
]L =. 2 3 4 NB. Ngram lengths
2 3 4
NB. Running a little ahead, the sentence to get us each ngram in it
+s box,
NB. with each word nicely boxed:
;L ([:<<\)"(0 _)(<;._2) s,' '
+----------+---------+---------+-----------+-------------+------------
+--+--------------+------------------+-------------------+
|+---+----+|+----+--+|+--+----+|+----+----+|+---+----+--+|+----+--+---
+-+|+--+----+----+|+---+----+--+----+|+----+--+----+----+|
||the|text|||text|to|||to|play|||play|with|||the|text|to|||text|to|pla
+y|||to|play|with|||the|text|to|play|||text|to|play|with||
|+---+----+|+----+--+|+--+----+|+----+----+|+---+----+--+|+----+--+---
+-+|+--+----+----+|+---+----+--+----+|+----+--+----+----+|
+----------+---------+---------+-----------+-------------+------------
+--+--------------+------------------+-------------------+
NB. But, this structure would mean a lot of unpleasant pointer arit
+hmetic
NB. on Perl side (indirect, indirect, offset, indirect, etc.). It s
+lows things, too.
NB. Instead, let's have best of both worlds, by strictly adhering t
+o
NB. task in OP -- let each ngram be a string, let's append ASCII-0
+to each, catenate,
NB. return pointer to Perl, and just unpack with '(Z*)*'.
NB. Little explanation (s is string, L is ngram lengths, as above):
]w =. <;.2 s,' ' NB. Words
+----+-----+---+-----+-----+
|the |text |to |play |with |
+----+-----+---+-----+-----+
NB. ;.2 - cut by last item (char), keeping it
NB. , - append, < - box
L (<@:,&(35}a.)@:}:@:;)\ w
+-----------------+------------------+-------------+----------+
|the text# |text to# |to play# |play with#|
+-----------------+------------------+-------------+----------+
|the text to# |text to play# |to play with#| |
+-----------------+------------------+-------------+----------+
|the text to play#|text to play with#| | |
+-----------------+------------------+-------------+----------+
NB. \ - apply verb to overlapping infixes of lengths L of w
NB. @: - composition, & - curry, and read right to left:
NB. ; - raze (unbox items), }: - curtail,
NB. 35}a. - take element from ASCII table,
NB. 35 just for sake of display, it will be 0 in practice
NB. Almost there
;a:-.~,L (<@:,&(35}a.)@:}:@:;)\ w
the text#text to#to play#play with#the text to#text to play#to play wi
+th#the text to play#text to play with#
NB. , - ravel (flatten shape)
NB. ~ - swap left/right operands (just easier to write)
NB. -. - exclude. What to exclude?
NB. a: is called "Ace", it's an empty box
NB. But OP wanted indexes of words at ngram starts. It's easy:
i.# w
0 1 2 3 4
NB. i. - integer list, # - number of items.
NB. Then simple, but vectorized, arithmetic.
2 > L -"0 _ i.# w
0 1 1 1 1
0 0 1 1 1
0 0 0 1 1
NB. Then, for each row of matrix above:
NB. |. - reverse, I. - indexes of true elements,
NB. >: - increment (because short lines are padded with zeroes)
(>:@:I.@:|.)"1 ]2 > L -"0 _ i.# w
1 2 3 4
1 2 3 0
1 2 0 0
<:0-.~,(>:@:I.@:|.)"1 ]2> L -"0 _ i.# w
0 1 2 3 0 1 2 0 1
NB. Above -- ravel, exclude zeroes, decrement.
NB. So, phrases to get us (catenated) ngrams and list of starting i
+ndexes:
;a:-.~,L (<@:,&(35}a.)@:}:@:;)\ w
the text#text to#to play#play with#the text to#text to play#to play wi
+th#the text to play#text to play with#
<:0-.~,(>:@:I.@:|.)"1 ]2> L -"0 _ i.# w
0 1 2 3 0 1 2 0 1
NB. What about sorting not by ngram length, but starting index?
NB. Nothing easier: transpose (|:) at appropriate moment:
;a:-.~,|:L (<@:,&(35}a.)@:}:@:;)\ w
the text#the text to#the text to play#text to#text to play#text to pla
+y with#to play#to play with#play with#
<:0-.~,|:(>:@:I.@:|.)"1 ]2> L -"0 _ i.# w
0 0 0 1 1 1 2 2 3
Perl:
use strict;
use warnings;
use utf8;
use feature 'say';
use Data::Dump 'dd';
use FFI::Raw;
use Config;
use Benchmark 'cmpthese';
die unless $Config{ ptrsize } == 8; # some numbers hardcoded below
my $str = join ' ', ('this is the text to play with') x 5;
my $ngramWindow_MIN = 2;
my $ngramWindow_MAX = 6;
my ( $low, $high ) = ( $ngramWindow_MIN - 1, $ngramWindow_MAX - 1 );
my $i_ = FFI::Raw::ulong;
my $s_ = FFI::Raw::str;
my $p_ = FFI::Raw::ptr;
my $dll = "$ENV{HOME}/j64-807/bin/j.dll"; # libj.so";
my $pJ = FFI::Raw-> new( $dll, 'JInit', $i_ )-> call or die;
my $cmd = 'ngrams =: ;a:-.~,L (<@:,&(0}a.)@:}:@:;)\ w '.
'[indxs =: <:0-.~,(>:@:I.@:|.)"1 ]2> L -"0 _ i.# w '.
'[w =: <;.2 s '.
"[L =: @{[ $ngramWindow_MIN .. $ngramWindow_MAX ]} ".
"[s =: '$str',' '";
my $JDo = FFI::Raw-> new( $dll, 'JDo', $i_, $i_, $s_ )-> coderef;
my $JGetM = FFI::Raw-> new( $dll, 'JGetM', $i_, $i_, $s_, ( $p_ ) x 4
+)-> coderef;
my $type = FFI::Raw::memptr( 8 );
my $rank = FFI::Raw::memptr( 8 );
my $shape = FFI::Raw::memptr( 8 );
my $data = FFI::Raw::memptr( 8 );
# Demo
dd j();
# Benchmark
cmpthese -2, {
j => \&j,
AnomalousMonk => \&AnomalousMonk,
johngg => \&johngg,
tybalt89 => \&tybalt89,
};
sub j {
$JDo-> ( $pJ, $cmd );
$JGetM-> ( $pJ, 'ngrams', $type, $rank, $shape, $data );
my $len = unpack 'Q', unpack 'P8', $shape-> tostr( 8 );
my @ngrams = unpack '(Z*)*', unpack "P$len", $data-> tostr( 8 );
$JGetM-> ( $pJ, 'indxs', $type, $rank, $shape, $data );
$len = 8 * unpack 'Q', unpack 'P8', $shape-> tostr( 8 );
my @indxs = unpack 'Q*', unpack "P$len", $data-> tostr( 8 );
return \@ngrams, \@indxs
}
sub AnomalousMonk {
my @ngrams;
for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) {
my $m = $ngramWindow - 1;
my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }
+xms;
my @word_ngrams = $str =~ m{ (?= ($ngram)) }xmsg;
push @ngrams, @word_ngrams;
}
return \@ngrams;
}
sub johngg {
my ( @ngrams, @indxs );
for my $nWords ( $ngramWindow_MIN .. $ngramWindow_MAX )
{
my @words = split m{\s+}, $str;
my $start = 0;
while ( scalar @words >= $nWords )
{
push @indxs, $start ++;
push @ngrams, join " ", @words[ 0 .. $nWords - 1 ];
shift @words;
}
};
return \@ngrams, \@indxs
}
sub tybalt89 {
my ( @ngrams, @indxs );
$str =~ /(?<!\S)\S+(?: \S+){$low,$high}?(?!\S)(?{
push @ngrams, $&;
push @indxs, $` =~ tr| ||;
})(*FAIL)/;
return \@ngrams, \@indxs
}
__END__
# demo output skipped
Rate AnomalousMonk tybalt89 johngg
+ j
AnomalousMonk 4345/s -- -33% -34%
+ -59%
tybalt89 6468/s 49% -- -2%
+ -39%
johngg 6590/s 52% 2% --
+ -38%
j 10560/s 143% 63% 60%
+ --
I pumped difficulty level just very slightly up, or otherwise (as example in OP) it would be ridiculous to optimize what's very fast as is. Note, J sentence is interpreted every time, so to be fair I should have wrapped other players into string eval. I tried to preserve other monks code while bending it to serve "array of ngrams, array of indexes" goal, where possible. Sorry if I messed. As I understand, to modify J phrase to work with Unicode text and/or return character offsets would be easy. + Of course my J is absolutely unoptimized, as I'm total beginner. The moral, there is very powerful tool and now I know how to use it from Perl :)
Edit: fixed spelling, sorry.
| [reply] [d/l] [select] |
Re: improving speed in ngrams algorithm (updated)
by LanX (Saint) on Jun 11, 2019 at 11:04 UTC
|
Please ignore! Misunderstood question.
My answer treats ngrams on characters not words.
A regex should be faster, this demo in the debugger for n=3 should give you a start.
DB<30> $str = join "", a..l
DB<31> @res=()
DB<32> for my $start (0..2) { pos($str) =$start; push @res, $str =~
+m/(.{3})/g }
DB<33> x @res
0 'abc'
1 'def'
2 'ghi'
3 'jkl'
4 'bcd'
5 'efg'
6 'hij'
7 'cde'
8 'fgh'
9 'ijk'
NB:
- the order is not preserved
- you may want to change the regex to not match whitespaces or punctuation.
(I know it's possible in a single regex without looping over start by playing around with \K or similar.
I'll leave it to the regex gurus like tybalt to show it ;-)
HTH! :)
update
In case you want really want to include non-letters try unpack
| [reply] [d/l] |
|
A regex should be faster I would already doubt that a regex is faster than accessing array elements in normal circumstances, but here you seem to have missed the fact that the n-grams are made of words rather than chars. So your regex becomes:
/((\w+\s?){3})/g where each char of (part of) the string are checked to find spaces. In IB2017's solution this is done once by the split.
I know it's possible in a single regex without looping over start by playing around with \K or similar Look ahead assertions can help:
DB<7> say for 'perlmonks' =~ /(?=(.{3}))./g
per
erl
rlm
lmo
mon
onk
nks
But it becomes cumbersome when working with words /(?=((\w+\s?){3}))\w+/g and probably not faster.
In case you want really want to include non-letters try unpack unpack would probably be among the fastest solutions for character n-grams indeed. | [reply] [d/l] [select] |
|
Seems like I misread the sample code.
I saw split // not split / /
That's why added the NB part saying to exclude white spaces and punctuation (which isn't done in the OP s code)
I haven't run ° it but the code looks broken to me if the split wasn't meant to be per character. @string holding words doesn't make sense to me!
I don't think that you can effectively process a natural language without regex.
Update
°) I ran it on my mobile and the output shows that the OP is looking for n words in a row.
Hence we both misunderstood his definition of n gram
START INDEX: 0 :this is
START INDEX: 1 :is the
START INDEX: 2 :the text
START INDEX: 3 :text to
START INDEX: 4 :to play
START INDEX: 5 :play with
START INDEX: 0 :this is the
START INDEX: 1 :is the text
START INDEX: 2 :the text to
START INDEX: 3 :text to play
START INDEX: 4 :to play with
| [reply] [d/l] [select] |
|
|