Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

improving speed in ngrams algorithm

by IB2017 (Monk)
on Jun 11, 2019 at 10:37 UTC ( #11101225=perlquestion: print w/replies, xml ) Need Help??

IB2017 has asked for the wisdom of the Perl Monks concerning the following question:

Given string with text, I need to create n-grams of predefined lengths. I came up with the following. Any suggestions on how to improve it (being speed an important factor in my process?). The sentence, i.e. the array will contain typically 5-15 elements.

use strict; use warnings; my $sentence = "this is the text to play with"; my @string = split / /, $sentence; my $ngramWindow_MIN = 2; my $ngramWindow_MAX = 3; for ($ngramWindow_MIN .. $ngramWindow_MAX){ my $ngramWindow=$_; my $sizeString = (@string) - $ngramWindow; foreach (0 .. $sizeString){ print "START INDEX: $_ :"; print "@string[$_..($_+$ngramWindow-1)]\n"; } }

Replies are listed 'Best First'.
Re: improving speed in ngrams algorithm
by holli (Monsignor) on Jun 11, 2019 at 10:52 UTC
    I would start with not asking the array all the time how big it is. It doesn't change, so cache that value. Replacing the ranges with classic for loops might speed things up. I doubt a substr based solution be any faster. This is simple enough to do it in C though, even for a novice. If it's really time critical I would write it as XS.


    holli

    You can lead your users to water, but alas, you cannot drown them.
      “...do it in C though, even for a novice. If it's really time critical I would write it as XS.“

      This statement is not good just crap, sorry. Because nothing with C/XS is simple for a novice. Best regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

        The OP didn't ask about "easy" but "speed".

        I'm sure there are already ready to use C programs for this available.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: improving speed in ngrams algorithm (updated)
by vr (Deacon) on Jun 11, 2019 at 12:34 UTC
    Worth a try, since regexp compiled just once...
    use strict; use warnings; use feature 'say'; my $sentence = "this is the text to play with"; my $ngramWindow_MIN = 2; my $ngramWindow_MAX = 3; my $word = qr/(\b\S+)(?:\s|$)/; # "word" is $1, rather my $ngram = join '', $word x $ngramWindow_MIN, qr/(?:$word)?/ x ( $ngramWindow_MAX - $ngramWindow_MIN ); my $re = qr/$ngram(?{ say "@{^CAPTURE}" # or do anything with @words i.e. @{^CAPTURE} })(*F)/; $sentence =~ /$re/g;

    Update. I noticed that my solution uses variable introduced in 5.25.7. Then I looked up the %- hash, which, by description of it, should provide similar access and was there since 5.10. While it does indeed work, if combined in obvious way with grep to filter undefined values, it looks to me that length of array associated with named capture being not reset is a bug. The %{^CAPTURE_ALL} should be identical to %-, but looks to suffer from another bug even more. Tested in 5.28.0.

    use strict; use warnings; use Data::Dump 'dd'; my $sentence = "this is the text to play with"; my $ngramWindow_MIN = 2; my $ngramWindow_MAX = 3; my $word = qr/(?<word>\b\S+)(?:\s|$)/; my $ngram = join '', $word x $ngramWindow_MIN, qr/(?:$word)?/ x ( $ngramWindow_MAX - $ngramWindow_MIN ); my $re = qr/$ngram(?{ dd \@{^CAPTURE}; dd $-{word}; dd ${^CAPTURE_ALL}{word}; print "\n"; })(*F)/; $sentence =~ /$re/g; __END__ ["this", "is", "the"] ["this", "is", "the"] "this" ["this", "is"] ["this", "is", undef] "this" ["is", "the", "text"] ["is", "the", "text"] "is" ["is", "the"] ["is", "the", undef] "is" ["the", "text", "to"] ["the", "text", "to"] "the" ["the", "text"] ["the", "text", undef] "the" ["text", "to", "play"] ["text", "to", "play"] "text" ["text", "to"] ["text", "to", undef] "text" ["to", "play", "with"] ["to", "play", "with"] "to" ["to", "play"] ["to", "play", undef] "to" ["play", "with"] ["play", "with", undef] "play"

    Update 2. See %{^CAPTURE}, %{^CAPTURE_ALL} and %- don't produce expected output.

      Can you describe the bug using a simpler example, maybe as a new SoPW node? I'm not sure I understand - please include the expected output, too.

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: improving speed in ngrams algorithm
by Eily (Monsignor) on Jun 11, 2019 at 12:35 UTC

    (being speed an important factor in my process?)
    Are you sure? Did you actually notice that the program was too slow, or do you just think being fast might be a good thing? 5-15 elements by array is very little. So unless you process several thousands of strings, micro optimizations won't have a noticeable impact. And if you process that many strings and see that the program is slow, there might be better places to improve it than this quite basic code.

    Like I said, it won't change much but one thing you can do is change

    for ($ngramWindow_MIN .. $ngramWindow_MAX){ my $ngramWindow=$_;
    into
    for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) {

Re: improving speed in ngrams algorithm
by tybalt89 (Parson) on Jun 12, 2019 at 09:02 UTC

    Benchmarking left to someone who cares :)

    #!/usr/bin/perl # https://perlmonks.org/?node_id=11101225 use strict; use warnings; my $sentence = "this is the text to play with"; my $ngramWindow_MIN = 2; my $ngramWindow_MAX = 3; my ($low, $high) = ($ngramWindow_MIN - 1, $ngramWindow_MAX - 1); $sentence =~ /(?<!\S)\S+(?: \S+){$low,$high}?(?!\S)(?{ print "START INDEX: @{[$` =~ tr| || ]} : $&\n" })(*FAIL)/;

    Outputs (same lines, slightly different order) :

    START INDEX: 0 : this is START INDEX: 0 : this is the START INDEX: 1 : is the START INDEX: 1 : is the text START INDEX: 2 : the text START INDEX: 2 : the text to START INDEX: 3 : text to START INDEX: 3 : text to play START INDEX: 4 : to play START INDEX: 4 : to play with START INDEX: 5 : play with
Re: improving speed in ngrams algorithm
by AnomalousMonk (Chancellor) on Jun 11, 2019 at 16:03 UTC

    The  (?= (pattern)) overlapping capture trick may be useful here.

    If you need to capture info like offsets | character offsets within the sentence, maybe something like:

    c:\@Work\Perl\monks>perl -wMstrict -le "my $sentence = 'this is the text to play with'; ;; my $ngramWindow_MIN = 1; my $ngramWindow_MAX = 3; ;; my @word_ngrams; ;; for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) { my $m = $ngramWindow - 1; my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms; ;; while ($sentence =~ m{ (?= ($ngram)) }xmsg) { push @word_ngrams, [ $1, $-[1] ]; } } ;; for my $ar_wng (@word_ngrams) { my ($word_ngram, $sentence_offset) = @$ar_wng; print qq{'$word_ngram' at sentence offset $sentence_offset}; } " 'this' at sentence offset 0 'is' at sentence offset 5 'the' at sentence offset 8 'text' at sentence offset 12 'to' at sentence offset 17 'play' at sentence offset 20 'with' at sentence offset 25 'this is' at sentence offset 0 'is the' at sentence offset 5 'the text' at sentence offset 8 'text to' at sentence offset 12 'to play' at sentence offset 17 'play with' at sentence offset 20 'this is the' at sentence offset 0 'is the text' at sentence offset 5 'the text to' at sentence offset 8 'text to play' at sentence offset 12 'to play with' at sentence offset 17

    If it's all you need, it would be faster to capture "naked" n-grams:

    c:\@Work\Perl\monks>perl -wMstrict -le "my $sentence = 'this is the text to play with'; ;; my $ngramWindow_MIN = 1; my $ngramWindow_MAX = 3; ;; for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) { print qq{$ngramWindow-word ngrams of '$sentence'}; my $m = $ngramWindow - 1; my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b }xms; ;; my @word_ngrams = $sentence =~ m{ (?= ($ngram)) }xmsg; ;; for my $word_ngram (@word_ngrams) { print qq{ '$word_ngram'}; } } " 1-word ngrams of 'this is the text to play with' 'this' 'is' 'the' 'text' 'to' 'play' 'with' 2-word ngrams of 'this is the text to play with' 'this is' 'is the' 'the text' 'text to' 'to play' 'play with' 3-word ngrams of 'this is the text to play with' 'this is the' 'is the text' 'the text to' 'text to play' 'to play with'

    Tested under Perl version 5.8.9. (I haven't done any Benchmark-ing on any of this :)


    Give a man a fish:  <%-{-{-{-<

Re: improving speed in ngrams algorithm
by karlgoethebier (Monsignor) on Jun 11, 2019 at 13:49 UTC

    See also Text::Ngrams. Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: improving speed in ngrams algorithm
by johngg (Canon) on Jun 12, 2019 at 10:21 UTC

    A solution using split, array slices and shift. No idea if it is fast or slow as I haven't run any benchmarks.

    use 5.026; use warnings; my $text = q{this is the text to play with}; for ( 1 .. 8 ) { say qq{$_-word ngrams of '$text'}; say for nGramWords( $_, $text ); say q{-} x 20; } sub nGramWords { my( $nWords, $string ) = @_; my @words = split m{\s+}, $string; my $start = 0; my @nGrams; while ( scalar @words >= $nWords ) { push @nGrams, join q{ }, qq{START INDEX: @{ [ $start ++ ] } : }, @words[ 0 .. $nWords - 1 ]; shift @words; } return @nGrams; }

    The output.

    1-word ngrams of 'this is the text to play with' START INDEX: 0 : this START INDEX: 1 : is START INDEX: 2 : the START INDEX: 3 : text START INDEX: 4 : to START INDEX: 5 : play START INDEX: 6 : with -------------------- 2-word ngrams of 'this is the text to play with' START INDEX: 0 : this is START INDEX: 1 : is the START INDEX: 2 : the text START INDEX: 3 : text to START INDEX: 4 : to play START INDEX: 5 : play with -------------------- 3-word ngrams of 'this is the text to play with' START INDEX: 0 : this is the START INDEX: 1 : is the text START INDEX: 2 : the text to START INDEX: 3 : text to play START INDEX: 4 : to play with -------------------- 4-word ngrams of 'this is the text to play with' START INDEX: 0 : this is the text START INDEX: 1 : is the text to START INDEX: 2 : the text to play START INDEX: 3 : text to play with -------------------- 5-word ngrams of 'this is the text to play with' START INDEX: 0 : this is the text to START INDEX: 1 : is the text to play START INDEX: 2 : the text to play with -------------------- 6-word ngrams of 'this is the text to play with' START INDEX: 0 : this is the text to play START INDEX: 1 : is the text to play with -------------------- 7-word ngrams of 'this is the text to play with' START INDEX: 0 : this is the text to play with -------------------- 8-word ngrams of 'this is the text to play with' --------------------

    I hope this is of interest.

    Cheers,

    JohnGG

Re: improving speed in ngrams algorithm (benchmark time! kindof)
by vr (Deacon) on Jun 20, 2019 at 15:35 UTC

    What if, while working environment is familiar Perl program, we use something completely different (benchmarks below, vocabulary here, or revised one there)?

    ]s =. 'the text to play with' NB. String the text to play with ]L =. 2 3 4 NB. Ngram lengths 2 3 4 NB. Running a little ahead, the sentence to get us each ngram in it +s box, NB. with each word nicely boxed: ;L ([:<<\)"(0 _)(<;._2) s,' ' +----------+---------+---------+-----------+-------------+------------ +--+--------------+------------------+-------------------+ |+---+----+|+----+--+|+--+----+|+----+----+|+---+----+--+|+----+--+--- +-+|+--+----+----+|+---+----+--+----+|+----+--+----+----+| ||the|text|||text|to|||to|play|||play|with|||the|text|to|||text|to|pla +y|||to|play|with|||the|text|to|play|||text|to|play|with|| |+---+----+|+----+--+|+--+----+|+----+----+|+---+----+--+|+----+--+--- +-+|+--+----+----+|+---+----+--+----+|+----+--+----+----+| +----------+---------+---------+-----------+-------------+------------ +--+--------------+------------------+-------------------+ NB. But, this structure would mean a lot of unpleasant pointer arit +hmetic NB. on Perl side (indirect, indirect, offset, indirect, etc.). It s +lows things, too. NB. Instead, let's have best of both worlds, by strictly adhering t +o NB. task in OP -- let each ngram be a string, let's append ASCII-0 +to each, catenate, NB. return pointer to Perl, and just unpack with '(Z*)*'. NB. Little explanation (s is string, L is ngram lengths, as above): ]w =. <;.2 s,' ' NB. Words +----+-----+---+-----+-----+ |the |text |to |play |with | +----+-----+---+-----+-----+ NB. ;.2 - cut by last item (char), keeping it NB. , - append, < - box L (<@:,&(35}a.)@:}:@:;)\ w +-----------------+------------------+-------------+----------+ |the text# |text to# |to play# |play with#| +-----------------+------------------+-------------+----------+ |the text to# |text to play# |to play with#| | +-----------------+------------------+-------------+----------+ |the text to play#|text to play with#| | | +-----------------+------------------+-------------+----------+ NB. \ - apply verb to overlapping infixes of lengths L of w NB. @: - composition, & - curry, and read right to left: NB. ; - raze (unbox items), }: - curtail, NB. 35}a. - take element from ASCII table, NB. 35 just for sake of display, it will be 0 in practice NB. Almost there ;a:-.~,L (<@:,&(35}a.)@:}:@:;)\ w the text#text to#to play#play with#the text to#text to play#to play wi +th#the text to play#text to play with# NB. , - ravel (flatten shape) NB. ~ - swap left/right operands (just easier to write) NB. -. - exclude. What to exclude? NB. a: is called "Ace", it's an empty box NB. But OP wanted indexes of words at ngram starts. It's easy: i.# w 0 1 2 3 4 NB. i. - integer list, # - number of items. NB. Then simple, but vectorized, arithmetic. 2 > L -"0 _ i.# w 0 1 1 1 1 0 0 1 1 1 0 0 0 1 1 NB. Then, for each row of matrix above: NB. |. - reverse, I. - indexes of true elements, NB. >: - increment (because short lines are padded with zeroes) (>:@:I.@:|.)"1 ]2 > L -"0 _ i.# w 1 2 3 4 1 2 3 0 1 2 0 0 <:0-.~,(>:@:I.@:|.)"1 ]2> L -"0 _ i.# w 0 1 2 3 0 1 2 0 1 NB. Above -- ravel, exclude zeroes, decrement. NB. So, phrases to get us (catenated) ngrams and list of starting i +ndexes: ;a:-.~,L (<@:,&(35}a.)@:}:@:;)\ w the text#text to#to play#play with#the text to#text to play#to play wi +th#the text to play#text to play with# <:0-.~,(>:@:I.@:|.)"1 ]2> L -"0 _ i.# w 0 1 2 3 0 1 2 0 1 NB. What about sorting not by ngram length, but starting index? NB. Nothing easier: transpose (|:) at appropriate moment: ;a:-.~,|:L (<@:,&(35}a.)@:}:@:;)\ w the text#the text to#the text to play#text to#text to play#text to pla +y with#to play#to play with#play with# <:0-.~,|:(>:@:I.@:|.)"1 ]2> L -"0 _ i.# w 0 0 0 1 1 1 2 2 3

    Perl:

    use strict; use warnings; use utf8; use feature 'say'; use Data::Dump 'dd'; use FFI::Raw; use Config; use Benchmark 'cmpthese'; die unless $Config{ ptrsize } == 8; # some numbers hardcoded below my $str = join ' ', ('this is the text to play with') x 5; my $ngramWindow_MIN = 2; my $ngramWindow_MAX = 6; my ( $low, $high ) = ( $ngramWindow_MIN - 1, $ngramWindow_MAX - 1 ); my $i_ = FFI::Raw::ulong; my $s_ = FFI::Raw::str; my $p_ = FFI::Raw::ptr; my $dll = "$ENV{HOME}/j64-807/bin/j.dll"; # libj.so"; my $pJ = FFI::Raw-> new( $dll, 'JInit', $i_ )-> call or die; my $cmd = 'ngrams =: ;a:-.~,L (<@:,&(0}a.)@:}:@:;)\ w '. '[indxs =: <:0-.~,(>:@:I.@:|.)"1 ]2> L -"0 _ i.# w '. '[w =: <;.2 s '. "[L =: @{[ $ngramWindow_MIN .. $ngramWindow_MAX ]} ". "[s =: '$str',' '"; my $JDo = FFI::Raw-> new( $dll, 'JDo', $i_, $i_, $s_ )-> coderef; my $JGetM = FFI::Raw-> new( $dll, 'JGetM', $i_, $i_, $s_, ( $p_ ) x 4 +)-> coderef; my $type = FFI::Raw::memptr( 8 ); my $rank = FFI::Raw::memptr( 8 ); my $shape = FFI::Raw::memptr( 8 ); my $data = FFI::Raw::memptr( 8 ); # Demo dd j(); # Benchmark cmpthese -2, { j => \&j, AnomalousMonk => \&AnomalousMonk, johngg => \&johngg, tybalt89 => \&tybalt89, }; sub j { $JDo-> ( $pJ, $cmd ); $JGetM-> ( $pJ, 'ngrams', $type, $rank, $shape, $data ); my $len = unpack 'Q', unpack 'P8', $shape-> tostr( 8 ); my @ngrams = unpack '(Z*)*', unpack "P$len", $data-> tostr( 8 ); $JGetM-> ( $pJ, 'indxs', $type, $rank, $shape, $data ); $len = 8 * unpack 'Q', unpack 'P8', $shape-> tostr( 8 ); my @indxs = unpack 'Q*', unpack "P$len", $data-> tostr( 8 ); return \@ngrams, \@indxs } sub AnomalousMonk { my @ngrams; for my $ngramWindow ($ngramWindow_MIN .. $ngramWindow_MAX) { my $m = $ngramWindow - 1; my $ngram = qr{ \b [[:alpha:]]+ (?: \s+ [[:alpha:]]+){$m} \b } +xms; my @word_ngrams = $str =~ m{ (?= ($ngram)) }xmsg; push @ngrams, @word_ngrams; } return \@ngrams; } sub johngg { my ( @ngrams, @indxs ); for my $nWords ( $ngramWindow_MIN .. $ngramWindow_MAX ) { my @words = split m{\s+}, $str; my $start = 0; while ( scalar @words >= $nWords ) { push @indxs, $start ++; push @ngrams, join " ", @words[ 0 .. $nWords - 1 ]; shift @words; } }; return \@ngrams, \@indxs } sub tybalt89 { my ( @ngrams, @indxs ); $str =~ /(?<!\S)\S+(?: \S+){$low,$high}?(?!\S)(?{ push @ngrams, $&; push @indxs, $` =~ tr| ||; })(*FAIL)/; return \@ngrams, \@indxs } __END__ # demo output skipped Rate AnomalousMonk tybalt89 johngg + j AnomalousMonk 4345/s -- -33% -34% + -59% tybalt89 6468/s 49% -- -2% + -39% johngg 6590/s 52% 2% -- + -38% j 10560/s 143% 63% 60% + --

    I pumped difficulty level just very slightly up, or otherwise (as example in OP) it would be ridiculous to optimize what's very fast as is. Note, J sentence is interpreted every time, so to be fair I should have wrapped other players into string eval. I tried to preserve other monks code while bending it to serve "array of ngrams, array of indexes" goal, where possible. Sorry if I messed. As I understand, to modify J phrase to work with Unicode text and/or return character offsets would be easy. + Of course my J is absolutely unoptimized, as I'm total beginner. The moral, there is very powerful tool and now I know how to use it from Perl :)

    Edit: fixed spelling, sorry.

Re: improving speed in ngrams algorithm (updated)
by LanX (Archbishop) on Jun 11, 2019 at 11:04 UTC
    Please ignore! Misunderstood question.

    My answer treats ngrams on characters not words.


    A regex should be faster, this demo in the debugger for n=3 should give you a start.

    DB<30> $str = join "", a..l DB<31> @res=() DB<32> for my $start (0..2) { pos($str) =$start; push @res, $str =~ +m/(.{3})/g } DB<33> x @res 0 'abc' 1 'def' 2 'ghi' 3 'jkl' 4 'bcd' 5 'efg' 6 'hij' 7 'cde' 8 'fgh' 9 'ijk'

    NB:

    • the order is not preserved
    • you may want to change the regex to not match whitespaces or punctuation.

    (I know it's possible in a single regex without looping over start by playing around with \K or similar. I'll leave it to the regex gurus like tybalt to show it ;-)

    HTH! :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

    update

    In case you want really want to include non-letters try unpack

      A regex should be faster
      I would already doubt that a regex is faster than accessing array elements in normal circumstances, but here you seem to have missed the fact that the n-grams are made of words rather than chars. So your regex becomes: /((\w+\s?){3})/g where each char of (part of) the string are checked to find spaces. In IB2017's solution this is done once by the split.

      I know it's possible in a single regex without looping over start by playing around with \K or similar
      Look ahead assertions can help:
      DB<7> say for 'perlmonks' =~ /(?=(.{3}))./g per erl rlm lmo mon onk nks
      But it becomes cumbersome when working with words /(?=((\w+\s?){3}))\w+/g and probably not faster.

      In case you want really want to include non-letters try unpack
      unpack would probably be among the fastest solutions for character n-grams indeed.

        Seems like I misread the sample code.

        I saw split // not split / /

        That's why added the NB part saying to exclude white spaces and punctuation (which isn't done in the OP s code)

        I haven't run ° it but the code looks broken to me if the split wasn't meant to be per character. @string holding words doesn't make sense to me!

        I don't think that you can effectively process a natural language without regex.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

        Update

        °) I ran it on my mobile and the output shows that the OP is looking for n words in a row. Hence we both misunderstood his definition of n gram

        START INDEX: 0 :this is START INDEX: 1 :is the START INDEX: 2 :the text START INDEX: 3 :text to START INDEX: 4 :to play START INDEX: 5 :play with START INDEX: 0 :this is the START INDEX: 1 :is the text START INDEX: 2 :the text to START INDEX: 3 :text to play START INDEX: 4 :to play with

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://11101225]
Approved by johngg
Front-paged by haukex
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (8)
As of 2019-06-27 02:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Is there a future for codeless software?



    Results (111 votes). Check out past polls.

    Notices?