PerlMonks  

Re: improving speed in ngrams algorithm (updated)

by LanX (Saint)
on Jun 11, 2019 at 11:04 UTC ( [id://11101227]=note )


in reply to improving speed in ngrams algorithm

Please ignore! Misunderstood question.

My answer treats n-grams of characters, not of words.


A regex should be faster; this demo in the debugger for n=3 should give you a start.

  DB<30> $str = join "", a..l
  DB<31> @res=()
  DB<32> for my $start (0..2) { pos($str) = $start; push @res, $str =~ m/(.{3})/g }
  DB<33> x @res
  0  'abc'
  1  'def'
  2  'ghi'
  3  'jkl'
  4  'bcd'
  5  'efg'
  6  'hij'
  7  'cde'
  8  'fgh'
  9  'ijk'

NB:

  • the order is not preserved
  • you may want to change the regex so that it does not match whitespace or punctuation.
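
For anyone who prefers a script to the debugger session above, here is a minimal sketch of the same loop for arbitrary n. The sub name char_ngrams is made up for illustration. It relies on a list-context m//g starting at the current pos(), exactly as the demo does, and it shares the demo's limitation that the result is grouped by start offset rather than by position.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# all character n-grams of $str, grouped by start offset 0 .. n-1.
sub char_ngrams {
    my ($str, $n) = @_;
    my @res;
    for my $start (0 .. $n - 1) {
        pos($str) = $start;              # list-context //g begins matching here
        push @res, $str =~ /(.{$n})/g;   # non-overlapping chunks of length $n
    }
    return @res;
}

print join(' ', char_ngrams(join('', 'a' .. 'l'), 3)), "\n";
```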

(I know it's possible in a single regex without looping over start by playing around with \K or similar. I'll leave it to the regex gurus like tybalt to show it ;-)

HTH! :)

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
FootballPerl is like chess, only without the dice

update

In case you really want to include non-letters, try unpack.
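
A minimal sketch of what the unpack route could look like, assuming plain ASCII input. The helper name char_ngrams_unpack is hypothetical, and this is only one way to do it. The template "(a3)*" cuts the string into chunks of three characters, so the sketch shifts the start with substr and drops the short chunk that may remain at the end.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# character n-grams via unpack, grouped by start offset 0 .. n-1.
sub char_ngrams_unpack {
    my ($str, $n) = @_;
    my @res;
    for my $start (0 .. $n - 1) {
        last if $start + $n > length $str;   # nothing of full length left
        # "(a$n)*" yields consecutive chunks of $n characters;
        # the final chunk can be shorter, so keep only full-length ones.
        push @res, grep { length($_) == $n }
                   unpack "(a$n)*", substr($str, $start);
    }
    return @res;
}

print join(' ', char_ngrams_unpack(join('', 'a' .. 'l'), 3)), "\n";
```

Unlike the regex with ".", this keeps whitespace, punctuation, and newlines as ordinary characters.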

Re^2: improving speed in ngrams algorithm (updated)
by Eily (Monsignor) on Jun 11, 2019 at 12:22 UTC

    A regex should be faster
    I would already doubt that a regex is faster than accessing array elements in normal circumstances, but here you seem to have missed the fact that the n-grams are made of words rather than chars. So your regex becomes /((\w+\s?){3})/g, where each char of (part of) the string is checked to find spaces. In IB2017's solution this is done once by the split.
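
To make the comparison concrete, here is a rough sketch of the split-once approach. It is not IB2017's actual code; the sub name word_ngrams and the sample sentence are assumptions for illustration. The string is scanned for whitespace a single time by split, and every n-gram after that is just an array slice.

```perl
use strict;
use warnings;

# Hypothetical helper (not the OP's code): word n-grams from array slices.
sub word_ngrams {
    my ($text, $n) = @_;
    my @words = split ' ', $text;    # whitespace is located once, here
    # One slice of $n consecutive words per start index; the range is
    # empty when there are fewer than $n words.
    return map { join ' ', @words[$_ .. $_ + $n - 1] } 0 .. @words - $n;
}

print "$_\n" for word_ngrams('this is the text to play with', 3);
```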

    I know it's possible in a single regex without looping over start by playing around with \K or similar
    Look ahead assertions can help:
    DB<7> say for 'perlmonks' =~ /(?=(.{3}))./g
    per
    erl
    rlm
    lmo
    mon
    onk
    nks
    But it becomes cumbersome when working with words: /(?=((\w+\s?){3}))\w+/g, and it is probably not faster.
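
A hedged sketch of the word variant, with one change from the pattern quoted above: because \s? is optional there, \w+ can backtrack near the end of the string and split a single word into several "words" (for example "play with" can match as "play ", "wit", "h"). Requiring \s between words avoids that. The sub name word_ngrams_regex is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch):
# overlapping word n-grams from a single lookahead regex.
sub word_ngrams_regex {
    my ($text, $n) = @_;
    my $more = $n - 1;    # number of words that must follow the first one
    # The lookahead captures $n words separated by single whitespace
    # characters; the trailing \w+ then consumes one word so the next
    # match starts at the following word.
    my @grams = $text =~ /(?=(\w+(?:\s\w+){$more}))\w+/g;
    return @grams;
}

print "$_\n" for word_ngrams_regex('this is the text to play with', 3);
```

This only handles words separated by exactly one whitespace character, which is one more reason the split-based version is easier to live with.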

    In case you really want to include non-letters, try unpack
    unpack would probably be among the fastest solutions for character n-grams indeed.

      Seems like I misread the sample code.

      I saw split // not split / /.

      That's why I added the NB part about excluding whitespace and punctuation (which isn't done in the OP's code).

      I haven't run it°, but the code looks broken to me if the split wasn't meant to be per character. @string holding words doesn't make sense to me!

      I don't think that you can effectively process a natural language without regex.

      Cheers Rolf

      Update

      °) I ran it on my mobile and the output shows that the OP is looking for n words in a row. Hence we both misunderstood his definition of n-gram.

      START INDEX: 0 :this is
      START INDEX: 1 :is the
      START INDEX: 2 :the text
      START INDEX: 3 :text to
      START INDEX: 4 :to play
      START INDEX: 5 :play with
      START INDEX: 0 :this is the
      START INDEX: 1 :is the text
      START INDEX: 2 :the text to
      START INDEX: 3 :text to play
      START INDEX: 4 :to play with
