A solution using split, array slices and shift. I have no idea whether it is fast or slow, as I haven't run any benchmarks.
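use 5.026;
use warnings;

my $text = q{this is the text to play with};

for ( 1 .. 8 )
{
    say qq{$_-word ngrams of '$text'};
    say for nGramWords( $_, $text );
    say q{-} x 20;
}

# Return every run of $nWords consecutive words in $string, each
# labelled with the index of its first word.
sub nGramWords
{
    my( $nWords, $string ) = @_;
    return if $nWords < 1;    # a zero-word window would loop forever

    my @words = split m{\s+}, $string;
    my $start = 0;
    my @nGrams;

    # Take the first $nWords words, then drop the leading word and
    # repeat until too few words remain to fill the window.
    while ( scalar @words >= $nWords )
    {
        push @nGrams, join q{ },
            qq{START INDEX: @{ [ $start ++ ] } :},
            @words[ 0 .. $nWords - 1 ];
        shift @words;
    }
    return @nGrams;
}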
The output.
1-word ngrams of 'this is the text to play with'
START INDEX: 0 : this
START INDEX: 1 : is
START INDEX: 2 : the
START INDEX: 3 : text
START INDEX: 4 : to
START INDEX: 5 : play
START INDEX: 6 : with
--------------------
2-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is
START INDEX: 1 : is the
START INDEX: 2 : the text
START INDEX: 3 : text to
START INDEX: 4 : to play
START INDEX: 5 : play with
--------------------
3-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the
START INDEX: 1 : is the text
START INDEX: 2 : the text to
START INDEX: 3 : text to play
START INDEX: 4 : to play with
--------------------
4-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the text
START INDEX: 1 : is the text to
START INDEX: 2 : the text to play
START INDEX: 3 : text to play with
--------------------
5-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the text to
START INDEX: 1 : is the text to play
START INDEX: 2 : the text to play with
--------------------
6-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the text to play
START INDEX: 1 : is the text to play with
--------------------
7-word ngrams of 'this is the text to play with'
START INDEX: 0 : this is the text to play with
--------------------
8-word ngrams of 'this is the text to play with'
--------------------
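Since the speed question is still open, here is a minimal sketch of how one might measure it with the core Benchmark module. The sub names nGramsShift and nGramsIndex, and the index-based variant itself, are my own illustration rather than anything tested above. The START INDEX label is left out so that only the slicing is timed. I quote no timings because they will vary by machine and perl version.

```perl
use 5.026;
use warnings;
use Benchmark qw{ cmpthese };

# The shift-based approach from the post, minus the START INDEX label.
sub nGramsShift
{
    my( $nWords, $string ) = @_;
    my @words = split m{\s+}, $string;
    my @nGrams;
    while ( scalar @words >= $nWords )
    {
        push @nGrams, join q{ }, @words[ 0 .. $nWords - 1 ];
        shift @words;
    }
    return @nGrams;
}

# Hypothetical variant: slide an index over the array instead of
# shifting words off the front.
sub nGramsIndex
{
    my( $nWords, $string ) = @_;
    my @words = split m{\s+}, $string;
    return map { join q{ }, @words[ $_ .. $_ + $nWords - 1 ] }
        0 .. @words - $nWords;
}

# A longer input so that each call does a measurable amount of work.
my $longText = join q{ }, ( q{this is the text to play with} ) x 50;

# Run each candidate for at least one CPU second and print a
# comparison table.
cmpthese( -1, {
    by_shift => sub { my @n = nGramsShift( 3, $longText ) },
    by_index => sub { my @n = nGramsIndex( 3, $longText ) },
} );
```

Both subs should return the same list for the same arguments, so whichever comes out ahead on your machine can be swapped in without changing the results.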
I hope this is of interest.