Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re: finding number of contiguous letters

by graff (Chancellor)
on May 23, 2007 at 05:36 UTC ( #616940=note: print w/replies, xml ) Need Help??


in reply to finding number of contiguous letters

If/when someone posts a Benchmark in this thread, I would predict that a substr() approach will come out ahead. But you might also want to look at Text::Ngram -- maybe instead of an array of 3-grams, you actually want a hash, which has each 3-gram as a hash key, and its frequency of occurrence in the string as the value. From the module's man page:
use Text::Ngram qw(ngram_counts add_to_counts); my $text = "abcdefghijklmnop"; my $hash_r = ngram_counts($text, 3); # Window size = 3 # $hash_r => { abc => 1, bcd => 1, ... }
Of course, if you need to keep track of the order in which the 3-grams occur, having them stored as hash keys would not be very useful.

(BTW, you said "I need to take a string and find the number of 3 contiguous letters" -- to be precise, the number of 3-grams in a string is simply  length( $string ) - 2 ;)

Replies are listed 'Best First'.
Re^2: finding number of contiguous letters
by GrandFather (Sage) on May 23, 2007 at 08:11 UTC

    Right you are ;)

    use warnings; use strict; use Benchmark qw(cmpthese); my $str = "Just Another Perl Hacker"; print join " ", do_substr (), "\n"; print join " ", do_regex (), "\n"; print join " ", do_unpack (), "\n"; print "\n"; cmpthese (-1, { unpack => \&do_unpack, substr => \&do_substr, regex => \&do_regex, } ); sub do_substr { my @parts; push @parts, substr $str, $_, 3 for 0 .. length ($str) - 3; return @parts; } sub do_regex { my @parts = $str =~ /(?=(...))/g; return @parts; } sub do_unpack { my $matches = length ($str) - 2; my $temp = "a3XX" x $matches; my @parts = unpack ($temp, $str); return @parts; }

    Prints:

    Jus ust st t A An Ano not oth the her er r P Pe Per erl rl l H H +a Hac ack cke ker Jus ust st t A An Ano not oth the her er r P Pe Per erl rl l H H +a Hac ack cke ker Jus ust st t A An Ano not oth the her er r P Pe Per erl rl l H H +a Hac ack cke ker Rate regex unpack substr regex 42045/s -- -12% -39% unpack 47764/s 14% -- -31% substr 69161/s 64% 45% --

    Update: added unpack


    DWIM is Perl's answer to Gödel

      I beg to differ...

      For this substr based implementation it's important to have the temporary array @parts. For the regex it's unimportant! So if we leave the assignment to the array out, the regex is way faster!

      Update: Can this be due to the fact that the functions might be called in a scalar context?

      use warnings; use strict; use Benchmark qw(cmpthese); my $str = "Just Another Perl Hacker"; print join " ", do_mapstr (), "\n"; print join " ", do_substr (), "\n"; print join " ", do_regex (), "\n"; print "\n"; cmpthese (-1, { mapstr => \&do_mapstr, substr => \&do_substr, regex => \&do_regex, } ); sub do_mapstr { return map substr($str,$_-3,3),(3..length $str); } sub do_substr { my @parts; push @parts, substr $str, $_, 3 for 0 .. length ($str) - 3; return @parts; } sub do_regex { return $str =~ /(?=(...))/g; }

      Result

      Jus ust st t A An Ano not oth the her er r P Pe Per erl rl l H H +a Hac ack cke ker Jus ust st t A An Ano not oth the her er r P Pe Per erl rl l H H +a Hac ack cke ker Jus ust st t A An Ano not oth the her er r P Pe Per erl rl l H H +a Hac ack cke ker Rate substr mapstr regex substr 25123/s -- -48% -97% mapstr 48651/s 94% -- -95% regex 1003705/s 3895% 1963% --

      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      GrandFather,
      If you are feeling bored, I would be interested in seeing how my unpack solution I came up with for Challenge: Fast Common Substrings does. In a nutshell, we programmatically generate a template to extract all substrings of a given length.

      Cheers - L~R

        More interested than bored. Note that removing the template generation from do_update makes no significant difference to the result.


        DWIM is Perl's answer to Gödel
Re^2: finding number of contiguous letters
by andreas1234567 (Vicar) on May 23, 2007 at 11:10 UTC
    ... the number of 3-grams in a string is simply length( $string ) - 2 ;)
    <picky_mode>
    Only assuming the string does not contain whitespace.
    </picky_mode>
    use strict; print map { (length($_) - 2) . "\n" } split('\s+', "Just Another Perl Hacker"); __END__ 2 5 2 4
      Depending on the application, it might be appropriate to include spaces (and/or punctuation) in the ngram list:
      jus ust st t a an ano not ...
      <picky_mode>
      Only assuming the string does not contain whitespace.
      </picky_mode>

      Why so? If we're dealing with plain sequences of letter from an alphabet, of which one thing called "whitespace" is part, than the latter should not be special in anyway. If you have some paticular application in mind, then YMMV. To quote from Wikipedia:

      For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth. Some practitioners preprocess strings to remove spaces, most simply collapse whitespace to a single space while preserving paragraph marks.
        I always thought whitespace was not a letter. I guess I was wrong.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://616940]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2020-10-21 19:01 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (223 votes). Check out past polls.

    Notices?