
Re^3: please reply

by ww (Archbishop)
on Jan 06, 2013 at 20:50 UTC (#1011917)

in reply to Re^2: please reply
in thread please reply

I second CountZero's objection to the lack of input data.

However, when I run your code against a test file (multi-character strings, some of which satisfy one or the other of two contradictory definitions of bigrams*, plus some non-bigram strings such as "nobigram", each separated by 2 newlines), it returns " : 1" for pairs of words, each word concatenated with the word that follows it (so "nobigram" shows up first appended to the preceding word and then prepended to the following word).

# output:
Bigrams
aabaaacc : 1
ccnobigram : 1
nobigramabcda : 1
abcdabcdaaaa : 1
bcdaaaanobigram : 1
nobigrambbhbbbb : 1
bbhbbbbbbb : 1
bbbaba : 1
abaaaaabbb : 1
aaaabbbcccccccccc : 1
cccccccccccbbbaaaccc : 1
cbbbaaacccccbbaaccc : 1

I cannot reproduce your "loop is ended after splitting" except (and then not perfectly) by removing the doubled curly braces at lines 9-10 and 22-23, in which case only a single instance of the final two words of the sample data is returned (along with the " : 1"). (Update: That's not a correct count under either of the definitions of a bigram cited here.)

As to "why found is used," it seems possible, in the overly limited context you've provided, that it's intended to be a counter: a variable in which to stash the number of bigrams found. I realize that seems excessively obvious, but, IMO, it's the only obvious possible answer.
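To illustrate the counter idea, here is a minimal sketch. It is NOT your code (which I can only guess at); the sub name count_word_pairs and the hash %found are hypothetical names of mine, and I'm assuming "bigram" means a pair of adjacent words, as the output above suggests.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the code under discussion, which is not shown here.
# %found acts as the counter: one slot per adjacent word pair, bumped each
# time that pair is seen.
sub count_word_pairs {
    my @words = @_;
    my %found;
    for my $i ( 0 .. $#words - 1 ) {
        # Join each word to the word that follows it, then count the result.
        $found{ $words[$i] . $words[ $i + 1 ] }++;
    }
    return \%found;
}

my $found = count_word_pairs(qw(aabaaa cc nobigram abcda));
print "$_ : $found->{$_}\n" for sort keys %$found;
```

With input in which no adjacent pair repeats, every count stays at 1, which would account for a column of " : 1".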

In any case, if counting is your intent, please see (or some fork for the language in which your interests lie).

*Definitions vary:

  • Wikipedia says "A bigram or digram is every sequence of two adjacent elements in a string of tokens, which are typically letters, syllables, or words; they are n-grams for n=2."
  • The Free OnLine Dictionary defines a bigram as a two-letter word (FOL is NOT, IMO, a reliable source, but Merriam-Webster and others define bigram only for those using paid access or their (one-shot) free trial).
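The difference between the two definitions is easy to see in code. This is only a sketch under my own assumptions: letter_bigrams and is_two_letter_word are hypothetical helper names, and I'm taking "elements" in the first definition to mean letters.

```perl
use strict;
use warnings;

# Definition 1 (Wikipedia): every sequence of two adjacent elements.
# Assumption: the elements are the letters of a single word.
sub letter_bigrams {
    my ($word) = @_;
    return map { substr( $word, $_, 2 ) } 0 .. length($word) - 2;
}

# Definition 2 (Free OnLine Dictionary): a two-letter word.
sub is_two_letter_word {
    my ($word) = @_;
    return $word =~ /\A[[:alpha:]]{2}\z/ ? 1 : 0;
}

for my $word (qw(cc aba bbb)) {
    my %count;
    $count{$_}++ for letter_bigrams($word);
    my $pairs = join ', ', map {"$_=$count{$_}"} sort keys %count;
    printf "%-4s two-letter word: %s; letter pairs: %s\n",
        $word, ( is_two_letter_word($word) ? 'yes' : 'no' ), $pairs;
}
```

Under the first definition "bbb" contains the pair "bb" twice; under the second, only "cc" qualifies at all. Which one you mean changes what a "correct count" is.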

For clarity, here is the content (verbatim) of the text file:

aabaaa cc nobigram abcda bcdaaaa nobigram bbhbbbb bbb aba aaaabbb cccccccccc cbbbaaaccc ccbbaaccc
