I second CountZero's objection to the lack of input data.
However, when I run your code on a test file (multi-character strings, some of which satisfy one or the other of two contradictory definitions of bigrams*, plus some non-bigram strings such as "nobigram", each separated by 2 newlines), it returns " : 1" for pairs of words -- each word concatenated with the word that follows it. As a result, every word except the first and last shows up twice: "nobigram", for example, is first concatenated with the preceding word and then with the following word.
# output:
Bigrams
aabaaacc : 1
ccnobigram : 1
nobigramabcda : 1
abcdabcdaaaa : 1
bcdaaaanobigram : 1
nobigrambbhbbbb : 1
bbhbbbbbbb : 1
bbbaba : 1
abaaaaabbb : 1
aaaabbbcccccccccc : 1
cccccccccccbbbaaaccc : 1
cbbbaaacccccbbaaccc : 1
I cannot reproduce your "loop is ended after splitting" EXCEPT (and then not perfectly) by removing the doubled curly braces at Lns 9-10 and 22-23, in which case only a single instance of the final two words of the sample data is returned (along with the " : 1"). (Update: That's not a correct count under either definition cited for a bigram.)
As to "why found is used," it seems possible, in the overly limited context you've provided, that it's intended to be a counter -- a variable in which to stash the number of bigrams found. I realize that seems excessively obvious, but, IMO, it's the only obvious possible answer.
In any case, if counting is your intent, please see http://search.cpan.org/~emorgan/Lingua-EN-Bigram-0.01/lib/Lingua/EN/Bigram.pm (or some fork for the language in which your interests lie).
*Definitions vary:
- Wikipedia says "A bigram or digram is every sequence of two adjacent elements in a string of tokens, which are typically letters, syllables, or words; they are n-grams for n=2."
while
- The Free OnLine Dictionary defines a bigram as a two-letter word. (FOL is NOT, IMO, a reliable source, but Merriam-Webster and others define bigram only for those using paid access or their (one-shot) free trial.)
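To make the difference between the two definitions concrete, here is a minimal sketch. It is in Python rather than Perl, purely as an illustration, and the names char_bigrams and is_two_letter_word are hypothetical helpers of mine, not anything from your code or from Lingua::EN::Bigram. It assumes the "elements" in the Wikipedia reading are single characters.

```python
from collections import Counter

def char_bigrams(token):
    """Every pair of adjacent characters in token (the n-gram, n=2 reading)."""
    return [token[i:i + 2] for i in range(len(token) - 1)]

def is_two_letter_word(token):
    """The 'two-letter word' reading: the whole token is exactly two letters."""
    return len(token) == 2 and token.isalpha()

# First three words of the test file, used only as a small example.
words = ["aabaaa", "cc", "nobigram"]

# Definition 1: count every adjacent character pair across all words.
counts = Counter(bg for w in words for bg in char_bigrams(w))

# Definition 2: keep only the words that are themselves two letters long.
two_letter = [w for w in words if is_two_letter_word(w)]

print(counts["aa"])   # "aa" occurs three times, all inside "aabaaa"
print(two_letter)     # only "cc" qualifies
```

Under the first reading, "nobigram" contributes seven bigrams; under the second it contributes none. Either way, a flat " : 1" for every entry is not what a counter should report on this data.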
For clarity, here is the content (verbatim) of the text file:
aabaaa
cc
nobigram
abcda
bcdaaaa
nobigram
bbhbbbb
bbb
aba
aaaabbb
cccccccccc
cbbbaaaccc
ccbbaaccc
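For what it's worth, the output shown earlier is exactly what you get from gluing each word to the word after it and counting the results. The sketch below is my guess at the effective behavior, not a translation of your code: it is Python rather than Perl, adjacent_pairs is a hypothetical name, and it assumes the words are split on whitespace.

```python
from collections import Counter

def adjacent_pairs(text):
    """Concatenate each whitespace-separated token with its successor and count."""
    tokens = text.split()
    return Counter(a + b for a, b in zip(tokens, tokens[1:]))

# The test file content, with words separated by two newlines.
sample = "\n\n".join([
    "aabaaa", "cc", "nobigram", "abcda", "bcdaaaa", "nobigram", "bbhbbbb",
    "bbb", "aba", "aaaabbb", "cccccccccc", "cbbbaaaccc", "ccbbaaccc",
])

pairs = adjacent_pairs(sample)
for pair, n in pairs.items():
    print(f"{pair} : {n}")
```

Thirteen words give twelve pairs, and since no concatenation repeats, every count is 1. That matches the listing above, which suggests the code is counting word pairs rather than bigrams in either sense.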