Do you know where your variables are? PerlMonks

### How to count the vocabulary of an author?

by karlgoethebier (Abbot)
on Jun 11, 2021 at 11:12 UTC

karlgoethebier has asked for the wisdom of the Perl Monks concerning the following question:

I have no serious idea at the moment and have done nothing so far. The background is that Kurt Schumacher claimed that Goethe had a vocabulary of about 29,000 words, while Adenauer only had a vocabulary of about 500 words.

Update: Thanks to all for the kind and inspiring replies. I guess Lingua::Stem is the way to go. I'll open another thread about tokenizing.

«The Crux of the Biscuit is the Apostrophe»

Re: How to count the vocabulary of an author?
by choroba (Archbishop) on Jun 11, 2021 at 11:14 UTC
You need a stemmer for the given language and corpus of texts by the author. Read the texts, tokenise them into words, stem each word and store the stem in a hash. At the end, count the number of keys in the hash.
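
The recipe above (tokenise, stem, store the stems in a hash, count the keys) could be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_stem` and `vocabulary_size` are hypothetical names, and `toy_stem` is a crude stand-in that would have to be replaced by a real stemmer for the language in question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for a real stemmer: lowercases the word and
# strips one trailing "s" from words longer than three letters.
sub toy_stem {
    my ($word) = @_;
    my $stem = lc $word;
    $stem =~ s/s\z// if length $stem > 3;
    return $stem;
}

# Tokenise a text, stem every token and count the distinct stems.
sub vocabulary_size {
    my ($text, $stemmer) = @_;
    my %seen;
    for my $word ($text =~ /\p{L}+/g) {    # a run of letters is a word
        $seen{ $stemmer->($word) } = 1;
    }
    return scalar keys %seen;
}

my $text = 'The cat sees the cats. The dogs see a dog.';
print vocabulary_size($text, \&toy_stem), "\n";    # prints 5
```

Passing the stemmer as a code reference keeps the counting logic independent of whichever stemming module ends up being used.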

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

You make it sound so easy choroba :)

Having done a few (simpler) things with language, I guess that finding the stem of each word is the trickiest part.

Well, I have a PhD in mathematical linguistics. Stemming was done in the first year ;-)

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

There are look-up tables for that.

And even if they didn't exist you can derive most stems by statistical analysis, at least with the Indo-European languages I know.°

Good enough for a word count.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

°) because they have in most cases a fixed stem. I suppose Finnish to be much harder...
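
A look-up table of the kind mentioned above might be used along these lines. The table here is a hypothetical miniature made up for illustration (real tables contain many thousands of entries), and `lemma` is an assumed helper name.

```perl
use strict;
use warnings;

# Hypothetical miniature look-up table: inflected form => base form.
my %lemma_of = (
    geht => 'gehen', ging => 'gehen', gegangen => 'gehen',
    ist  => 'sein',  war  => 'sein',  bin      => 'sein',
);

# Return the base form of a word; unknown words fall back to
# their lowercased form.
sub lemma {
    my ($word) = @_;
    my $key = lc $word;
    return $lemma_of{$key} // $key;
}

my %vocabulary;
$vocabulary{ lemma($_) }++ for qw(Ich bin ging geht war ist Haus);

print join(', ', sort keys %vocabulary), "\n";    # gehen, haus, ich, sein
```

Unlike suffix stripping, a table also copes with irregular forms such as "ging" or "war", at the price of having to obtain and maintain the table.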

Re: How to count the vocabulary of an author?
by cavac (Curate) on Jun 11, 2021 at 11:33 UTC

It's science: there is a scientific paper about word stemming in the German language. After looking at the available software, the authors developed their own stemmer, which is on GitHub. It looks like it supports different programming languages, including Perl.

Edit: If you include the huge vocabulary of curse words he must have known (due to him being a German and losing two world wars), I'm pretty sure he knew a lot more than 500 words.

perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'

Yeah, of course. How could I forget?

Really, it hasn't been the same since my butler Jeeves retired and grandma Yahoo went to prison for financial fraud.

perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'

This is sort of an amendment, because it struck me that word counting is much harder than it looks. Let's take a look at the text "little boy" in three different contexts:

First: Let's consider a 10-year-old boy living in Rhode Island. He knows the meaning of the words "little" and "boy", and he heard of a bomb named "Little Boy" in school. It made the U.S. win some war a long time ago, and every year there is a celebration. So, three words?

Second: A 10-year-old British girl. She certainly knows the words "little" and "boy", but she has never heard of the things that happened in Japan in 1945. Those things are taught at a later age. So, two words?

Third: A 10-year-old girl in Japan. She doesn't know a single word of English. Neither "little" nor "boy" has any meaning to her. But every time she walks to school, she walks past the Genbaku Dome. She asked her parents about it, and now she knows that an awful and terrifying machine named "Little Boy" killed her great-grandparents and destroyed her city. So, one word?

perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'

"because it struck me that word counting is much harder than it looks"

There is another dimension to what constitutes a word and how words can be differentiated programmatically. They require context, which takes us to a totally different level of complexity.

Let's take the seemingly simple word "post".

When you read the word you might think of what someone puts on social media, or perhaps the mail that gets delivered to your door. But equally you may imagine the pole in the ground that keeps your fence upright. Perhaps you have been given a new post at work as your role has changed due to the company being able to post a good profit. Of course, at one time we didn't need to worry - but post the advent of computing, we do!

"Edit: If you include the huge vocabulary of curse words he must have known (due to him being a German and losing two world wars), I'm pretty sure he knew a lot more than 500 words."

That's a very stupid comment.

Why? Those "official" counts only count the written word. We often use a different vocabulary when communicating verbally. Some authors intentionally limit their written vocabulary to make their works accessible to a broader public.

Cursing and other emotional expressions can often emphasize specific meanings of things said. The written equivalent in modern internet terms would be emoticons. Insofar as the Unicode consortium is concerned, these count as written expressions that are meaningful to the context of the discussion.

It also depends on the cultural context and the circumstances as to when and what forms of cursing are acceptable or sometimes even required in a conversation. English speakers, and especially people in the United States, are much more prudish compared to some other cultures. There are many groups out there that regard you as an outsider if you don't talk to them in a very frank and curse-word-ridden way. So, if you want to be accepted as an equal (for example, because you need them on your project), you had better learn their way of communicating: their shorthands, their curses, what have you.

It's a bit like driving a vehicle. You have areas in the world where everything is very regulated and everyone keeps to the rules. Then you have the seemingly chaotic i-honked-first-so-i-go-first way it works in other areas of the world. If you go there and rent a car, you had better learn their ways and honk that horn.

It's the same way with politicians. They may have voters from different cultural regions and contexts. So politicians had better learn all those different ways their voters communicate and adapt when visiting a region. "I am one of you" is a big vote seller. But this might not be reflected in a politician's writing and official speeches.

As for Adenauer, a lot of his potential voters were soldiers and people from many different cultural groups. I'm pretty sure he took the time to adapt his vocabulary when meeting with local groups. But as I said, this might not be reflected in his publications, as they were meant for a much broader public.

Edit: Also take time to watch the Tom Scott video "What counts as a word?"

perl -e 'use Crypt::Digest::SHA256 qw[sha256_hex]; print substr(sha256_hex("the Answer To Life, The Universe And Everything"), 6, 2), "\n";'
Re: How to count the vocabulary of an author?
by bliako (Prior) on Jun 11, 2021 at 20:32 UTC

There are stemming packages (basically chopping off letters from the end of a word in order to arrive at a base form) and tagging packages (finding out which part of speech a word is, e.g. a verb) on CPAN, specific to different languages, e.g. Lingua::*.

Then ask uncle NSA and aunty CIA for the corpus; they keep meticulous records of all major European politicos' conversations.
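
To make the "chopping off letters" description of stemming concrete, here is a deliberately naive sketch. The suffix list and the name `chop_suffix` are assumptions made for illustration only; real stemmers apply many more rules and conditions than this.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny English suffix list, longest first,
# so that "edly" is tried before "ed" or "ly".
my @suffixes = sort { length $b <=> length $a } qw(ing edly ed ly es s);

# Strip at most one suffix, keeping a stem of at least three characters.
sub chop_suffix {
    my ($word) = @_;
    my $stem = lc $word;
    for my $suffix (@suffixes) {
        return $1 if $stem =~ /\A(\w{3,})\Q$suffix\E\z/;
    }
    return $stem;
}

printf "%-8s -> %s\n", $_, chop_suffix($_)
    for qw(walking walked walks walk quickly);
```

All four forms of "walk" end up as the same stem, which is what matters for a vocabulary count; whether the stem is a real word ("quick") or not is secondary.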

Re: How to count the vocabulary of an author?
by perlfan (Vicar) on Jun 12, 2021 at 15:30 UTC
What are you attempting to do? The book Practical Text Mining with Perl is very much worth the money. As far as objects go, finding the set of "uniq" words is straightforward in Perl. What's more interesting, and more applicable to comparing texts, requires less-trivial-to-implement preprocessors such as Lingua::EN::Ngram or Lingua::EN::Tagger.
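
As a rough illustration of both points, the sketch below finds the unique words with the usual `%seen` idiom and then counts word bigrams by hand. The bigram loop is a hand-rolled stand-in and does not use the Lingua::EN::Ngram API; the sample sentence is made up.

```perl
use strict;
use warnings;

my $text  = 'to be or not to be that is the question';
my @words = split ' ', lc $text;

# Unique words: keep only the first occurrence of each word.
my %seen;
my @unique = grep { !$seen{$_}++ } @words;
print scalar(@unique), " distinct words: @unique\n";

# Word bigrams: every pair of adjacent words, counted in a hash.
my %bigram_count;
$bigram_count{"$words[$_] $words[$_ + 1]"}++ for 0 .. $#words - 1;

# Show only the bigrams that occur more than once.
print "$_: $bigram_count{$_}\n"
    for sort grep { $bigram_count{$_} > 1 } keys %bigram_count;
```

The unique-word count ignores word order entirely, while the bigram counts begin to capture phrasing, which is what makes them more useful for comparing texts.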
Re: How to count the vocabulary of an author?
by karlgoethebier (Abbot) on Jun 14, 2021 at 16:34 UTC

Update 2:

Here is a first simple solution, in case anybody needs a starting point:

```
#!/usr/bin/env perl

use strict;
use warnings;
use feature qw(say);
use Data::Dump;
use Lingua::Stem qw(stem);

# Slurp the whole DATA section at once.
my $text = do { local $/; <DATA> };

say $text;

$text = lc $text;

# Join the lines into one long string.
$text =~ s/\n+/ /g;

say $text;

# Strip punctuation.
$text =~ s/[:;'!?.,]+//g;

say $text;

# split ' ' ignores leading whitespace and collapses runs of it,
# so no empty "words" end up in the list.
my @words = split ' ', $text;

dd \@words;

Lingua::Stem::set_locale('de');

say Lingua::Stem::get_locale;

# stem() returns a reference to an array of stems.
my $stems = stem(@words);

dd $stems;

# Using the stems as hash keys removes the duplicates.
my %vocabulary = map { $_ => 1 } @$stems;

dd \%vocabulary;

say scalar keys %vocabulary;

__DATA__
Ich Bin Der Geist, Der Stets Verneint!
Und Das Mit Recht; denn alles, was entsteht,
Ist wert, daß es zugrunde geht;
Drum besser wär's, daß nichts entstünde.
So ist denn alles, was ihr Sünde,
Zerstörung, kurz, das Böse nennt,
Mein eigentliches Element.
```

It isn't as easy as one might think: simply counting the words with wc doesn't return the vocabulary. And Lingua::Stem thinks that Ist and ist are different stems, for example. And how do you filter the real text out of sources that contain a preface, an index and so on?

Some may ask why I waste my time on this issue. It has to do with politics. As this isn't a forum about politics, I'll skip the details.

I was a little bit inspired by what Jill Lepore wrote, analogously, about facts in her splendid book These Truths: A History of the United States: "Show me yours and I'll show you mine." Basically the same game that we played with our cousins when we were nasty little boys. Discussion later.

«The Crux of the Biscuit is the Apostrophe»

Ok, I'll play 'nasty little boy' too (I remember!)

Of course I had to try the stemming that is built into PostgreSQL's full-text search (FTS). I hadn't used it for a while, so this is just playing with it. Below are the results of stemming and the distinction between words and stop-words.

I think this FTS stuff uses Snowball, and I don't know how recent the vocabulary is. (UPDATE: I see regular Snowball-related updates (every few months) in the PostgreSQL git log, so I now think its Snowball stuff is reasonably up to date.)

```
-- Below are three chunks/resultsets:

-- 1. The input text
-- 2. Real words:
--    select .. from ts_debug('german', '$yourtxt')
--    where lexemes > 0
-- 3. Stop-words:
--    select .. from ts_debug('german', '$yourtxt')
--    where lexemes = 0

                     txt
----------------------------------------------
 Ich Bin Der Geist, Der Stets Verneint!      +
 Und Das Mit Recht; denn alles, was entsteht,+
 Ist wert, daß es zugrunde geht;             +
 Drum besser wär's, daß nichts entstünde.    +
 So ist denn alles, was ihr Sünde,           +
 Zerstörung, kurz, das Böse nennt,           +
 Mein eigentliches Element.
(1 row)

   alias   |    token     | dictionary  |  lexemes
-----------+--------------+-------------+------------
 asciiword | Geist        | german_stem | {geist}
 asciiword | Stets        | german_stem | {stet}
 asciiword | Verneint     | german_stem | {verneint}
 asciiword | Recht        | german_stem | {recht}
 asciiword | entsteht     | german_stem | {entsteht}
 asciiword | wert         | german_stem | {wert}
 asciiword | zugrunde     | german_stem | {zugrund}
 asciiword | geht         | german_stem | {geht}
 asciiword | Drum         | german_stem | {drum}
 asciiword | besser       | german_stem | {bess}
 word      | wär          | german_stem | {war}
 asciiword | s            | german_stem | {s}
 word      | entstünde    | german_stem | {entstund}
 word      | Sünde        | german_stem | {sund}
 word      | Zerstörung   | german_stem | {zerstor}
 asciiword | kurz         | german_stem | {kurz}
 word      | Böse         | german_stem | {bos}
 asciiword | nennt        | german_stem | {nennt}
 asciiword | eigentliches | german_stem | {eigent}
 asciiword | Element      | german_stem | {element}
(20 rows)

   alias   | token  | dictionary  | lexemes
-----------+--------+-------------+---------
 asciiword | Ich    | german_stem | {}
 asciiword | Bin    | german_stem | {}
 asciiword | Der    | german_stem | {}
 asciiword | Der    | german_stem | {}
 asciiword | Und    | german_stem | {}
 asciiword | Das    | german_stem | {}
 asciiword | Mit    | german_stem | {}
 asciiword | denn   | german_stem | {}
 asciiword | alles  | german_stem | {}
 asciiword | was    | german_stem | {}
 asciiword | Ist    | german_stem | {}
 word      | daß    | german_stem | {}
 asciiword | es     | german_stem | {}
 word      | daß    | german_stem | {}
 asciiword | nichts | german_stem | {}
 asciiword | So     | german_stem | {}
 asciiword | ist    | german_stem | {}
 asciiword | denn   | german_stem | {}
 asciiword | alles  | german_stem | {}
 asciiword | was    | german_stem | {}
 asciiword | ihr    | german_stem | {}
 asciiword | das    | german_stem | {}
 asciiword | Mein   | german_stem | {}
(23 rows)
```

Not perfect but more useful than I thought it would be without any work.
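
For comparison, the same split into content words and stop-words can be approximated in Perl with a hash look-up. The stop-word list below is a hypothetical handful of entries chosen for this example; the list behind PostgreSQL's german configuration is much longer.

```perl
use strict;
use warnings;

# Hypothetical, tiny stop-word list for illustration only.
my %is_stop_word = map { $_ => 1 } qw(ich bin der und das mit ist es so);

my @tokens  = qw(Ich Bin Der Geist Der Stets Verneint Und Das Mit Recht);

# Keep only the tokens whose lowercased form is not a stop-word.
my @content = grep { !$is_stop_word{ lc $_ } } @tokens;

print "content words: @content\n";
print "stop-words dropped: ", @tokens - @content, "\n";
```

Whether stop-words should count towards an author's vocabulary is a separate decision; the filter only makes the distinction visible.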

Very cool! Thanks!

«The Crux of the Biscuit is the Apostrophe»

Node Type: perlquestion [id://11133775]
Approved by cavac