http://www.perlmonks.org?node_id=11133859


in reply to Re: How to count the vocabulary of an author?
in thread How to count the vocabulary of an author?

Ok, I'll play 'nasty little boy' too (I remember!)

Of course I had to try the stemming that is built into PostgreSQL's full-text search (FTS). I hadn't used it for a while, so this is just playing with it. Below are the results of stemming, split into words and stop-words.

I think this FTS stuff uses Snowball, and I don't know how recent the vocabulary is. (UPDATE: I see regular Snowball-related updates (every few months) in the PostgreSQL git log, so I now think its Snowball stuff is reasonably up-to-date.)

-- Below are three chunks/resultsets:
--   1. Your text
--   2. Real words:
--        select .. from ts_debug('german', '$yourtxt')
--        where lexemes > 0
--   3. Stop-words:
--        select .. from ts_debug('german', '$yourtxt')
--        where lexemes = 0

                     txt
----------------------------------------------
 Ich Bin Der Geist, Der Stets Verneint!      +
 Und Das Mit Recht; denn alles, was entsteht,+
 Ist wert, daß es zugrunde geht;             +
 Drum besser wär's, daß nichts entstünde.    +
 So ist denn alles, was ihr Sünde,           +
 Zerstörung, kurz, das Böse nennt,           +
 Mein eigentliches Element.
(1 row)

   alias   |    token     | dictionary  |  lexemes
-----------+--------------+-------------+------------
 asciiword | Geist        | german_stem | {geist}
 asciiword | Stets        | german_stem | {stet}
 asciiword | Verneint     | german_stem | {verneint}
 asciiword | Recht        | german_stem | {recht}
 asciiword | entsteht     | german_stem | {entsteht}
 asciiword | wert         | german_stem | {wert}
 asciiword | zugrunde     | german_stem | {zugrund}
 asciiword | geht         | german_stem | {geht}
 asciiword | Drum         | german_stem | {drum}
 asciiword | besser       | german_stem | {bess}
 word      | wär          | german_stem | {war}
 asciiword | s            | german_stem | {s}
 word      | entstünde    | german_stem | {entstund}
 word      | Sünde        | german_stem | {sund}
 word      | Zerstörung   | german_stem | {zerstor}
 asciiword | kurz         | german_stem | {kurz}
 word      | Böse         | german_stem | {bos}
 asciiword | nennt        | german_stem | {nennt}
 asciiword | eigentliches | german_stem | {eigent}
 asciiword | Element      | german_stem | {element}
(20 rows)

   alias   | token  | dictionary  | lexemes
-----------+--------+-------------+---------
 asciiword | Ich    | german_stem | {}
 asciiword | Bin    | german_stem | {}
 asciiword | Der    | german_stem | {}
 asciiword | Der    | german_stem | {}
 asciiword | Und    | german_stem | {}
 asciiword | Das    | german_stem | {}
 asciiword | Mit    | german_stem | {}
 asciiword | denn   | german_stem | {}
 asciiword | alles  | german_stem | {}
 asciiword | was    | german_stem | {}
 asciiword | Ist    | german_stem | {}
 word      | daß    | german_stem | {}
 asciiword | es     | german_stem | {}
 word      | daß    | german_stem | {}
 asciiword | nichts | german_stem | {}
 asciiword | So     | german_stem | {}
 asciiword | ist    | german_stem | {}
 asciiword | denn   | german_stem | {}
 asciiword | alles  | german_stem | {}
 asciiword | was    | german_stem | {}
 asciiword | ihr    | german_stem | {}
 asciiword | das    | german_stem | {}
 asciiword | Mein   | german_stem | {}
(23 rows)

Not perfect but more useful than I thought it would be without any work.
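For the original question (counting a vocabulary), the stemmed rows still need to be collapsed into distinct stems. Here is a minimal sketch of that last step in Python, working on (token, lexemes) pairs copied by hand from the output above rather than fetched from the database. The names `ROWS` and `count_vocabulary` are made up for illustration, and only a few of the stop-word rows are included to show that they drop out.

```python
from collections import Counter

# (token, lexemes) pairs copied from the ts_debug output above.
# An empty list stands for the empty array {} that marks a stop word.
# All 20 "real word" rows are here; only a few stop-word rows are shown.
ROWS = [
    ("Ich", []), ("Bin", []), ("Der", []),
    ("Geist", ["geist"]), ("Stets", ["stet"]), ("Verneint", ["verneint"]),
    ("Und", []), ("Das", []), ("Mit", []),
    ("Recht", ["recht"]), ("entsteht", ["entsteht"]), ("wert", ["wert"]),
    ("zugrunde", ["zugrund"]), ("geht", ["geht"]), ("Drum", ["drum"]),
    ("besser", ["bess"]), ("wär", ["war"]), ("s", ["s"]),
    ("entstünde", ["entstund"]), ("Sünde", ["sund"]),
    ("Zerstörung", ["zerstor"]), ("kurz", ["kurz"]), ("Böse", ["bos"]),
    ("nennt", ["nennt"]), ("eigentliches", ["eigent"]),
    ("Element", ["element"]),
]


def count_vocabulary(rows):
    """Count how often each stem occurs; stop words contribute nothing."""
    stems = Counter()
    for _token, lexemes in rows:
        stems.update(lexemes)  # an empty list adds no entries
    return stems


vocab = count_vocabulary(ROWS)
print(len(vocab), "distinct stems from", sum(vocab.values()), "stemmed tokens")
```

On this sample every stem occurs exactly once, so the vocabulary size equals the number of non-stop tokens. Note that 'entsteht' and 'entstünde' stay separate stems ({entsteht} vs {entstund}), which is part of the "not perfect": a stemmer is not a lemmatizer, so a count like this will somewhat overstate the vocabulary.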