PerlMonks  

Splitting text into syllables

by crenz (Priest)
on May 14, 2003 at 01:25 UTC (#257945, perlquestion)

crenz has asked for the wisdom of the Perl Monks concerning the following question:

I hope I'll be forgiven this slightly OT question... it is more related to Linguistics than Perl, actually.

I am looking for a good solution to split up German text into its syllables. I found TeX::Hyphen, which can deal with German texts, but its documentation doesn't mention the new German spelling. Does that module seem to give reasonable results, or are there better solutions out there?
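For anyone following along: a hyphenator gives you break positions, not syllables, so a small helper is needed to turn the positions into chunks. The sketch below assumes the positions are character offsets counted from the start of the word (as far as I know, that is what TeX::Hyphen's hyphenate method returns). The name split_at_points and the hand-picked offsets are made up for illustration.

```perl
use strict;
use warnings;

# Split a word into chunks at the given character offsets.
# Each offset is the number of characters that precede a break.
sub split_at_points {
    my ($word, @points) = @_;
    my @chunks;
    my $start = 0;
    for my $p (sort { $a <=> $b } @points) {
        next if $p <= $start || $p >= length $word;   # ignore unusable offsets
        push @chunks, substr($word, $start, $p - $start);
        $start = $p;
    }
    push @chunks, substr($word, $start);              # remainder of the word
    return @chunks;
}

# Offsets chosen by hand here; with TeX::Hyphen they would presumably
# come from something like: my @points = $hyp->hyphenate($word);
print join('-', split_at_points('Silbentrennung', 3, 6, 10)), "\n";
```

Keep in mind that hyphenation points are not always syllable boundaries: TeX-style hyphenators usually suppress breaks close to the start and end of a word, so short leading or trailing syllables may stay glued to their neighbours.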

Also, I'm trying to attach a "length" to these syllables (e.g. "Maß" is longer than "Hass"). I'm not a linguist, so I just thought I might try to look for certain patterns and attach weights to them. E.g. a syllable with "ie" will be counted "long", a syllable with "ss" will be counted short (using the new German spelling). Does anybody know a comprehensive list of German syllables with such "weights", or a better approach for this problem? (The "weights" don't need to be 100% accurate; I want to use them for music generation.)
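The pattern-and-weight idea could be sketched roughly like this. The rule list is only a guess to show the mechanics (ordered patterns, first match wins); it is not a vetted linguistic table, and syllable_weight is a made-up name.

```perl
use strict;
use warnings;
use utf8;                          # source contains literal ß and umlauts
use feature 'unicode_strings';     # make lc() treat Ä, Ö, Ü as letters

# Ordered rules: the first pattern that matches decides the weight.
# The patterns are illustrative guesses, not a vetted linguistic list.
my @RULES = (
    [ qr/ie|aa|ee|oo|[aeiouäöü]h/,   'long'  ],  # ie, doubled vowel, vowel + h
    [ qr/[aeiouäöü]ß/,               'long'  ],  # new spelling: ß follows a long vowel
    [ qr/ss|ck|tz|([bdfglmnprt])\1/, 'short' ],  # ss, ck, tz, doubled consonant
);

sub syllable_weight {
    my ($syllable) = @_;
    my $s = lc $syllable;          # lc leaves ß alone, so no /i is needed
    for my $rule (@RULES) {
        my ($pattern, $weight) = @$rule;
        return $weight if $s =~ $pattern;
    }
    return 'unknown';              # no rule applies, e.g. "Tag"
}

binmode STDOUT, ':encoding(UTF-8)';
printf "%-5s %s\n", $_, syllable_weight($_) for qw(Maß Hass Lie Bett Tag);
```

The patterns deliberately avoid the /i flag: under Perl's Unicode case folding, a case-insensitive ß can match "ss", which would defeat the whole ß/ss distinction. Syllables like "Tag" (long vowel, no spelling hint) come back as 'unknown', which is where a dictionary would have to take over.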

Replies are listed 'Best First'.
Re: Splitting text into syllables
by halley (Prior) on May 14, 2003 at 03:04 UTC

    First place to look is Lingua::DE::, and if you end up writing something, submit it to that namespace. Lingua::EN::* is growing nicely, but there are a few other languages starting to get some love.

    --
    [ e d @ h a l l e y . c c ]

      I did research the Lingua::DE:: namespace, but there seemed to be nothing relevant. (Not even in other languages' namespaces, for that matter.)

Re: Splitting text into syllables
by benn (Vicar) on May 14, 2003 at 10:08 UTC
    Lingua::EN::Syllable has some interesting (and not very large) code to estimate the number of syllables in a word or passage, but its author only claims 85-90% accuracy. Maybe that would be a good starting point for your splitting algorithm, but I would imagine this is harder in German than in English. I suspect there are so many exceptions in a modern language that you'll need a dictionary lookup at some point.
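To give a feel for why such estimators top out below 100%, here is a deliberately simple one: count runs of vowels and correct for a silent trailing "e". This is NOT the algorithm used by Lingua::EN::Syllable, just a minimal sketch of the general idea, and estimate_syllables is a made-up name.

```perl
use strict;
use warnings;

# Rough English syllable estimate: one syllable per run of vowels,
# minus one for a silent trailing "e" (but not for "-le" endings).
sub estimate_syllables {
    my ($word) = @_;
    my $w = lc $word;
    $w =~ s/[^a-z]//g;                        # keep plain letters only
    return 0 unless length $w;
    my $count = () = $w =~ /[aeiouy]+/g;      # each vowel run counts once
    $count-- if $count > 1 && $w =~ /[^aeiouy]e$/ && $w !~ /le$/;
    return $count < 1 ? 1 : $count;
}

print "$_: ", estimate_syllables($_), "\n" for qw(syllable code estimate perl);
```

Words like "whale" or "rhythm" already break these two rules, which illustrates the point about exceptions: every added rule fixes some words and mis-counts others, so a dictionary fallback becomes attractive fairly quickly.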

    It occurs to me that 'real' dictionaries usually have a phonetic spelling part, using the International Phonetic Alphabet (with 'metacharacters' showing glottal stops, syllabic parts etc.) - if you could grab a German dictionary, maybe you could write a routine to split this - probably a lot simpler than trying to parse the 'proper' spelling.
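A routine for splitting such a transcription might look like the sketch below. It assumes the dictionary marks syllable breaks with "." and writes the IPA stress marks ˈ and ˌ directly before the stressed syllable; real dictionaries vary, so treat the format, the sample entry, and the name split_ipa as assumptions. The IPA length mark ː is picked up as well, since it speaks directly to the "long" versus "short" question.

```perl
use strict;
use warnings;
use utf8;                                     # source contains literal IPA characters

# Split an IPA transcription into syllables. Assumes "." marks syllable
# breaks and that ˈ (primary) or ˌ (secondary) precedes a stressed syllable.
sub split_ipa {
    my ($ipa) = @_;
    $ipa =~ s{^[/\[]|[/\]]$}{}g;              # drop enclosing slashes or brackets
    my @syllables;
    for my $part (split /\.|(?=[ˈˌ])/, $ipa) {
        next unless length $part;             # skip empty fields between marks
        my $stress = $part =~ s/^ˈ// ? 'primary'
                   : $part =~ s/^ˌ// ? 'secondary'
                   :                   'none';
        push @syllables, {
            text   => $part,
            stress => $stress,
            long   => ($part =~ /ː/ ? 1 : 0), # ː is the IPA length mark
        };
    }
    return @syllables;
}

# Hypothetical dictionary entry for "Straße"
binmode STDOUT, ':encoding(UTF-8)';
printf "%-6s stress=%-9s long=%d\n", @{$_}{qw(text stress long)}
    for split_ipa('/ˈʃtʁaː.sə/');
```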

    Cheers,
    Ben.

    PS I'd be interested to see it if you did, as this would then presumably work with all languages - Lingua::International::SyllableSplit, maybe :) . Ben.

Re: Splitting text into syllables
by Bod (Curate) on Jan 25, 2022 at 22:43 UTC
    I am looking for a good solution to split up German text into its syllables

    Within my company, we produce a lot of marketing content and aim to constantly improve the quality of this copy. So I insist on a Flesch-Kincaid Grade Score of no more than 7.5. But the problem is getting a reliable and consistent Grade Score. We use The Hemingway App. But I wanted a solution attached to our content creation platforms, which are written in Perl. So I started using Lingua::EN::Fathom, which uses Lingua::EN::Syllable.

    The first thing I noticed was that Lingua::EN::Fathom and The Hemingway App disagree on the Grade Score.

    But, it is helpful to have a browser-side real-time calculation of the Grade Score. Not to have to keep sending AJAX requests back to a Perl script on the server. So I searched and found a Javascript solution. It works...but is even further out on its calculation of the Grade Score.

    After some investigation, I traced the discrepancies to the way that these three methods calculate the syllable count...they all do it very differently!

    So I will probably end up writing my own Grade Score calculator that uses the same method of calculation in both Perl and Javascript. It doesn't matter too much how accurately it reflects other tools. What is more important is that the two agree on any given piece of text. Then we can adjust the company rule on Grade Score to reflect what the tools are saying. But this has moved down the priority list as we have bought a subscription to Grammarly which is doing a good job of improving the quality and consistency of our written content.
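A calculator along those lines could be sketched as follows. The formula is the standard Flesch-Kincaid Grade Level; everything else (the name fk_grade, the naive sentence and word splitting, the crude syllable counter) is an assumption made for illustration. The syllable counter is passed in as a code reference so that exactly the same counting rule can be re-implemented on the JavaScript side.

```perl
use strict;
use warnings;

# Flesch-Kincaid Grade Level:
#   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
# The syllable counter is a parameter, so both ends can share one rule.
sub fk_grade {
    my ($text, $count_syllables) = @_;
    my @sentences = grep { /\w/ } split /[.!?]+/, $text;   # naive sentence split
    my @words     = $text =~ /[A-Za-z']+/g;                # naive word split
    return unless @sentences && @words;
    my $syllables = 0;
    $syllables += $count_syllables->($_) for @words;
    return 0.39 * (@words / @sentences)
         + 11.8 * ($syllables / @words)
         - 15.59;
}

# Deliberately crude counter (hypothetical): one syllable per vowel run.
my $vowel_runs = sub { my $n = () = lc(shift) =~ /[aeiouy]+/g; $n || 1 };

printf "%.2f\n", fk_grade('The cat sat on the mat. It was happy.', $vowel_runs);
```

Since the syllable term is multiplied by 11.8, small differences in syllable counting shift the grade noticeably, which fits the observation that the tools disagree mainly because they count syllables differently.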

      OT: using javascript makes the user's computer calculate the score, whereas using Perl (at the backend) makes the backend calculate it, and website owner pays for it. I am glad to see someone placing quality above cost.

      edit: of course js can be used at the backend just like anything else, i just assumed it is browser-running js.

        i just assumed it is browser-running js

        Yes, I did mean browser Javascript

        The main rationale for trying to use a JS solution is time. When typing, there is a (sometimes significant) time lag between what is typed and the displayed Grade Score due to the AJAX calls. With JS running in the browser, the delay is negligible. A decreased load on the server and network are secondary, but very real, benefits.

Re: Splitting text into syllables
by agentv (Friar) on May 14, 2003 at 12:41 UTC
    crenz says: a syllable with "ie" will be counted "long", a syllable with "ss" will be counted short (using the new German spelling).

    ...it's also OT a bit, but could you say more about "the new German spelling"? I haven't heard of that before (not being a student of German) and I'm curious whether it simply means modern (i.e. from sometime in the 20th century), or whether it's a very recent change (as in something from the last 25 years).

    I have to agree with the conclusion that you might be best off using a dictionary that provides phonetic spelling for the words that concern you. Those are typically hyphenated anyway, and the accent marks may lend other useful information to your system.

    In fact, the access to emphasis information may also be useful in other pursuits, but certainly if you're trying to create something that can generate reasonable lyrics or poetry, you may want to include meter in your calculations.

    ...All the world looks like -well- all the world, when your hammer is Perl.
    ---v

      I hope that crenz forgives me for jumping in, but I was involved in German dictionary typesetting around the time of the spelling reform. I'm also a TeX fiend, so I know my way around hyphenation.

      Sometime in the mid-to-late 1990s, Germany decided to simplify its spelling and get rid of some of the weirder perceived idiosyncrasies. The official change came in August 1998, according to this informative article from german.about.com.

      One of the most visible changes was cutting down on the use of the good old "sharp S" symbol, ß. No longer will so many foreigners think that the German for street is pronounced "strabe".

      Several compound words were also split up into their component words. For this, the typesetters of the world thank you, for setting German in a narrow measure was always a challenge.

      The change (I think; the input of a native German speaker would be appreciated) in hyphenation was interesting. One example is the ck formation would hyphenate to k-k, so the actual spelling of the word used to change.
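To make the ck example concrete: under the old rules "Zucker" was hyphenated "Zuk-ker", and as far as I know the reformed rules keep ck together and break before it ("Zu-cker"). A minimal sketch of both behaviours, covering only ck between vowels (hyphenate_ck is a made-up name, not part of any module):

```perl
use strict;
use warnings;
use utf8;                                     # source contains literal umlauts

# Mark the hyphenation point at "ck" between vowels.
#   old spelling: ck becomes k-k (the spelling itself changes)
#   new spelling: ck stays together and moves to the next syllable
sub hyphenate_ck {
    my ($word, $spelling) = @_;
    my $vowel = qr/[aeiouäöüAEIOUÄÖÜ]/;
    if ($spelling eq 'old') {
        $word =~ s/($vowel)ck(?=$vowel)/$1k-k/g;
    }
    else {
        $word =~ s/($vowel)ck(?=$vowel)/$1-ck/g;
    }
    return $word;
}

print hyphenate_ck('Zucker', 'old'), "\n";
print hyphenate_ck('Zucker', 'new'), "\n";
```

This only marks the one break the rule is about; a real hyphenator would of course find the other break points in a word too.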

      I'd be very surprised if there weren't new TeX hyphenation dictionaries for German. TeX has a very large following in Germany. If it's not on CTAN, I'd be amazed.

      Oh, and before people start corresponding with me in German, I don't have any. I might know how to typeset the language, I can sort-of read it, but replying is waaaay beyond me...

      --
      bowling trophy thieves, die!

        The ß versus ss rule is actually my favourite new rule :). They sound the same, but the old rule used to be quite arbitrary. Actually, I think there was none -- you just had to learn by heart which word uses which spelling. But now, there is a clear rule for their use: in layman's (i.e. non-linguist's) terms, ss is written after a short vowel, and ß is written after a long vowel. For an example of what I mean by "short" and "long" vowels, consider the ee/i in "deed" and "did". I find this rule really easy to use, and I like it because it eliminates a few exceptions to the rule that in German, things pronounced the same way are written the same way. (Compare that to English! *sigh*)

        Some people don't get it and complain that they should have abolished ß altogether. I don't agree. For example, we write "Masse" (mass) and "Maße" (dimensions). Without ß, there would be no way to differentiate.

        Apart from that rule, there have been a number of very good, simplified new rules, and a number of very bad new rules. Most people have accepted the reform by now, but still have mixed feelings about it -- including me. I still feel the benefits outweigh the disadvantages, though.

        Regarding TeX, there is a dictionary for the new spelling called "ngerman". Almost all German TeX users probably use it by now. I just don't know whether it will work with the module mentioned.

        Wow. That was a perfectly proportional response to the question. I don't see how anybody could object to your "jumping in."

        I really wish there were a chance that we could simplify English spellings. There was talk of that when I was young, but it feels like nobody is really behind the improvement. It's okay I guess, we are becoming an increasingly less literate society so it soon won't matter how we spell things.

        N0 wut I m33n d00ds?

        CYA

Re: Splitting text into syllables
by programmingzeal (Sexton) on Jan 25, 2022 at 16:16 UTC
    My goal is similar to this. This post is pretty old, so maybe there have been developments in syllabification techniques since then. I want to split any text in Ukrainian, Russian and English into syllables with Perl. What approach should I use to achieve this? Are there any libraries available, or do I have to do a dictionary lookup? Also, is hyphenation alone enough for syllabification?
