PerlMonks  

Sorry, I do use Kakasi as the main tool in my search engine now. But I sometimes use ChaSen for individual documents, since I am under the impression that it is slower but more flexible and more sophisticated. I only mentioned ChaSen because I remembered Nara and clustering, and that brought ChaSen to mind.

For those who are not familiar with either tool, they are morphological analyzers for Japanese text. The two are similar, and are generally used to split a chunk of text into individual words (Japanese words are not usually separated by spaces) and to get the phonetic reading of those words (usually in the roman alphabet).

Obviously this is enabling technology. The name Kakasi is in fact a kind of palindrome, in that read backwards phonetically it gives the name of a popular front-end processor, which takes roman-alphabet input and interactively picks the correct characters based on that phonetic reading and the context.

I believe Kakasi is focused more on workaday speed and usability, while ChaSen might be more flexible. In particular, I seem to remember some interesting use of ChaSen in document clustering work done in Nara and elsewhere. I couldn't find the exact page, but Google will help you look at the field.

Personally, I use these tools in the custom search engines I build, usually either completely in Perl or with plugins from projects like the above. They seem mainly useful for building an inverted index so that a lot of text can be searched quickly, but I also have a small (a few megabytes) Japanese database that works fine with just (Japanese) regexes.
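For anyone who has not built one, here is a rough sketch of the difference between an inverted index and a plain regex scan. The documents are toy data that I split into words by hand; in a real search engine that step would be done by an analyzer such as Kakasi or ChaSen. The function names (`build_index`, `lookup`, `regex_search`) are my own for this example and do not come from either tool.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal Japanese text

# Toy documents, already split into words (by hand here; this is the job
# a morphological analyzer would do in practice).
my %docs = (
    1 => [qw(私 は 検索 エンジン を 作る)],
    2 => [qw(日本語 の 検索 は 難しい)],
    3 => [qw(エンジン の 音)],
);

# Build the inverted index: word => { doc_id => 1, ... }
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $id (keys %$docs) {
        $index{$_}{$id} = 1 for @{ $docs->{$id} };
    }
    return \%index;
}

# Index lookup: one hash access, no scan of the text itself.
sub lookup {
    my ($index, $word) = @_;
    return sort { $a <=> $b } keys %{ $index->{$word} || {} };
}

# Regex alternative: rejoin each document and scan it with the pattern.
sub regex_search {
    my ($docs, $pattern) = @_;
    return sort { $a <=> $b }
           grep { join('', @{ $docs->{$_} }) =~ $pattern } keys %$docs;
}

my $index = build_index(\%docs);
binmode STDOUT, ':encoding(UTF-8)';
print "index: @{[ lookup($index, '検索') ]}\n";
print "regex: @{[ regex_search(\%docs, qr/検索/) ]}\n";
```

Both searches find the same documents here. The index pays the cost of word splitting once, up front, while the regex scan rereads every document on each query, which is fine for a few megabytes but gets slow as the text grows.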

I think it would be very interesting if Perl programmers could easily use state-of-the-art computational linguistics or "A.I." algorithms (besides, I guess, what is already in Perl) to make Perl even more intelligent and perhaps automate some of the programming task. For example, someone just gave me three nasty scripts to refactor together and update for 5.6.1; maybe Perl could learn to tell me, "Yep, those are real nasty scripts, better rewrite from scratch," or perhaps give me other insights into the code.

I am not a computational linguist, just interested. There is an awful lot of science there, so if anybody has insights about it, please share them with the rest of us.


In reply to Re: Re: Re: Perl and Linguistics by mattr
in thread Perl and Linguistics by doonyakka
