Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer

Theory time: Sentence equivalence

by BUU (Prior)
on Dec 17, 2005 at 11:17 UTC ( #517465=perlquestion: print w/replies, xml ) Need Help??
BUU has asked for the wisdom of the Perl Monks concerning the following question:

Another pet project of mine has recently branched to involve trying to determine whether sentences are "equivalent", that is, asking or referring to the same thing, despite different grammatical structures and so forth. Obviously this is a Very Hard Task and I'm not really looking to solve it perfectly. That would be very awesome though, but rather unlikely.

Anyway, my general idea at the moment basically involves stripping stop words, stemming the remainder and comparing the resultant set with my target set of the other sentence. This will probably be fairly fast for the small case and it's easy to think of and implement, but it has some flaws, for example, how do I store the resultant sets so I could easy reference them again? If I've got hundreds of thousands of these sets and I want to find the ones that match a new set I just created, how do I do that?

So that's my idea. Anyone else have any useful ideas? Pointers to research? Clever algorithms?

Replies are listed 'Best First'.
Re: Theory time: Sentence equivalence
by Ovid (Cardinal) on Dec 17, 2005 at 16:49 UTC

    While the name doesn't suggest being useful, you might find Lingua::LinkParser beneficial. It not only parses sentences, but also identifies each word's role in the sentence -- or how the words link together. Thus, you'll know a sentence about the "dog days of summer" isn't about dogs. You can read a detailed intro here.

    The module (the core of which integrates with C code that you install separately), generates data structures that break a sentence down into component parts, though there's a bit of work you' have to do to wade through all the types it lists. Here's a short example (the module has docs for all of those link types):

    use Lingua::LinkParser; use strict; my $parser = new Lingua::LinkParser; # create the parser my $text = "Moses supposes his toses are roses."; my $sentence = $parser->create_sentence($text); # parse the sentence my $linkage = $sentence->linkage(1); # use the first linkage print $parser->get_diagram($linkage); # print it out __END__ +-------------------------Xp------------------------+ | +------Ce------+ | +---Wd--+---Ss---+ +--Dmc--+--Spx--+--Opt-+ | | | | | | | | | LEFT-WALL Moses supposes.v his toses[!].n are.v roses.n .


    New address of my CGI Course.

Re: Theory time: Sentence equivalence
by BrowserUk (Pope) on Dec 17, 2005 at 15:04 UTC

    You may think that by asking a limited question, "Are these sentences equivalent", you are avoiding most of the problems that come with natural language processing.

    That simple question has only black and white answers, and that requires 100% certainty. And that implies a very great level of understanding. For example:

    • This is the thing.
    • This isn't the thing.
    • The thing is this.
    • Isn't this the thing.

    Four sentences of just four words each and only two characters separating them, but four completely distinct meanings. And you could substitute any noun for 'thing', and the sentences remain equally distinct.

    I do not think that there is any statistical measure or weighting algorithm that will allow you determine their equivalence or lack thereof, nor even give a probability of equivalence that would be useful in determining that.

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Theory time: Sentence equivalence
by Not_a_Number (Prior) on Dec 17, 2005 at 21:32 UTC
    Pointers to research?

    At one level, this is what research in computational linguistics has been working on for years. Before attempting to invent (note that I didn't say reinvent :) this particular wheel by yourself, try googling for "constraint based grammars", "machine translation", "semantic equivalence" or whatever.

    Moreover, I think you have a misconception over what constitutes "equivalence". The non-trivial issues of word order and punctuation have been pointed out above. A further problem is that sentences can be "equivalent", that not only have different grammatical structures but also contain very different words. For example:

    - In New York, following the latest Fed rate cut, stocks rose across the board.

    - The Federal Bank's further lowering of base rates boosted the NYSE and the NASDAQ.

    - Wall Street reacted positively after Greenspan reduced interest rates again.

    may be considered strictly equivalent, at least for some definition of equivalence, despite containing very few words in common.

    Anyway, best of luck in your project. I, for one, would be very interested in seeing your results.


    Update: Fixed typos, changed sample sentences slightly to make them more "equivalent".

      I don't think you realize what you've gotten into. Research into natural language processing (the official term) is at least 40 years old. Browse on citeulike for NLP, NLG (natural language generation), and NLU (...understanding). Also search Google Scholar.

      The topic is extremely theoretical and heavily realizes on Chompsky's theories of grammars, as well as tree theory. There are two main approaches: McKeown's (a name you will see a lot) is usually based on nested templates (IIRC) and requires a massive database of grammatical structures. Other approaches rely on deep semantic representations of the knowledge that is implied by grammatical structures.

      A couple of things that you have failed to consider, each of which has massive amounts of discussion in the literature:
      - Referring expressions (him, her, that, etc.)
      - Focus (what's the subject? Is the sentence really _about_ the subject?)
      - homonymns (semantic homonymns, at least)
      - fifty billion other things...

      The dismal state of machine translation ought to indicate how far yet we have to go. Babelfish is rather state of the art, actually.

      Good luck.

        On mondays, I think linguists -- Chomsky, Pinker, the lot of them -- are pseudoscientists peddling bumhug. Kind of like certain bad apple social scientists and continental philosophers -- see The Sokal Affair.

        On tuesdays, I think maybe linguists are more like physicists than the wizards sokal pulled back the curtain on.

        Enh, who knows. But a good place to start on the bad news of solving linguistics problems with computing is Pinker's The Language Instinct. Bumhug he may be, on mondays, but I liked the book it a lot anyways :)

        Babelfish is rather state of the art, actually
        ...mmmm not really. Babelfish/Systran may be the biggest fish in the pond commerically, but they are hardly state of the art. The big splashes technically are being made by people looking at statistical techniques such as Language Weaver (the company I work for), ISI, and Google.

        As for the original topic. I will just echo some of the other posters and warn you that you are taking some tiny first (mis?)steps on a journey whose destination is a long ways off. This is an intensely complex and interesting topic you could spend your entire life on if you are sufficiently interested/motivated/compensated.


        "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."

Re: Theory time: Sentence equivalence
by jZed (Prior) on Dec 17, 2005 at 16:10 UTC
    Don't forget to deal with word-order: Man bites dog. ... Dog bites man.
Re: Theory time: Sentence equivalence
by CountZero (Bishop) on Dec 17, 2005 at 20:07 UTC
    And punctuation:

    But father says: "John you are sitting on it".
    "But father," says John, "you are sitting on it".


    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Theory time: Sentence equivalence
by TedPride (Priest) on Dec 17, 2005 at 14:37 UTC
    Large sets are actually easier to use, from a searching standpoint. As each sentence is entered, identify what part of speech each word is (I assume you'll be doing that already). Now store word counts for each word as each part of speech, along with a list of the sentences that word belongs to.

    I LOVE bread and butter.
    LOVE is beautiful.

    Love in one sentence is a verb and in the other a subject. The two should be kept separate.

    As your sample grows, you should be able to get fairly accurate matches by adding up the weights for each word/part of speech x the number of times the word appears in the sentence. You only need to look at sentences containing key words and a match percentage over a certain level, which means that your heavy-duty algorithm will probably never need to do more than a few dozen sentences even with hundreds of thousands of sentences.

Re: Theory time: Sentence equivalence
by eweaver (Sexton) on Dec 18, 2005 at 07:37 UTC

    In terms of technical implementation of your thing, why don't you concatenate the stems together in alphabetical order (perhaps with a delimiter) and use that as a hash key to matching sentences? That should scale pretty well, and be fast to search.

Re: Theory time: Sentence equivalence
by dimar (Curate) on Dec 20, 2005 at 15:29 UTC

    In addition to the excellent posts in this thread, there are the corrolary responses in The (futile?) quest for an automatic paraphrase engine. Eventually, someone is going to review all the musings on this Very Hard Question, write a book on it, and solve it.

    Or not.


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://517465]
Approved by GrandFather
Front-paged by bart
and the pool shimmers...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (2)
As of 2018-05-25 00:30 GMT
Find Nodes?
    Voting Booth?