PerlMonks  

The (futile?) quest for an automatic paraphrase engine

by dimar (Curate)
on May 16, 2004 at 23:44 UTC ( #353840=perlquestion )
dimar has asked for the wisdom of the Perl Monks concerning the following question:

The Specimen

Before ...

With a population of more than 10.2 million, Seoul, the capital of South Korea, is the world’s largest city in terms of population. Sao Paulo(Brazil), the world’s second-largest city, has a population of just over ten million. Three other cities, Bombay(India), Jakarta(Indonesia) and Karachi(Pakistan), have grown to more than nine million people.

After ...

Seoul has a population of more than 10.2 million.
Seoul is the capital of South Korea.
Seoul is the world’s largest city in terms of population.
Sao Paulo(Brazil) is the world’s second-largest city.
Sao Paulo(Brazil) has a population of just over ten million.
Bombay(India) has grown to more than nine million people.
Jakarta(Indonesia) has grown to more than nine million people.
Karachi(Pakistan) has grown to more than nine million people.

The Question

How can I use perl to automatically (or at least partially) generate AFTER from BEFORE?

The Background

There is a guy who wants to do this sort of thing, with the following disclaimers:

  • The guy is not a linguistics professor
  • The guy wants to spit out questions for a 'flashcard' type thingy
  • The guy prefers practical nuts and bolts examples to pie-in-the-sky visions of 'AI'
  • The guy wishes to avoid esoteric concepts beyond the grasp of a moderately competent college student who knows some perl.
  • The guy admits this is the stuff of decades of research, multitudinous PhD theses, and towering artifices of herculean intellectual endeavor, but still wants a simple solution from perlmonks.org.

The Speculations

The guy has toyed with the following speculations:

  • Build a 'corpus' of domain-compatible 'trigger words' and use 'split' with those as delimiters (eg 'has a','is a','having a', 'has grown', etc)
  • Simply split the BEFORE text on punctuation, call those 'the building blocks' and randomly generate different structures based on those building blocks, discarding (by hand) all but those which make sense.
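A minimal sketch of the first speculation, assuming a small hand-built trigger list (the phrases in `@triggers` are illustrative guesses, not a vetted corpus). Putting the triggers in a capture group makes `split` return them alongside the chunks, so the pieces can be recombined later.

```perl
use strict;
use warnings;

# Hypothetical trigger phrases; a real list would be built per domain.
my @triggers = ('has a', 'is a', 'is the', 'has grown to', 'have grown to');

# Longest phrases first so a longer trigger wins over a shorter overlap.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a } @triggers;
my $trigger_re = qr/\s+($alt)\s+/;

# Split one sentence into alternating chunks and trigger phrases.
sub split_on_triggers {
    my ($sentence) = @_;
    $sentence =~ s/\.\s*$//;               # drop the final full stop
    return split $trigger_re, $sentence;   # the capture group keeps the triggers
}

my @parts = split_on_triggers(
    'Sao Paulo(Brazil) has a population of just over ten million.');
print join(' | ', @parts), "\n";
```

For the sample sentence this prints `Sao Paulo(Brazil) | has a | population of just over ten million`. Sentences with commas and appositives, like the first one in the specimen, still need more than a trigger split.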

The Disclaimer

Yes, the guy has seen the nodes on NLP and searched around a bit, but answers always seem shrouded in a funk of elaborately ornate statistical contrivances that seem overly complicated for the task at hand. The guy was reluctant to ask this question, but WTH, someone might be able to help with a breakthrough.

Re: The (futile?) quest for an automatic paraphrase engine
by tachyon (Chancellor) on May 17, 2004 at 00:10 UTC

    So this guy wants Natural Language Processing. Not just for input but also for output. You have an understanding that this is the holy grail of AI and the subject of quite probably terabytes of PhD theses. You have only a basic understanding of Perl (probably not the best/worst language in which to perform AI) and you find the material you have seen too complicated?

    Tell the guy to wait for quantum computers to hit the desktop then post again :-)

    What you can do is split into sentences (even that is non-trivial, i.e. split /\./, $text goes wrong on "Mr. Smith"). See Text::Sentence. Past that you have quite possibly the most non-trivial problem in CS/AI.
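To make the "Mr. Smith" problem concrete, here is a rough sketch in plain Perl (it does not use Text::Sentence). The abbreviation list `%abbrev` is a tiny, hypothetical stand-in; real text needs a much longer one.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny abbreviation list.
my %abbrev = map { $_ => 1 } qw(Mr Mrs Ms Dr Prof St);

# Split text into sentences, refusing to break after a known abbreviation.
sub split_sentences {
    my ($text) = @_;
    my @sentences;
    my $pending = '';
    # Candidate breaks: whitespace that follows ., ! or ?
    # ("10.2" has no whitespace after the dot, so it survives intact).
    for my $piece (split /(?<=[.!?])\s+/, $text) {
        $pending = $pending eq '' ? $piece : "$pending $piece";
        # A piece ending in a known abbreviation is not a sentence end.
        next if $pending =~ /\b(\w+)\.\z/ && $abbrev{$1};
        push @sentences, $pending;
        $pending = '';
    }
    push @sentences, $pending if $pending ne '';
    return @sentences;
}

my $text  = 'Mr. Smith moved to Seoul. It has more than 10.2 million people.';
my @naive = split /\./, $text;          # breaks after "Mr" and inside "10.2"
my @fixed = split_sentences($text);
print scalar(@naive), " naive pieces, ", scalar(@fixed), " sentences\n";
print "$_\n" for @fixed;
```

The naive split yields four fragments, the abbreviation-aware version two sentences. It still fails when a sentence really does end in an abbreviation, which is part of why the problem is called non-trivial.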

    The only way you could generate a (still non) trivial solution of vague utility is to constrain the problem to a very limited subset of input text.

    cheers

    tachyon

      There seem to be some fairly nice results from statistical methods. I have a couple of references in my post in this thread, but what it boils down to is that there is the knowledge-based way (yours and my preference, apparently) and at least one statistical method being used. The statistical NLP method is called "clustering" because it creates clusters of semantically relevant sentence constituents and then re-uses those constituents to generate a summarization text.

      --
      Damon Allen Davison
      http://www.allolex.net

        Stats are somewhat like doing spam with Bayes, Fisher/Robinson etc. For certain tasks they can make useful 'educated' guesses but they are still 'dumb' algorithms. If you look at how a child learns language they do seem to use a suck it and see approach. They then get feedback on if that was a 'winner' or not. As approaches to AI go I think both knowledge and stats based are 'wrong'. While there is no doubt that both can yield useful results they appear to my mind to have finite limits. I favour a fuzzy logic nodal learning framework ie try to build a machine that can learn without trying to tell it exactly how to learn that. The main issue with this is processor speed (or rather the lack of it) combined with the training time. Language processing is actually a good task for this as you have what is effectively a character based input and output stream making the interface simple.

        cheers

        tachyon

Re: The (futile?) quest for an automatic paraphrase engine
by Zaxo (Archbishop) on May 17, 2004 at 00:21 UTC

    Your guy not only wants pie in the sky, he also wants it growing on trees. Academics are not so devious that they make simple problems hard on purpose. This really is hard.

    After Compline,
    Zaxo

Re: The (futile?) quest for an automatic paraphrase engine
by kiat (Vicar) on May 17, 2004 at 00:41 UTC
    Interesting but highly sophisticated. Most probably you would need not just a syntactic analysis (how the words are strung together) of the input sentence(s) but a semantic one (the meanings of the words) as well.

    The parser has to "understand" the following partial realisations of the original first sentence:

    With a population of more than 10.2 million, Seoul, the capital of South Korea, is the world’s largest city in terms of population.
    1) Seoul, the capital of South Korea, is the world’s largest city in terms of population with a population of more than 10.2 million.

    2) The capital of South Korea, Seoul, is the world’s largest city in terms of population with a population of more than 10.2 million.

    3) With a population of more than 10.2 million, the capital of South Korea, Seoul, is the world’s largest city in terms of population.

    4) In terms of population, Seoul, the capital of South Korea, is the world’s largest city with a population of more than 10.2 million.

    Update

    I think the first task of any parser is to recognise that the original first sentence (or its transformed counterparts) has the following constituents:

    {With a population of more than 10.2 million}pp, {Seoul}np, {the capital of South Korea}np, {is the world’s largest city}vp {in terms of population}pp.

    where

    pp->prepositional phrase

    np->noun phrase

    vp->verb phrase

    With some embeddings of constituents:

    {With a population {of more than 10.2 million}pp}pp

    {the capital {of South Korea}pp}np

      With some embeddings of constituents: {With a population {of more than 10.2 million}pp}pp {the capital {of South Korea}pp}np

      It gets worse. It has to understand modifiers -- and know which modifiers are modifying what. For example, the prep. phrase starting with "with a pop..." in the example sentence modifies Seoul, but does "in terms of population" modify "is" adverbially, or does it modify "city" adjectivally? A human can analyze what the sentence would mean each way and conclude that it doesn't matter -- the meaning is the same. You're going to ask AI to figure that out?

      Let me lay it on the line: it is *possible* to achieve *sporadic* and *partial* results using an assortment of tricks, but a human is still going to have to go over the results. It would be interesting academic research, but it is currently not of any practical value, because the programming is going to cost more money than the program's going to save you over the obvious solution of hiring a minimum-wage peon to do it instead of writing the program. Yep, that's right: my recommendation to the OP is, hire a work-study student (who is not majoring in your subject area, preferably) to write your questions or whatever, and just forget about programming it -- unless AI research is interesting to you for its own sake.


      ;$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$;[-1]->();print
Re: The (futile?) quest for an automatic paraphrase engine
by kvale (Monsignor) on May 17, 2004 at 00:54 UTC
    Hmm, you want to
    • parse general natural language,
    • attach a natural semantics to the tokens,
    • reason with the semantically decorated parse tree to create an internal model of the information,
    • From that internal model, generate natural sounding questions in 'flashcard style'.

    No problem! Hire an intelligent English speaker who can do all of the above, because it will be much, much easier and cheaper than attempting to solve any of these AI tasks with perl or any other computer language. HAL 9000 is still science fiction.

    Each one of the tasks above is an open problem in natural language processing, except perhaps the last, which could conceivably be executed with a clever template system.
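As a sketch of what such a template system might look like, assume the hard part is already done and each fact is a [subject, relation, object] triple. The `%template` table and its two question patterns are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical question templates, keyed by relationship type.
my %template = (
    is  => 'What is %s?',
    has => 'What does %s have?',
);

# Turn one [subject, relation, object] fact into a question/answer pair.
# Returns an empty list when no template exists for the relation.
sub flashcard {
    my ($subject, $relation, $object) = @{ $_[0] };
    my $format = $template{$relation} or return;
    return (sprintf($format, $subject), "$subject $relation $object.");
}

my @facts = (
    ['Seoul', 'is',  'the capital of South Korea'],
    ['Seoul', 'has', 'a population of more than 10.2 million'],
);
for my $fact (@facts) {
    my ($question, $answer) = flashcard($fact) or next;
    print "Q: $question\nA: $answer\n";
}
```

The questions come out stilted ("What does Seoul have?"), so this only covers the generation step and leaves the wording to a human editor.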

    -Mark

Re: The (futile?) quest for an automatic paraphrase engine
by dimar (Curate) on May 17, 2004 at 01:14 UTC

    Rephrasing the original Question

    The original question (as written) kinda makes it look like the guy is: 1) trying to over-simplify deep problems of AI; and 2) Looking for a 'complete solution'. Neither of these is really the case.

    What I was really getting at, is whether there are some "brute force" strategies to help someone change BEFORE into AFTER, while minimizing the amount of 'hand editing' necessary. Yes, AI is complicated, and I guess so is trying to formulate a question to find ways around it ;-)

      Taking common rules of English into account and trying to simplify the structure of a sentence is not really something that can be "brute-forced" though. Certain words have an explicit usage, others have different uses that must be negotiated using the context in which it appears, a task I think is AI in nature.

      I'm not saying it's impossible though. You could certainly scan a text for sentences that fit a certain predetermined structure for parsing, but it seems to me that making full use of random texts will require some kind of intelligence. That said, I'm also very interested in any other input on the topic.


      Roses are red, violets are blue. All my base, are belong to you.
      If you want a brute-force solution, one approach would be to first create an engine that solves the reverse problem. That is, given the set of facts, create a routine that generates examples of the type of input that you would like to be able to handle.

      If you create an exhaustive list of all the possible inputs, this list could be used as a lookup table to map your desired input back to the output.

      Since the list of inputs is so large, you'll want to find ways to remove the redundancies in the list. You can think of the various AI techniques as being clever ways to make the length of the list smaller by using perl code.
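A toy sketch of the reverse-problem idea, assuming facts are [city, country] pairs and that two hypothetical renderers in `@renderers` cover every phrasing of interest. Real input would need far more variants, which is exactly the list-size problem described above.

```perl
use strict;
use warnings;

# Hypothetical renderers: each turns a [city, country] fact into one phrasing.
my @renderers = (
    sub { "$_[0] is the capital of $_[1]." },
    sub { "The capital of $_[1] is $_[0]." },
);

# Generate every phrasing of every fact and map each one back to its fact.
sub build_lookup {
    my @facts = @_;
    my %lookup;
    for my $fact (@facts) {
        $lookup{ $_->(@$fact) } = $fact for @renderers;
    }
    return \%lookup;
}

my $lookup = build_lookup(['Seoul', 'South Korea'], ['Jakarta', 'Indonesia']);
my $fact   = $lookup->{'The capital of Indonesia is Jakarta.'};
print "capital=$fact->[0] country=$fact->[1]\n";
```

This prints `capital=Jakarta country=Indonesia`. An input sentence that no renderer produces is simply absent from the table.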

      Take a look at all the Lingua modules, such as Lingua::Stem. Some of these modules are quite inspiring!

      There is nothing wrong with working on a 'grand challenge' type of problem, even if you are an amateur. I have worked on some of these challenges. When I get something working reasonably well, people say something like, "Oh, of course, if that's all you want, it is easy. The really hard part is..."

      It should work perfectly the first time! - toma
      some "brute force" strategies to help someone change BEFORE into AFTER, while minimizing the amount of 'hand editing' necessary.

      Oh, you only want to minimize the hand-editing necessary, not eliminate it. Yes, that's a much easier problem. Copy and paste will help for starters. It's also pretty easy to write tools that will do things like capitalize the first letter of every sentence, so that you don't have to fuss with that when you rearrange sentence parts so that the first letter is not the first letter anymore. Using a full-featured text editor (e.g., Emacs) will help considerably too, since you can do things like bind a keystroke to highlight forward to the next punctuation mark and cut or copy the result to the clipboard. (Then all you have to do is go paste it...) It's actually pretty easy to get the computer to do a lot of your editing work for you -- as long as it doesn't have to decide what editing to do.
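The capitalization helper mentioned above could be as small as the following sketch. The function name `capitalize_sentences` is made up here, and the rule (upper-case whatever follows ., ! or ? plus whitespace) is a simplifying assumption.

```perl
use strict;
use warnings;

# Upper-case the first letter of the text and of anything that follows
# sentence-ending punctuation plus whitespace.
sub capitalize_sentences {
    my ($text) = @_;
    $text =~ s/(\A\s*|[.!?]\s+)([a-z])/$1\u$2/g;
    return $text;
}

print capitalize_sentences(
    'seoul is the capital of South Korea. it has a population of more than 10.2 million.'
), "\n";
```

Because the pattern requires whitespace after the punctuation, "10.2" is left alone, but a word following an abbreviation such as "e.g." would be capitalized wrongly.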


      ;$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$;[-1]->();print
Re: The (futile?) quest for an automatic paraphrase engine
by hv (Parson) on May 17, 2004 at 01:34 UTC

    Since there are so many people here saying how hard this is, I guess it wouldn't work. But I'd suggest starting with the first sentence and working it into the pattern of passive text (words that will be pulled unmodified into a result sentence) and active (the structural words from which you'll derive meaning).

    It might look something like this:

    With A, B, C is D.
    from which you want to pull out:
    B has A. B is C. B is D.

    So, write a program that accepts a mapping of patterns to results, and get it to the point that it can parse the first sentence and produce those results. Then add a pattern and mappings for the second sentence.
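A minimal sketch of such a pattern-to-results mapper, assuming Perl 5.10 or later for named captures. The single rule in `@rules` and the `{A}` placeholder syntax are inventions for illustration; note that the real first sentence has a comma before "is", so the pattern includes it.

```perl
use strict;
use warnings;

# Each rule pairs a regex (the "active" words) with output templates that
# reuse the captured "passive" text. Only one illustrative rule is defined.
my @rules = (
    {
        # With A, B, C, is D.
        pattern => qr/^With (?<A>.+?), (?<B>.+?), (?<C>.+?), is (?<D>.+?)\.$/,
        results => ['{B} has {A}.', '{B} is {C}.', '{B} is {D}.'],
    },
);

# Return the simple sentences produced by the first rule that matches,
# or an empty list when no rule applies.
sub paraphrase {
    my ($sentence) = @_;
    for my $rule (@rules) {
        next unless $sentence =~ $rule->{pattern};
        my %bound = %+;    # copy the named captures before the next regex runs
        return map { (my $out = $_) =~ s/\{(\w)\}/$bound{$1}/g; $out }
               @{ $rule->{results} };
    }
    return;
}

print "$_\n" for paraphrase(
    "With a population of more than 10.2 million, Seoul, the capital of "
  . "South Korea, is the world's largest city in terms of population.");
```

On the specimen's first sentence this prints the three "Seoul ..." lines from AFTER. A sentence of the same shape but different meaning goes through the same rule and comes out wrong, so each rule needs the extra constraints described next.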

    At some point you'll probably find that you need to distinguish things slightly better - perhaps this first pattern only works because 'B' maps to a place name (or more generally a proper noun). So start embellishing the patterns to allow the additional semantics to be expressed.

    Pronouns are likely to make things rather more difficult, so leave them out unless you can find a way to embellish the patterns to say "this would reset the current topic". Similarly, you're unlikely to cope well with classical problem cases such as "time flies like an arrow", so don't support them. You may well find that you can get a 90% solution for a restricted language space, particularly if the text to be parsed is using very standardised constructs.

    Repeat until any(bored, problem is solved, you find out why this approach won't work). :)

    Hugo

      Sorry to nit-pick, but given a template like this:
      With A, B, C is D.
      the most common examples are things like "With Ripoffsky, a first-round draft pick, Green Sox manager Frump is unlikely to see a pennant this year." That is:
      A is (was?) B. C has A. C is D.
      Of course, a non-trivial part of the "project" at hand is to pick a suitable "corpus" of sentences that lend themselves to this sort of treatment -- and I don't know any automated way to handle that either.

      There are some fairly well-developed means for spotting "entities" (especially "named entities") -- i.e., the referent noun phrases that make up the subjects and objects of factoids. There has even been some progress on trying to link pronominal references to "named entities" with some degree of success (yes, this is much harder, and quite impossible to do algorithmically for a large percentage of cases -- humans often get this wrong). And some progress on "roles" of entities within sentences (agent, recipient, direct-object etc), but again with much left to be desired.

      Still, if the idea is simply to provide some guidance to humans who have to come up with flash-card text (or trivia questions and answers), there are a number of Part-of-speech (POS) taggers out there that can at least do a decent job of labeling nouns, verbs, prepositions, etc. Whether this can be a useful aid to flash-card authors is another question, but there's some room for the imaginative GUI designer to try things out...

Re: The (futile?) quest for an automatic paraphrase engine
by toma (Vicar) on May 17, 2004 at 02:02 UTC
    Parsing english is an interesting node about this; I think that my response there may still be useful.

    It is not a bad idea to try to solve this type of problem. You can't help but learn from it, and it is interesting. The degree of difficulty depends on the size of the domain that you cover.

    As I've said before, don't be discouraged by a lack of reported successes in this area. Much of the work is outside of public view.

    I think Perl is a reasonable language choice for AI. It is commonly used by PhD-AI-researchers in their posh industrial research labs.

    It should work perfectly the first time! - toma
Re: The (futile?) quest for an automatic paraphrase engine
by pbeckingham (Parson) on May 17, 2004 at 02:22 UTC

    This is hard. This is AI hard. It doesn't get much harder than this.

    People have devoted, and are devoting, their careers to tackling this problem. I recommend leaving this to some very, very smart, dedicated academics with significant long term funding. Don't hold your breath waiting for the CPAN module to show up.

Re: The (futile?) quest for an automatic paraphrase engine
by rupesh (Hermit) on May 17, 2004 at 02:53 UTC
    Hey scooterm, the most difficult part of getting an answer (or finding a solution) is to put the question right.

    You have done an excellent (that being an understatement) job in identifying and analyzing the problem at hand and putting them in words...(and that's my opinion)

    Cheers to you.
Re: The (futile?) quest for an automatic paraphrase engine
by BrowserUk (Pope) on May 17, 2004 at 05:02 UTC

    How long do you have to achieve your goal?

    Given that you (or your friend) are intending to have a human being make final arbitration, you have reduced the problem from one of almost impossible to just very, very hard--but don't let that stop you from trying, if you have the time.

    Whenever you see discussion about natural language programming, the cliched example of "time flies like an arrow" comes up, but why? What is special about that phrase?

    My conclusion (and I'm no linguist as anyone who has read any of my posts will tell you:), is that what is special about that phrase is it doesn't make any sense!

    The challenge of the phrase is supposed to be how would any NLP or AI system be able to make sense of it. The answer is "It can't", but then, neither can a human being.

    Time flies like an arrow.

    Time isn't solid, so how can it fly?

    Ah, but it's an analogy. "Time flies", means it passes very quickly.

    Hang on. Time passes at a constant rate (Einstein aside), and "quickly" is an informal measure of time. So, How can you measure time in terms of time? Time cannot go quickly nor slowly. It just passes.

    But it's not a literal description of how time passes, it's a subjective description. Sometimes, human beings perceive time to pass more slowly or more quickly than at other times.

    Oh, I see. So arrows move quickly, therefore "time flies like an arrow" means that time is perceived to be moving more quickly than... well, when it isn't flying like an arrow?

    But an arrow, leaves the string of say a 70# bow travelling at around 300 ft/sec--that's 200 mph, which is pretty quick relatively--but from that point on it starts to slow down, until it stops!

    So, given that top end sports cars, motorcycles and trains can achieve and sustain 200 mph, an arrow is a pretty piss poor analogy for something travelling quickly.

    Maybe the point is that an arrow goes from A to B and doesn't come back? Unless someone picks it up and fires it back of course. And the analogy is meant to relate to that. Time only travels one way (sci-fi notwithstanding:).

    But hang on, if I launch an arrow straight up, then it comes back down. If there's no wind, and the drag from the flights is even, and I manage to launch it exactly vertically, it might even end up back where it started from...or worse.

    Hmmm. Write an AI/NLP program that can divine the meaning of the phrase "Time flies like an arrow".

    while( <DATA> ) {
        m[time flies like an arrow]i and print( "Does not compute!" ) and next;
        ## Some other stuff goes here.
    }

    There you go:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      My understanding of "Time flies like an arrow"...

      I think someone (could be Chomsky) paired that sentence with "Fruit flies like a banana."

      I may be wrong but I think "Fruit flies like a banana" is used to demonstrate the difficulty of understanding the meaning of a given utterance. So "Fruit flies like a banana" can be understood as:

      1) A type of insects called fruit flies that like a banana (so 'like' is used as a verb)

      2) A kind of fruit that flies like a banana. ('flies' used as a verb and 'like' as a conjunction)

      Incidentally, I found some articles here:

      http://www.fact-index.com/n/na/natural_language_understanding.html

        First, my post was intended to be (semi) humorous. That said, I'll continue my tirade a little further.

        I understand what the phrase and the pairing of the two phrases is meant to demonstrate. However, I would have to say that I think that understanding is derived and consequently artificial.

        Take the second phrase. "Fruit flies like a banana" and your two interpretations of it.

        1. A type of insects called fruit flies that like a banana (so 'like' is used as a verb).

          Can you really relate to anyone actually using that phrase to achieve that meaning?

          I know we might say that "People like a drink", where the singular usage "a drink" does not imply that they only like one, but "Fruit flies like a banana"?

          They might say "Fruit flies like bananas".

        2. A kind of fruit that flies like a banana. ('flies' used as a verb and 'like' as a conjunction).

          Hmm. A banana is fruit. Soooo, fruit flies like fruit?

          But fruit doesn't fly. It falls. It can be thrown. And if you put it on an aeroplane, it can be flown somewhere.

          I seriously doubt that either an ornithologist or an aerospace engineer would recognise any of those situations as being "flight".

          About the best interpretation of "Fruit flies like a banana", related to flight, that I can come up with is that:

          Like bananas, fruit doesn't fly.

          Something along the lines of "Flies like a lead balloon", but if that's the meaning that is being conveyed, then the latter is a much better way of conveying it.

        I guess the point I am making is that both phrases are tortuously derived to make the point that natural language processing is hard--but neither are exactly "natural language".

        It's a bit like saying that you cannot make a return trip to the Sun, so therefore space travel, whilst not impossible, is totally unworthwhile. Or building a bridge across the Atlantic is practically impossible, therefore building bridges is a waste of time.

        If you set the goals (for anything) artificially high, then you can render the problem insoluble.

        There are many problems that are generically insoluble in practical time frames, but that doesn't prevent partial solutions to subsets of the generic problem being used every day to good effect.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Remember...?
        "The boat floated down the river sank".
        "Oysters oysters split split"
        Jeez, no wonder half of the class went mad on Semantics and Syntax 101.
Re: The (futile?) quest for an automatic paraphrase engine
by BrowserUk (Pope) on May 17, 2004 at 05:48 UTC

    On a more serious note, the objective of the exercise seems to be to reduce the manual work involved in deriving questions that can be answered from having read a piece of text? Assuming that is the goal, then I think that this may be quite doable, using perl, but it would require a different tack to that you have outlined.

    Rather than trying to break the body of the text up into discrete chunks and then recombine them into possible answers, which a human being can then subset appropriately before deriving a set of questions, turn the process around.

    That is to say. Have the human being construct sets of questions from bodies of example text. Then write a program that takes the sets of questions and the bodies of text and attempts to derive patterns which relate the questions to sequences & relationships of words within the bodies of text.

    It would require a good number of bodies of text and sets of questions to 'train' the program, and some reasonable mechanism to allow a human being to correct and refine the patterns matched over time.

    Approaching the problem this way around means that the program does not have to perform any semantic analysis of either the text or the derived questions. It only needs to discover, extract, retain and refine patterns in text. Which, given Perl's backronym, its powerful regex engine, renowned text handling facilities and good database handling, makes it seem (to me) like a problem that Perl is eminently capable of tackling.

    Of course, if you have a Neural Net handy, they are designed for exactly this type of 'train the computer to recognise patterns in human heuristics, and then allow them to do it for you' problem.

    I briefly worked with an IBM product called "The Integrated Reasoning System" (TIRS) (about which I could find surprisingly little on-line), that was being used to encapsulate the judgments made by human insurance underwriters in arriving at policy costs for "non-standard" insurance risks. This is an infinitely more complex process than deriving questions from a body of text. Having seen, with my own eyes, just how good it became, very quickly, I wouldn't dismiss too quickly the rather academic language that most of the papers and articles to do with Neural Nets are couched in. It may be tough going at first, but no tougher than the problem that you are trying to solve.

    Oh, and good luck:)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: The (futile?) quest for an automatic paraphrase engine
by andyf (Pilgrim) on May 17, 2004 at 06:00 UTC
    If you want GPNLP, it's a _hard_ problem (as the above posts say). You could leverage the commonality in the output format and realise you are looking for a far smaller sentence subset than a general purpose NLP system. This way you have a hope of practically doing it. Of course the method will always be brittle, but as you say you have module Carbon::Life::Mammal::Human to help postprocess.
    1) you are only trying to parse _relationships_
    2) Each relationship you are looking for is either an ISA or HASA relationship.
    3) all final relationships are of the binary form x R y where R is the relationship between x and y
    My 'heuristic beard stroking algorithm' would be
    1) partition the whole token set into Entities and Relationships. Do this by pulling out all the proper nouns to start with.
    2) find and deconstruct the non trivial compound entities to remove qualifiers and break open sets such as 'Three other cities, x, y and z'
    3) Apply simple set math to determine the membership of each entity foreach relationship.
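Step 2, breaking open a set such as 'Three other cities, x, y and z', might be sketched like this. The regex is tied to the one sentence shape in the specimen and the 'has' relation label is hard-coded; both are assumptions, not a grammar.

```perl
use strict;
use warnings;

# Break "<label>, X, Y and Z, have <rest>." into one [x, R, y] triple per entity.
sub expand_set {
    my ($sentence) = @_;
    my ($members, $rest) =
        $sentence =~ /^[^,]+,\s*(.+),\s*have\s+(.+?)\.?$/ or return;
    # Members are separated by commas or by the word "and".
    my @entities = split /\s*,\s*|\s+and\s+/, $members;
    return map { [ $_, 'has', $rest ] } @entities;
}

my @triples = expand_set(
    'Three other cities, Bombay(India), Jakarta(Indonesia) and '
  . 'Karachi(Pakistan), have grown to more than nine million people.');
print "$_->[0] $_->[1] $_->[2].\n" for @triples;
```

This reproduces the last three lines of AFTER. An entity whose own name contains " and " would be split in two, which is the kind of brittleness to expect.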
    The biggest challenge you might have is moving from n->1 to n->n relationships. It's easy if everything has just one relationship, but Seoul being both Korea's capital and a city with a >10.2M population is the stumbler imho. Don't forget to account for unary attributes (Seoul is rainy) which don't involve another entity. As you say you have looked at some NLP, go back and read, read, read and there will be an answer lurking in here somewhere. Just don't try and generalise the problem too much or it will explode; the best way to practical NLP is to cheat. :) good luck,
    Andy
Re: The (futile?) quest for an automatic paraphrase engine
by ambrus (Abbot) on May 17, 2004 at 06:41 UTC

      This seems impressive, if you use the supplied sample text and ask the questions suggested in the hint. When you deviate from that, however, things get weird. For example, with the same supplied sample input, try asking these questions:

      • What are posed?
      • What is this program?
      • Who is the author?
      • What is NASA? (This is particularly amusing.)
      • What answers questions?
      Mild Spoiler:

      Even better, feed it the info from the top of this thread...

      ? What is the population of Sao Paulo?
      With a more than 10.2 million Seoul capital South Korea world?s largest city in terms.
      ? What is the world's largest city in terms of population?
      sh: -c: line 1: unexpected EOF while looking for matching `''
      sh: -c: line 2: syntax error: unexpected end of file

      ;$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$;[-1]->();print
Re: The (futile?) quest for an automatic paraphrase engine
by allolex (Curate) on May 17, 2004 at 08:16 UTC

    There is no brute force strategy that is going to help you solve this problem. The reason for this is that there are a very large number of ways to express the same idea in language and whatever solution you're using has to take that into account. The most common "solutions" out there tend to restrict their summarization to a specific domain, which helps because you can then look for things like keywords (your 'trigger words') to help you.

    If your specific case involves connecting declarations to cities, you could create a rule like this:

    while sentence contains CITY + VERB    # Assuming subject + verb + object (SVO) word order
        define TOPIC as CITY
    until sentence contains ANOTHER_CITY + VERB
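One possible reading of that rule in Perl, assuming Perl 5.10 or later, a hypothetical list of city names in `$city` and a handful of verbs in `$verb`. Each sentence is tagged with the most recent city seen directly before a verb.

```perl
use strict;
use warnings;

# Hypothetical gazetteer and verb list; both are assumptions for this sketch.
my $city = qr/Seoul|Sao Paulo|Bombay|Jakarta|Karachi/;
my $verb = qr/is|has|was|had/;

# Tag each sentence with the current topic: the last city seen in
# "CITY VERB" position. The topic carries over until another city replaces it.
sub tag_topics {
    my @sentences = @_;
    my $topic;
    my @tagged;
    for my $sentence (@sentences) {
        $topic = $1 if $sentence =~ /\b($city)\s+$verb\b/;
        push @tagged, [ $topic, $sentence ];
    }
    return @tagged;
}

my @tagged = tag_topics(
    'Seoul is the capital of South Korea.',
    'It has more than 10.2 million people.',
    'Sao Paulo has just over ten million.',
);
printf "%-10s %s\n", $_->[0] // 'unknown', $_->[1] for @tagged;
```

The second sentence inherits "Seoul" as its topic because no new city appears before a verb. As the post goes on to say, this only holds while the topic is also the syntactic subject.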

    But unfortunately, that's about as far as you get using "simple" methods. The problem here is that English has other ways of defining topics than just having the topic be the syntactic subject argument of the verb. So you have to have something which more or less "understands" your target language before you can summarize. There can really be no randomness about it. There is a project, however, that uses a statistical NLP method called "clustering" to get decent results in a general topic domain.

    Have a look at the source of the MEAD summarizer to see how a statistical summarizer works (It's written in Perl). You can also see what such a summarization engine produces at News in Essence, a news-domain summarizer. My personal preference would be a knowledge-based approach using a chunker (or "shallow parser") to get at the semantically relevant bits of the text.

    --
    Damon Allen Davison
    http://www.allolex.net

Re: The (futile?) quest for an automatic paraphrase engine
by Abigail-II (Bishop) on May 17, 2004 at 10:45 UTC
    Usually it's the problem of a posting that intrigues me. Not this time, what intrigues me are the questions "Who is 'this guy'?", and "What is your relationship with him?".

    Abigail

      I thought "this guy" owned a term-paper selling web site when I first read the question. :)

      To the OP: As others have said, this problem is HARD! You can't really tackle this without AI knowledge, I don't care what "this guy" thinks. In fact, there is probably a Turing award out there if you can solve NLP of English idioms and cliches and such -- especially when coupled with voice recognition and translation systems. However, I don't think you or "this guy" are going to get it any time soon. Pick up a copy of Russell & Norvig, or equivalent, and begin to learn what "hard problems" in AI mean versus what most other folks consider "hard problems". AI is a whole 'nother animal. And it's still a very loose science (overhyped too -- in that few understand what it actually is), with a lot of room for ground-breaking. Perl (due to functional constructs, etc) isn't a bad language for it at all, however.

        To Moose and Abigail

        Well, I can tell you that 'the guy' is definitely far less mysterious or nefarious than some term-paper plagiarist or something equally unethical like that. The thread gives hints as to the who, what and why.

        However, I will say it seems the particular writing style and nature of the question, combined with the insight of the responses has generated far more fascinating feedback than I coulda ever anticipated. There is much to chew on here. For that, I say hats off to all of you.

        Oh, by the way, the paraphrase engine offers the following tidbit ...

        The term "AI" is defined as 'anything that hasn't yet been demonstrated as feasible for a computer programmer'

        ;-)

Re: The (futile?) quest for an automatic paraphrase engine
by Anonymous Monk on May 17, 2004 at 14:52 UTC
    I work for a company that does something similar, and I can tell you this is some seriously hard stuff and seriously expensive to buy. If it were so simple that someone could just whip it up with a simple algorithm, you would put me out of a job :)

Perfect? No. Good enough? Maybe?
by Wally Hartshorn (Friar) on May 17, 2004 at 19:27 UTC

    I suspect the question isn't "Is there a way to do this and get 100% correct results" (answer: no), but rather "Is there a way to do this and get perhaps 60% correct results" (answer: ???). Yes, it might get tripped up by "time flies like an arrow" vs. "fruit flies like a banana" stuff, but if you set the bar lower, could something useful be created? (As to whether "60% correct" would be useful, that will be a question for your friend to consider.)

    Wally Hartshorn

Re: The (futile?) quest for an automatic paraphrase engine
by rje (Deacon) on May 17, 2004 at 20:43 UTC
    Well, if you can guarantee that the specimen encapsulates all of the grammar rules you're likely to find, and the sentences themselves consist of the only patterns you're going to find, then you can brute-force some perl out that's not too painful. Woodenly using your input sample as THE pattern, a 30-line script can blindly cobble together this kind of output (not perfect but close):
    Seoul: population  more than 10.2 million
    Seoul: capital  South Korea
    Seoul: is  world's largest city  terms  population.
    
    Sao Paulo(Brazil): world's second-largest city
    Sao Paulo(Brazil): has  population   over ten million.
    
    Three other cities: have grown to more than nine million people.
    
    Bombay(India): have grown to more than nine million people.
    
    Jakarta(Indonesia) and Karachi(Pakistan): have grown to more than nine million people.
    

      Hey dude, where's the code?

        Frankly, I'm embarrassed, because I'm BFI'ing it, instead of doing things properly.

        But here goes. Against my better judgement.
        #
        # WARNING WARNING WARNING WARNING
        #
        # USE AT YOUR OWN RISK.
        #
        # THIS IS A MASSIVE KLUDGE.
        #
        # YOU HAVE BEEN WARNED.
        #
        my $in = <DATA>;

        # ASSUME sentences end in a period and a space.
        my @sentences = split '\. ', $in;

        foreach( @sentences )
        {
           # ASSUME these words are mostly useless
           # for our purposes...
           s/\b(with|a|of|the|in|just)\b//gi;

           # ASSUME phrases are comma-separated.
           my @phrases = split ',';
           my @subjects = ();
           my @descs = ();

           foreach ( @phrases )
           {
              s/^\s*//; # trim leading spaces.
              s/\n//g;  # remove newline.

              # Well, do we have a subject, or a descriptor?
              # ASSUME subjects are capitalized (!!)
              push @subjects, $_ if /^[A-Z]/;
              # ASSUME descriptions are not.
              push @descs, $_ unless /^[A-Z]/;
           }

           # Print 'em all out.
           foreach my $subj ( @subjects )
           {
              my @subsub = ($subj);

              # ASSUME 'and' separates multiple subjects (!!)
              @subsub = split ' and ', $subj if $subj =~ /\band\b/;

              foreach my $ss (@subsub)
              {
                 print "$ss: $_\n" foreach @descs;
              }
           }
        }
        __DATA__
        With a population of more than 10.2 million, Seoul, the capital of South Korea, is the world's largest city in terms of population. Sao Paulo(Brazil), the world's second-largest city, has a population of just over ten million. Three other cities, Bombay(India), Jakarta(Indonesia) and Karachi(Pakistan), have grown to more than nine million people.
        The output:
        Seoul: population more than 10.2 million
        Seoul: capital South Korea
        Seoul: is world's largest city terms population
        Sao Paulo(Brazil): world's second-largest city
        Sao Paulo(Brazil): has population over ten million
        Three other cities: have grown to more than nine million people.
        Bombay(India): have grown to more than nine million people.
        Jakarta(Indonesia): have grown to more than nine million people.
        Karachi(Pakistan): have grown to more than nine million people.

Re: The (futile?) quest for an automatic paraphrase engine
by jonadab (Parson) on May 17, 2004 at 20:56 UTC
    answers always seem shrouded in a funk of elaborately ornate statistical contrivances that seem overly complicated

    There's a reason for this, and it's not because people don't want to solve the problem.


    ;$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$;[-1]->();print
Re: The (futile?) quest for an automatic paraphrase engine
by Fuzzy Frog (Sexton) on May 18, 2004 at 20:41 UTC

    Others have mentioned how hard natural language processing is. The actual deductions are considerably easier, at least the kind you seem to want. There is a computer language called Prolog which is specifically designed to generate valid deductions from appropriately structured data. Prolog is out of vogue in the AI community because purely deductive logic is rather sterile. Still, playing with it a little (in my case very little) will give you a feeling for how complicated human reasoning is.
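
    To give a feel for what "deductions from appropriately structured data" means without leaving Perl, here is a minimal sketch. It is naive forward chaining over one transitivity rule, not Prolog and not how a Prolog engine works internally. The fact triples were entered by hand from the specimen text, and the predicate names and the deduce sub are assumptions made up for this example.

```perl
use strict;
use warnings;

# Facts as [predicate, subject, object] triples. Entering these by hand
# is exactly the hard NLP step that this sketch skips.
my @specimen_facts = (
    [ 'capital_of',  'Seoul',     'South Korea' ],
    [ 'larger_than', 'Seoul',     'Sao Paulo' ],
    [ 'larger_than', 'Sao Paulo', 'Bombay' ],
);

# Rule: pred(X,Z) holds if pred(X,Y) and pred(Y,Z) hold.
# Keep applying it until no new fact appears.
sub deduce {
    my ( $pred, @facts ) = @_;
    my %seen;
    $seen{ join "\t", @$_ } = 1 for @facts;

    my $added = 1;
    while ($added) {
        $added = 0;
        my @rel = grep { $_->[0] eq $pred } @facts;
        for my $f (@rel) {
            for my $g (@rel) {
                next unless $f->[2] eq $g->[1];
                my $new = [ $pred, $f->[1], $g->[2] ];
                next if $seen{ join "\t", @$new }++;
                push @facts, $new;
                $added = 1;
            }
        }
    }
    return @facts;
}

print join( ' ', @$_ ), "\n" for deduce( 'larger_than', @specimen_facts );
```

    Once the facts are in this shape, deriving that Seoul is larger than Bombay takes a dozen lines. Getting from English prose to the triples is where the decades of research went.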

    On the subject of the nature of time, flies and arrows...

    I haven't read the Chomsky paper, but I think he used the sentence to illustrate different types of parse trees for English. He was not (primarily) making a statement about the ambiguity of the language.

    Btw, there are at least two other parsings of "Time flies like an arrow". "Time" could be an imperative verb, which yields the meanings:

    Time some insects as you would time an arrow.

    Time insects the way an arrow would time them.

    I keep thinking there is a fifth parsing as well, but I can't remember what it is.
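
    The ambiguity can be made mechanical with a toy sketch: give each word every part of speech it might have, then enumerate the combinations. The lexicon, the tag letters, and the "exactly one verb" filter are all assumptions for illustration; a real parser would use a grammar rather than a verb count.

```perl
use strict;
use warnings;

# Toy lexicon: N noun, V verb, P preposition, D determiner.
my %lexicon = (
    time  => [qw(N V)],
    flies => [qw(N V)],
    like  => [qw(V P)],
    an    => [qw(D)],
    arrow => [qw(N)],
);

# Return every possible tag sequence for a list of words.
sub tag_sequences {
    my @words = @_;
    my @seqs  = ( [] );
    for my $w (@words) {
        my $tags = $lexicon{ lc $w } or die "unknown word: $w";
        @seqs = map { my $s = $_; map { [ @$s, $_ ] } @$tags } @seqs;
    }
    return @seqs;
}

# Crude filter: ASSUME a simple sentence contains exactly one verb.
sub one_verb {
    my ($seq) = @_;
    my $n = grep { $_ eq 'V' } @$seq;
    return $n == 1;
}

my @words = qw(Time flies like an arrow);
for my $seq ( grep { one_verb($_) } tag_sequences(@words) ) {
    print join( ' ', map { "$words[$_]/$seq->[$_]" } 0 .. $#words ), "\n";
}
```

    Of the eight possible tag sequences, three survive the filter. The two imperative readings above share a single tag sequence (Time/V flies/N like/P ...), so they differ in structure rather than in parts of speech, which tagging alone cannot see.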

    -- Fuzzy Frog

Node Type: perlquestion [id://353840]
Approved by BrowserUk
Front-paged by ysth