PerlMonks  

Re: Re: Re: The (futile?) quest for an automatic paraphrase engine

by rje (Deacon)
on May 18, 2004 at 14:51 UTC


in reply to Re: Re: The (futile?) quest for an automatic paraphrase engine
in thread The (futile?) quest for an automatic paraphrase engine

Frankly, I'm embarrassed, because I'm BFI'ing it, instead of doing things properly.

But here goes. Against my better judgement.
#
# WARNING WARNING WARNING WARNING
#
# USE AT YOUR OWN RISK.
#
# THIS IS A MASSIVE KLUDGE.
#
# YOU HAVE BEEN WARNED.
#
my $in = <DATA>;

# ASSUME sentences end in a period and a space.
my @sentences = split '\. ', $in;

foreach( @sentences )
{
   # ASSUME these words are mostly useless
   # for our purposes...
   s/\b(with|a|of|the|in|just)\b//gi;

   # ASSUME phrases are comma-separated.
   my @phrases = split ',';

   my @subjects = ();
   my @descs    = ();

   foreach ( @phrases )
   {
      s/^\s*//;  # trim leading spaces.
      s/\n//g;   # remove newline.

      # Well, do we have a subject, or a descriptor?
      # ASSUME subjects are capitalized (!!)
      push @subjects, $_ if /^[A-Z]/;

      # ASSUME descriptions are not.
      push @descs, $_ unless /^[A-Z]/;
   }

   # Print 'em all out.
   foreach my $subj ( @subjects )
   {
      my @subsub = ($subj);

      # ASSUME 'and' separates multiple subjects (!!)
      @subsub = split ' and ', $subj if $subj =~ /\band\b/;

      foreach my $ss (@subsub)
      {
         print "$ss: $_\n" foreach @descs;
      }
   }
}

__DATA__
With a population of more than 10.2 million, Seoul, the capital of South Korea, is the world's largest city in terms of population. Sao Paulo(Brazil), the world's second-largest city, has a population of just over ten million. Three other cities, Bombay(India), Jakarta(Indonesia) and Karachi(Pakistan), have grown to more than nine million people.
The output:
Seoul: population more than 10.2 million
Seoul: capital South Korea
Seoul: is world's largest city terms population
Sao Paulo(Brazil): world's second-largest city
Sao Paulo(Brazil): has population over ten million
Three other cities: have grown to more than nine million people.
Bombay(India): have grown to more than nine million people.
Jakarta(Indonesia): have grown to more than nine million people.
Karachi(Pakistan): have grown to more than nine million people.

Replies are listed 'Best First'.
Re: Re: Re: Re: The (futile?) quest for an automatic paraphrase engine
by Anonymous Monk on May 19, 2004 at 02:12 UTC

    It's nice how you put up the *warning siren!!* on your assumptions ... Although in isolation some might criticize the assumptions as overly simplistic (even the OP??), I bet something like this could actually work as the beginnings of a very flexible tool. It would be a matter of building up a 'catalogue' of such assumptions, making them user-configurable (e.g. applying only a certain subset based on the input text specimen), and giving the user the opportunity to add custom assumptions. Moreover, this kind of model is relatively straightforward to understand, with a low barrier to entry. ... this one got the wheels turning hmmm ...
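One way the 'catalogue' idea might look, as a rough sketch only: each ASSUME from rje's script becomes a named code ref, and the caller can override any of them per text. The names `%catalogue`, `extract_facts`, and the rule keys are made up for illustration and are not from the original post. The sketch also collapses repeated spaces and splits sentences on a period followed by any whitespace, which the original does not do.

```perl
use strict;
use warnings;

# Hypothetical catalogue of named assumptions. Each entry is a code ref,
# so a user can replace one or add to the set for a given text.
my %catalogue = (
    # ASSUME sentences end in a period followed by whitespace.
    split_sentences => sub { split /\.\s+/, $_[0] },
    # ASSUME these words are mostly useless for our purposes.
    strip_filler    => sub {
        (my $s = $_[0]) =~ s/\b(?:with|a|of|the|in|just)\b//gi;
        return $s;
    },
    # ASSUME phrases are comma-separated.
    split_phrases   => sub { split /,/, $_[0] },
    # ASSUME subjects are capitalized, descriptions are not.
    is_subject      => sub { $_[0] =~ /^[A-Z]/ },
    # ASSUME 'and' separates multiple subjects.
    split_subjects  => sub { split /\s+and\s+/, $_[0] },
);

# Returns a list of "subject: description" strings.
# Any rule passed in %override replaces the catalogue default.
sub extract_facts {
    my ($text, %override) = @_;
    my %rule = (%catalogue, %override);
    my @facts;

    for my $sentence ($rule{split_sentences}->($text)) {
        my (@subjects, @descs);
        my $stripped = $rule{strip_filler}->($sentence);

        for my $raw ($rule{split_phrases}->($stripped)) {
            (my $phrase = $raw) =~ s/^\s+|\s+$//g;   # trim both ends
            $phrase =~ s/\s+/ /g;                    # collapse inner whitespace
            next unless length $phrase;

            if ($rule{is_subject}->($phrase)) {
                push @subjects, $rule{split_subjects}->($phrase);
            }
            else {
                push @descs, $phrase;
            }
        }

        for my $subj (@subjects) {
            push @facts, "$subj: $_" for @descs;
        }
    }
    return @facts;
}

my $text = "Sao Paulo(Brazil), the world's second-largest city, "
         . "has a population of just over ten million. ";

# Default catalogue.
print "$_\n" for extract_facts($text);

# Custom assumption: keep the filler words instead of stripping them.
print "$_\n" for extract_facts($text, strip_filler => sub { $_[0] });
```

Keeping every assumption behind a name means a per-text subset is just a different hash of overrides, which is roughly what the reply above seems to be asking for.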
