PerlMonks  

Keywords and keyphrases extraction from text

by vit (Friar)
on May 13, 2009 at 14:27 UTC ( [id://763774]=perlquestion )

vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
Does anybody know of, or use, a module or code for keyword and keyphrase extraction from text documents?

Replies are listed 'Best First'.
Re: Keywords and keyphrases extraction from text
by ELISHEVA (Prior) on May 13, 2009 at 16:52 UTC

    Extracting arbitrary "key phrases" is really a form of natural language parsing. To identify such key phrases you will need to develop a program that:

    • splits text into words.
    • identifies the part of speech associated with each word
    • identifies phrases composed of the correct combinations of parts of speech. A non-exhaustive list of the correct combinations would look something like this:
      noun_phrase :: (adjective*) noun prepositional_phrase
      prepositional_phrase :: preposition [article] noun_phrase

    Not one of these is entirely trivial.

    • Splitting into words. White space and punctuation will make a good first cut at splitting up text into words, but there will be important edge cases: hyphenated compound words in English are the first to come to mind. Sometimes hyphens are punctuation similar to a comma, sometimes they are minus signs, and sometimes they count as part of a compound word. If you care about capturing compound words in your keyword algorithm, your program will need to determine which one is which.
    • Identifying parts of speech. Many words in English have multiple parts of speech depending on context. For example, "Johnny broke Nancy's nail file whilst trying to use it to nail nails." In that single sentence the word "nail" is used as a noun, verb and adjective! Furthermore you will need to take into account all the forms of the noun. In English that is fairly simple - we have only four: singular, singular possessive, plural, plural possessive. In other languages the situation is much more complicated.
    • Identifying runs of words that contain the correct part of speech combinations. I started this post with a partial description of the valid part of speech sequences. Once your text is tagged, this may well be the easiest part of the entire process.
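The hyphen cases described in the first bullet can be sketched as follows. This is only a minimal illustration, not code from this thread: the helper name tokenize is made up, and the rule it applies (a hyphen or apostrophe counts as part of a word only when it sits directly between two letters or digits) is an assumption that will misjudge some real text.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into word tokens.
# A hyphen or apostrophe is kept only when it sits directly between two
# letters or digits (a compound such as "well-known" or a possessive such
# as "Nancy's"). A spaced dash or a leading minus sign is dropped.
sub tokenize {
    my ($text) = @_;
    return $text =~ /[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/g;
}

my @words = tokenize("A well-known taxi firm - Nancy's choice - charges -5 less.");
print join('|', @words), "\n";
# prints: A|well-known|taxi|firm|Nancy's|choice|charges|5|less
```

Note that the minus sign in "-5" is lost here; keeping it would need an extra rule for numbers.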

    As for modules that do that, I think you may be on your own. There are a few CPAN modules that can help you on your way, but they won't do the whole job for you. I've never used these but you may want to check out the various modules in the "Lingua" namespace:

    Best, beth

      Thanks a lot,
      But could you tell me what you use?
      noun_phrase :: (adjective*) noun prepositional_phrase
      prepositional_phrase :: preposition [article] noun_phrase
      looks like regular expressions on parts of speech to create word chunks.
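To make the "regular expressions on parts of speech" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the tiny %lexicon stands in for a real part-of-speech tagger, the one-letter tags are invented, and the pattern covers only a simplified adjective* noun+ rule rather than the full grammar quoted above.

```perl
use strict;
use warnings;

# Toy lexicon (hypothetical): a real system would use a proper tagger.
# Tags: A = adjective, N = noun, P = preposition, D = article, X = other.
my %lexicon = (
    luxury => 'A', major    => 'A', rapid   => 'A',
    cars   => 'N', vans     => 'N', london  => 'N',
    airports => 'N', service => 'N',
    at     => 'P', for      => 'P', in      => 'P',
    a      => 'D', the      => 'D',
);

# Return the runs of words whose tags match adjective* noun+.
sub noun_phrases {
    my (@words) = @_;
    # One tag character per word, so string offsets equal word indexes.
    my $tags = join '', map { $lexicon{lc $_} // 'X' } @words;
    my @phrases;
    while ($tags =~ /A*N+/g) {
        my ($start, $end) = ($-[0], $+[0]);
        push @phrases, join ' ', @words[$start .. $end - 1];
    }
    return @phrases;
}

my @words = qw(luxury vans at all major London airports);
print "$_\n" for noun_phrases(@words);
# prints:
# luxury vans
# major London airports
```

Encoding each word as a single tag character is what lets an ordinary regex do the chunking; the hard part remains assigning the right tag in the first place.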
Re: Keywords and keyphrases extraction from text
by planetscape (Chancellor) on May 13, 2009 at 15:49 UTC
      Does it work for keyphrases too?
      Do you use any dictionary, thesaurus, a priori training, etc.?

        Mine is a pretty simplistic system. Reading the docs for Lingua::EN::Keywords and Lingua::EN::Summarize will give you an idea of some of the limitations. However, for the task at hand at the time, it worked well enough.

        I've put together a small example, using your sample input, by cutting'n'pasting from the program I wrote back in Sept. 2005 to show you. Is it pretty, or would I write things the same way today? No and probably not, but it should suffice to illustrate:

        use strict;
        use warnings;
        use Lingua::EN::Keywords;
        use Lingua::EN::Summarize;
        use Lingua::StopWords;

        my $allcontent = 'Sky Travel Executives provide a rapid and reliable '
            . 'Airport Transfer service which specializes in catering for '
            . 'airport taxi transfers to and from all major London airports. '
            . '24 Hours executive cars and luxury 6/7 seater mini vans '
            . 'available at Heathrow Airport , Gatwick Airport , '
            . 'Stansted Airport , Luton Airport and City Airport. ';

        my @keywords = keywords($allcontent);

        print "Keywords:\n";
        print "=========\n";
        foreach my $keyword (@keywords) {
            print $keyword, "\n";
        }
        print "\n";

        my $summary = summarize( $allcontent );

        print "Summary:\n";
        print "========\n";
        print $summary;
        print "\n\n";

        It prints no summary for your text (too short?) and the keywords it picks are less than optimal.

        Keywords:
        =========
        airport
        major
        london
        airports.
        mini
        cars
        travel
        seater

        Summary:
        ========

        I'd experiment with longer inputs, myself, and/or some system of weighting certain keywords.
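One very simple form of weighting is to count how often each word occurs after discarding stop words. The sketch below is not what Lingua::EN::Keywords does internally; the function names and the short stop list are assumptions made up for this example.

```perl
use strict;
use warnings;

# Small hand-picked stop list, for illustration only.
my %stop = map { $_ => 1 } qw(a an and at for from in of the to which all);

# Count how often each non-stop word occurs (case-insensitive).
sub keyword_counts {
    my ($text) = @_;
    my %count;
    for my $word (map { lc $_ } $text =~ /[A-Za-z]+(?:['-][A-Za-z]+)*/g) {
        next if $stop{$word};
        $count{$word}++;
    }
    return %count;
}

# Order words by descending count, breaking ties alphabetically.
sub ranked_keywords {
    my (%count) = @_;
    return sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
}

my %count = keyword_counts(
    'Heathrow Airport, Gatwick Airport and City Airport: airport taxi transfers.'
);
my @ranked = ranked_keywords(%count);
printf "%-10s %d\n", $_, $count{$_} for @ranked[0 .. 2];
# prints:
# airport    4
# city       1
# gatwick    1
```

On a text this short nearly every word ties at a count of one, which is consistent with the advice to try longer inputs.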

        You may also find helpful some of Ted Pedersen's work, which I've discussed before.

        HTH,

        planetscape
Re: Keywords and keyphrases extraction from text
by wjw (Priest) on May 13, 2009 at 14:36 UTC
    You might want to be a bit more specific. Finding and extracting subtext from text is something Perl is very good at. But then so are a bunch of utilities out there. 'wc -l' is an example. Some questions:
    • Where are your key words/phrases?
    • Are you looking in one file or many?
    ...etc...

    Be specific about what you don't know how to do.. :-)

    ########################################################

    • ...the majority is always wrong, and always the last to know about it...
    • The Spice must flow...
    • ..by my will, and by will alone.. I set my mind in motion
      Sorry, I do need to be more specific.
      This is more than just a "how to program" question. What I mean is that I need to process the text in such a way that it is split into meaningful words and word combinations, which is called keyword and keyphrase extraction.
      For example for the text

      Sky Travel Executives provide a rapid and reliable Airport Transfer service which specializes in catering for airport taxi transfers to and from all major London airports. 24 Hours executive cars and luxury 6/7 seater mini vans available at Heathrow Airport , Gatwick Airport , Stansted Airport , Luton Airport and City Airport.

      It will be
      Sky Travel Executives, Airport Transfer service, catering for airport taxi transfers, executive cars, etc.
      and NOT, e.g.
      seater mini,...
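A crude way to get multi-word candidates like these without any part-of-speech tagging is to cut the text at punctuation and at stop words and keep the runs of words in between. This is a swapped-in heuristic, not a method anyone in this thread describes using; the function name and the stop list are assumptions.

```perl
use strict;
use warnings;

# Illustrative stop list; which words belong here is an assumption.
my %stop = map { $_ => 1 } qw(a all and at for from in of the to which);

# Return runs of two or more consecutive non-stop words.
sub candidate_phrases {
    my ($text) = @_;
    my @phrases;
    # Punctuation ends a phrase, so each fragment is handled on its own.
    for my $fragment (split /[.,;:!?()]+/, $text) {
        my @run;
        for my $word (split ' ', $fragment) {
            if ($stop{lc $word}) {
                push @phrases, "@run" if @run > 1;
                @run = ();
            }
            else {
                push @run, $word;
            }
        }
        push @phrases, "@run" if @run > 1;
    }
    return @phrases;
}

print "$_\n" for candidate_phrases(
    'Executive cars and luxury mini vans available at Heathrow Airport, Gatwick Airport.'
);
# prints:
# Executive cars
# luxury mini vans available
# Heathrow Airport
# Gatwick Airport
```

The stray "available" in the second candidate shows the limit of this approach: without knowing parts of speech, any word missing from the stop list tags along with its neighbours.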
Re: Keywords and keyphrases extraction from text
by leocharre (Priest) on May 13, 2009 at 15:05 UTC
    Well... what do you mean.. Do you mean that
    • a) you already have a list of keywords and keyphrases, and you want to examine any possible text document for the existence of such..
    • b) you have a predefined list of test text documents, and you want to find what the keyphrases are between them- that they share in common or something else.
      No,
      For a document, extract meaningful phrases; see my previous answer.
        You are still not really answering the question.

        Do you have a predefined list of what you deem to be a keyword and or keyphrase?

        Do you think the computer will magically *know* what those are supposed to be?

        (How do you think google and them figure out what the keywords and keyphrases are? They are told this by what their machines see on the web, if two pages link to another with the anchor text 'bogus meatball', then that is now a keyphrase to them of some value. )

        Are you going to be using a dictionary list, a list that you have at work... etc etc etc..

Re: Keywords and keyphrases extraction from text
by cosmicperl (Chaplain) on May 13, 2009 at 18:00 UTC
    Hi vit,
      Funnily enough, I've just been updating a script that does something like this. The code I use is:
    ### Clean up the text to make it easier to search
    $bodytext =~ s/\n/ /gis;
    $bodytext =~ s/\r//gis;
    $bodytext =~ s/\t/ /gis;
    $bodytext =~ s/ - / /gis;
    ### Collapse runs of spaces into a single space
    while ($bodytext =~ /  /) {
        $bodytext =~ s/  / /gis;
    }#while
    ### match 2 word groups
    while ($bodytext =~ /\b([A-Za-z'\-]+ [A-Za-z'\-]+)\b/g) {
        print "$1\n";
    }#while
    It's a bit hacky, but works. Although there is a nasty bug with it, I'm hoping someone will have the answer here
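The post does not say what the bug is, so the following addresses only one known limitation of a pattern like this and may not be the one meant. Because each /g match consumes both words, the pairs never overlap: "airport taxi transfers" yields "airport taxi" but not "taxi transfers". A sketch of one way around that, using a lookahead for the second word (the function name word_pairs is made up for this example):

```perl
use strict;
use warnings;

# Return every adjacent word pair, including overlapping ones.
sub word_pairs {
    my ($text) = @_;
    my @pairs;
    # The second word sits inside a lookahead, so it is captured but not
    # consumed and can start the next pair.
    while ($text =~ /\b([A-Za-z'\-]+) (?=([A-Za-z'\-]+)\b)/g) {
        push @pairs, "$1 $2";
    }
    return @pairs;
}

print "$_\n" for word_pairs('airport taxi transfers');
# prints:
# airport taxi
# taxi transfers
```

Like the original, this assumes the text has already been reduced to single spaces between words.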


    Lyle
