Simple Keyword Generator

by itsscott (Acolyte)
on Sep 12, 2012 at 19:27 UTC ( #993286=perlquestion )
itsscott has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

I wanted a simple keyword generator for my CMS, which I have written in ANSI C. I came up with the script below, but I am not sure of the best way to implement it, or whether something similar already exists. I was not able to find a module or anything else that does this type of thing.

I know this is very rudimentary and quite possibly not very efficient, but I'm fairly green with Perl.

The first thing I was wondering: is it possible to actually compile this snippet into an ANSI C program, or is that foolish thinking?

I can curl to the program and pass it the $line and the $stopwords if that is easiest (using the CGI module, I would assume; I have toyed with that in the past).

I can use a system call as well, but I always get told that's not a good method.

Lastly, I can connect to the MySQL database and get the body content as well, but that's a little more work, as I have not used MySQL in my Perl before.

I am open to any other suggestions as well! Thank you in advance for any and all wisdom imparted!

#!/usr/bin/perl
use strict;
use warnings;

my $line = <<TEXT;
Moby-Dick was published in 1851 during a productive time in American literature, which also saw the appearance of Nathaniel Hawthorne's The Scarlet Letter and Harriet Beecher Stowe's Uncle Tom's Cabin. Two actual events served as the genesis for Melville's tale. One was the sinking of the Nantucket ship Essex in 1820, after it was rammed by a large sperm whale 2,000 miles (3,200 km) from the western coast of South America.[4][5][6] First mate Owen Chase, one of eight survivors, recorded the events in his 1821 Narrative of the Most Extraordinary and Distressing Shipwreck of the Whale-Ship Essex.
The other event was the alleged killing in the late 1830s of the albino sperm whale Mocha Dick, in the waters off the Chilean island of Mocha. Mocha Dick was rumored to have twenty or so harpoons in his back from other whalers, and appeared to attack ships with premeditated ferocity. One of his battles with a whaler served as subject for an article by explorer Jeremiah N. Reynolds[7] in the May 1839 issue of The Knickerbocker or New-York Monthly Magazine. Melville was familiar with the article, which described:
TEXT

my $stopwords = "and|that|they|very|you|your|want|are|able|aren|are|but|doesn|the|see|not|most|many|need|needs|look|just|get|from|for|all|this|have|who|with|was|went|when|has|him|his|what|which|while|two";

$line =~ s/[[:punct:]]|[0-9]/ /g;
$line = lc ($line);
$line =~ s/\b(?:$stopwords)\b/ /gi;

my %count_of;
foreach my $word (split /\s+/, $line) {
    length($word) > 2 and $count_of{$word}++;
}

print "All words and their counts: \n";
for my $word (sort keys %count_of) {
    $count_of{$word} > 1 and print "'$word': $count_of{$word}\n";
}
__END__

Re: Simple Keyword Generator
by stonecolddevin (Vicar) on Sep 12, 2012 at 19:32 UTC

    I'm not sure what your end goal is, but if you want some sort of search engine functionality (I'm assuming here), check out Lucy (on MetaCPAN) or ElasticSearch. They'll save you a few headaches down the road.

    Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past

      Actually, it's not for searching; it's just a helper for users to make meta keyword tags for their web pages.
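For the meta keyword use case, a minimal sketch of turning word counts into a tag might look like the following. The helper name `meta_keywords_tag`, the cutoff of three keywords, and the sample counts are all assumptions for illustration, not part of the original script. It also assumes the words contain only lowercase letters (as they would after the punctuation and digit cleanup), so no HTML escaping is done.

```perl
use strict;
use warnings;

# Hypothetical helper: build a meta keywords tag from a hash of word counts,
# keeping the $max most frequent words (ties broken alphabetically).
sub meta_keywords_tag {
    my ($count_of, $max) = @_;
    my @ranked = sort { $count_of->{$b} <=> $count_of->{$a} || $a cmp $b }
                 keys %$count_of;
    splice @ranked, $max if @ranked > $max;    # drop everything past the cutoff
    my $content = join ', ', @ranked;
    return qq{<meta name="keywords" content="$content">};
}

# Made-up sample counts, not real output from the script above.
my %sample = (whale => 3, essex => 2, mocha => 3, dick => 3, sperm => 2);
print meta_keywords_tag(\%sample, 3), "\n";
```

Sorting by count first and by name second keeps the result stable from run to run, which matters because Perl does not guarantee hash key order.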
Re: Simple Keyword Generator
by hbm (Hermit) on Sep 12, 2012 at 19:55 UTC

    One really minor thing I'd do differently: Rather than split on whitespace and then check length, just match 3 or more non-whitespace characters:

    #foreach my $word (split /\s+/, $line) {
    #    length($word) > 2 and $count_of{$word}++;
    #}
    $count_of{$_}++ for $line =~ /\S{3,}/g;
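A self-contained sketch of the same idea is below. The sample string is an assumption standing in for $line after punctuation and stopwords have already been replaced by spaces; without that earlier cleanup, \S{3,} would also pick up punctuation attached to words.

```perl
use strict;
use warnings;

# Assumed sample input: text that has already been lowercased and cleaned.
my $line = "moby dick   whale of a whale  sperm whale";

my %count_of;
# In list context, a /g match returns every match, so each run of
# three or more non-whitespace characters is counted directly.
$count_of{$_}++ for $line =~ /\S{3,}/g;

print "'$_': $count_of{$_}\n" for sort keys %count_of;
```

Here "of" and "a" never reach the hash because they are shorter than three characters, so no separate length check is needed.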
Re: Simple Keyword Generator
by RichardK (Priest) on Sep 12, 2012 at 23:44 UTC

    I wouldn't use search and replace to detect the stop words. It does much more work than you need, and it is going to get really slow as your stop word list grows.

    If you store your stop words in a hash, then it's easy and efficient to test whether a word exists in the hash, so you can do something like this:

    -- err not tested :)

    my @words = qw/ and or not one two three/;
    my %stop;
    $stop{$_}++ for @words;
    ...
    for my $w (split /\s+/, $line) {
        next unless length($w) > 2;
        next if $stop{$w};
        ...
        $keys{$w}++;
    }

    update -- fix typos
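A runnable version of the hash approach could look roughly like this. The helper name `keyword_counts`, the sample sentence, and the short stop list are assumptions for illustration; the punctuation and digit cleanup is borrowed from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: count words longer than two characters,
# skipping any word found in the stopword hash.
sub keyword_counts {
    my ($text, @stopwords) = @_;
    my %stop = map { $_ => 1 } @stopwords;

    $text = lc $text;
    $text =~ s/[[:punct:]]|[0-9]/ /g;    # same cleanup as the original script

    my %keys;
    for my $w (split ' ', $text) {
        next unless length($w) > 2;
        next if $stop{$w};               # one hash lookup per word
        $keys{$w}++;
    }
    return \%keys;
}

my $counts = keyword_counts(
    "The whale rammed the ship, and the ship sank.",
    qw/and the was/,
);
print "'$_': $counts->{$_}\n" for sort keys %$counts;
```

Each word costs a single hash lookup however long the stop list gets, whereas the regex alternation has more alternatives to try as the list grows.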
