Extracting keywords from HTML

by wfsp (Abbot)
on Aug 21, 2005 at 12:41 UTC ( [id://485522] )

wfsp has asked for the wisdom of the Perl Monks concerning the following question:

I have adopted the following definition of a keyword:

Word

  • Case ignored.
  • Min length: 3, Max length: 20
  • Can include a hyphen or an apostrophe but not at either end (these are stripped). Possessive 's (cat's) also stripped. (can't, won't, hasn't etc. are in the stop list).
  • All other punctuation ignored.
  • Four-digit numbers between 1000 and 3000, with an optional trailing s (1960s), are kept. Anything else with a number in it is skipped.
  • Skip common words (stop words).
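
For illustration, the word rules above boil down to something like this (a rough sketch only; the sub name and the sample %stop hash are made up, and it assumes the input is already lower-cased with leading/trailing hyphens, apostrophes and possessive 's trimmed):

use strict;
use warnings;

# sample stop list; the real one is read from a file
my %stop = map { $_ => 1 } qw(and the any they);

sub is_keyword {
    my ($word) = @_;
    return 0 if length($word) < 3 or length($word) > 20;
    return 0 if $stop{$word};
    return 0 if $word =~ /\d/
            and $word !~ /^[12]\d{3}s?$/;   # only four-digit years (plus optional s) survive
    return 1;
}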

HTML

To preserve the apostrophe, &rsquo; and &#39; are replaced with '.

All other HTML punctuation entities are then removed and the remaining entities decoded. Apart from punctuation it is all Latin-1 (I've checked it - at length!).

Update 2
I should have mentioned that the text has already been stripped from an HTML file. (!)
Apologies for any confusion

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

my $config = {
    minw => 3,
    maxw => 20,
};
my $stop = get_stop();
my $punc = get_punc();

my $text = q|
    cat&#39;s dogs O&rsquo;Reilly ad-hoc &ldquo;broken&rdquo; hyphen- the and
|;

my $words_all = {};

# contrived loop to show usage
for my $t ($text){
    get_word($t);
}
print "$_\n" for keys %{$words_all};

sub get_word{
    my ($text, $file_key) = @_;
    my ($min, $max) = ($config->{minw}, $config->{maxw});

    for ($text){
        s/&rsquo;|&#39;/'/g;                       # convert apostrophe entities to '
        s/(&#?\w+;)/exists $punc->{$1}?' ':$1/eg;  # replace known punctuation entities with a space
    }
    decode_entities($text);
    $text =~ s/[^\w'-]/ /g;

    my @words = split ' ', $text;
    for (@words){
        s/^['-]//g;                                # strip leading hyphen/apostrophe
        s/['-]s?$//;                               # strip trailing hyphen/apostrophe, possessive 's
        next if length() < $min or length() > $max;
        next if exists $stop->{$_};
        next if /\d/ and not /^[12]\d{3}s?$/;      # skip numbers unless a year like 1960, 1990s
        next if /--/;
        push @{$words_all->{$_}}, $file_key;
    }
}

sub get_stop{
    # sample
    return {
        qw(
            and  ''
            any  ''
            the  ''
            they ''
        )
    };
}

sub get_punc{
    # sample
    return {
        '&rsquo;' => undef,
        '&lsquo;' => undef,
        '&rdquo;' => undef,
        '&ldquo;' => undef,
    };
}
__DATA__

---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
hyphen
cat
ad-hoc
O'Reilly
broken
dogs
> Terminated with exit code 0.

At the moment the full app is run locally on a copy of the web site.

It's generating 42k words, but I'm working on the stop file to try to bring that down.

If this turns out to be fairly stable I'm considering compiling the regexes outside of the loop.
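
As a sketch, hoisting the patterns out with qr// might look something like this (names and sample input are illustrative only, not lifted from the script above):

use strict;
use warnings;

# compiled once, outside the word loop
my $date_re = qr/^[12]\d{3}s?$/;          # years like 1960, 1990s
my $trim_re = qr/^['-]+|['-]+$/;          # leading/trailing ' and -

my @words = qw(1960s -ad-hoc- 77x);       # made-up sample input
for my $word (@words) {
    $word =~ s/$trim_re//g;
    next if $word =~ /\d/ and $word !~ $date_re;
    print "$word\n";
}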

What do you reckon?

winxp, activestate 5.8

Update:
Corrected get_stop() sub

Replies are listed 'Best First'.
Re: Extracting keywords from HTML
by fizbin (Chaplain) on Aug 21, 2005 at 18:11 UTC

    How do you intend to handle accented letters? Should "resumé" be equivalent to "resume"?

    Right now, as your code stands, those words are not equivalent. If they should be equivalent, you'll want to look at this node I just wrote today that squishes accented letters into their non-accented equivalents.
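
    (For reference, one common way to do that squishing, though not necessarily how that node does it, is to decompose the string and strip the combining marks; note this doesn't cover ligatures like Æ or ß. The sub name here is made up.)

    use strict;
    use warnings;
    use utf8;                         # the literal below is UTF-8 in this snippet
    use Unicode::Normalize qw(NFD);

    sub unaccent {
        my ($s) = @_;
        $s = NFD($s);                 # é  ->  e + combining acute accent
        $s =~ s/\pM//g;               # strip the combining marks
        return $s;
    }

    print unaccent('resumé'), "\n";   # prints "resume"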

    Also, I'd suggest some tweaks in your existing code. For example, I'd change get_stop and get_punc as follows:

    sub get_stop{
        # sample
        my %stop = map { $_ => 1 } qw(and any the they);
        \%stop;
    }

    sub get_punc{
        # sample
        my %punc = map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;);
        \%punc;
    }

    Not only does this form make it easier to add new entries, it makes it easier to use in the rest of your code - you don't need all those calls to exists any more:

    s/(&#?\w+;)/$punc->{$1}||$1/eg;
    and
    next if $stop->{$_};

    Finally, your code as it stands doesn't actually do quite what you described - as a test give it the data:

    ''words in double single quotes''

    The fix of course is to change the regular expressions used to normalize the data:

    for (@words){
        s/^['-]+//;
        s/['-]+s?$//;
        next if length() < $min or length() > $max;
        next if $stop->{$_};
        next if /\d/ and not /^[12]\d{3}s?$/;
        # next if /--/;    # not needed anymore
        $words_all->{$_}->{$file_key} += 1;
    }

    Notice that above I also changed the structure of words_all - any given word is likely to appear several times in a file if it appears there once, and there's no need to keep a huge array with many elements repeated. You can just use keys(%{$words_all->{$word}}) to get the list of files a word appears in, and if you need to know the count, you have that too.
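
    For example (sketch only, assuming the $words_all built above), pulling the file list and a total count back out might look like:

    for my $word (sort keys %{$words_all}) {
        my @files = keys %{ $words_all->{$word} };
        my $total = 0;
        $total   += $_ for values %{ $words_all->{$word} };
        print "$word: ", scalar(@files), " file(s), $total occurrence(s)\n";
    }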

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Thanks for your response.

      resumé resume
      Yes I've wrestled with that. At the moment there are both. Which is correct? What could you expect to be typed in? When it's ready I'll ask the site maintainers what they want! Thanks for the pointer.

      Subs
      The subs shown were just for the benefit of the post. There are 752 stop words (I intend to increase this) and 182 HTML punctuation entities, which are read in from separate files. I take your point though.

      Double single quotes
      Good point, thanks. I'd actually gone through the html, found the double single quotes (there were many) and removed them. There were also ` (back tick) quotes. Single and double :-)

      words_all hash ref
      I made a mistake there preparing it for the post. In the app I use a hash (%seen) to keep track of words found in each file and then:

      push @{$words_all->{$_}}, $file_key unless exists $seen{$_};
      I think your method is better. Later, the count could contribute to some form of weighting system.
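
      Just to sketch the idea, with fizbin's word => { file => count } structure and a hypothetical %file_word_totals hash (total words per file), a crude frequency weight might be:

      use strict;
      use warnings;

      # hypothetical sample data in fizbin's word => { file => count } shape
      my $words_all = {
          perl  => { 'a.html' => 4, 'b.html' => 1 },
          monks => { 'a.html' => 2 },
      };
      # hypothetical total word count per file
      my %file_word_totals = ( 'a.html' => 200, 'b.html' => 50 );

      my %weight;
      for my $word (keys %{$words_all}) {
          for my $file (keys %{ $words_all->{$word} }) {
              $weight{$word}{$file} =
                  $words_all->{$word}{$file} / $file_word_totals{$file};
          }
      }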

      Again, many thanks, John

Re: Extracting keywords from HTML
by superfrink (Curate) on Aug 22, 2005 at 06:11 UTC
    I used the following code to get the words from HTML. The javascript removal came from Re: HTML::Strip Problem.
    use HTML::Parser;

    my $text = "";

    # HTML::Parser is broken on Javascript and styles (it leaves
    # it in the text).
    #
    # my $parser = HTML::Parser->new(
    #     'text_h' => [ sub { $text .= " " . shift; }, "dtext" ]
    # );

    my $parser = HTML::Parser->new(
        start_h => [ sub{ $_[0]->{text} .= ' ';
                          $_[0]->{skip}++ if $_[1] eq 'script'
                                          or $_[1] eq 'style'; },
                     'self,tag' ],
        end_h   => [ sub{ $_[0]->{skip}-- if $_[1] eq '/script'
                                          or $_[1] eq '/style'; },
                     'self,tag' ],
        text_h  => [ sub{ $_[0]->{text} .= $_[1] unless $_[0]->{skip} },
                     'self,dtext' ],
    );
    $parser->parse($html_in);
    $text = $parser->{text};
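
    (A minimal way to produce the $html_in the snippet expects, with a made-up filename, would be something like:)

    my $file = 'site_copy/index.html';                     # made-up path
    open my $fh, '<', $file or die "can't open $file: $!";
    my $html_in = do { local $/; <$fh> };                  # slurp the whole file
    close $fh;
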
Re: Extracting keywords from HTML
by socketdave (Curate) on Aug 21, 2005 at 13:49 UTC
    Is performance a problem with your current code?
      I've only taken a rough measure of the time it takes.

      There are about 2k files (32MB). From scratch it takes less than a minute (about 45 seconds on subsequent runs). This includes importing the hash into a DBM::Deep db.

      The idea would be to only create it once and then update it.
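
      The DBM::Deep step can be as simple as the following sketch (the filename and top-level key are assumptions, not my actual code):

      use DBM::Deep;

      my $db = DBM::Deep->new('keywords.db');   # created on first run, reopened after that
      $db->{words} = $words_all;                # the nested hash is copied into the db
      # later runs can read or update $db->{words} in place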

      That's not a problem for me.

      wintel 2.66GHz, 640MB

Re: Extracting keywords from HTML
by wfsp (Abbot) on Sep 13, 2005 at 09:50 UTC
    This is what I ended up with and am using (incorporating fizbin's points).

    I've changed the definition of a word.

    • Hyphenated words are now separated and treated separately
    • Accented words are now ascii-ised
    The other difference is that HTML entities are decoded while extracting text from the HTML.
    All the text (apart from curly punctuation) is Latin-1.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use locale;

    my $config = {
        stop => 'data/stop.txt',
        minw => 3,
        maxw => 20,
    };

    my $words = q(
        and cat's O'Rielly counter-productive crèche
        2000 1980s 777777 x0000
        repeat repeat repeat
    );

    my %keywords = get_word(lc $words);
    for (keys %keywords){
        print "$_\n";
    }

    sub get_word{
        my ($text) = @_;
        my ($min, $max) = ($config->{minw}, $config->{maxw});
        my %stop = get_stop($config->{stop});
        my %acc  = get_accent();
        my %pat  = get_patterns();
        my %keywords;

        for ($text){
            s/$pat{rsqu}/'/g;
            s/$pat{hyph}/ /g;
            s/(.)/$acc{$1}?$acc{$1}:$1/eg;
        }

        my @words = ($text =~ /$pat{word}/g);
        my %seen;
        for (@words){
            s/$pat{posv}//;
            s/$pat{apos}//g;
            next if length() < $min or length() > $max;
            next if /$pat{numb}/ and not /$pat{date}/;
            s/$pat{tail}// if /$pat{numb}/;
            next if exists $stop{$_};
            unless ($seen{$_}){
                $seen{$_}++;
                $keywords{$_} = undef;
            }
        }
        return %keywords;
    }

    sub get_patterns{
        my %pat = (
            posv => q/'s$/,            # possessive s
            apos => qr/'/,             # apostrophe
            hyph => qr/-/,             # hyphen
            numb => qr/\d/,            # number
            date => qr/^[12]\d{3}s?$/, # date like 1960, 1990s
            rsqu => qr/\x{2019}/,      # right single quote
            tail => qr/s$/,            # trailing s for stripping off numbers
            word => qr/([\w']+)/,      # word that may contain apostrophe
        );
        return %pat;
    }

    sub get_accent{
        return qw(
            À A  Á A  Â A  Ã A  Ä A  Å A  Æ AE Ç C
            È E  É E  Ê E  Ë E  Ì I  Í I  Î I  Ï I
            Ð TH Ñ N  Ò O  Ó O  Ô O  Õ O  Ö O  Ø O
            Ù U  Ú U  Û U  Ü U  Ý U  Þ TH ß ss
            à a  á a  â a  ã a  ä a  å a  æ ae ç c
            è e  é e  ê e  ë e  ì i  í i  î i  ï i
            ð th ñ n  ò o  ó o  ô o  õ o  ö o  ø o
            ù u  ú u  û u  ü u  ý y  þ th ÿ y
        );
    }

    sub get_stop {
        # load stop word file
        # sample
        my %stop = (
            and => undef,
            the => undef,
            any => undef,
        );
        return %stop;
    }
    output:

    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" clean_words.pl
    1980
    orielly
    productive
    creche
    cat
    counter
    2000
    repeat
    > Terminated with exit code 0.
