PerlMonks
Extracting keywords from HTML

by wfsp (Abbot)
on Aug 21, 2005 at 12:41 UTC ( #485522=perlquestion )
wfsp has asked for the wisdom of the Perl Monks concerning the following question:

I have adopted the following definition of a keyword:

Word

  • Case ignored.
  • Min length: 3, Max length: 20
  • Can include a hyphen or an apostrophe, but not at either end (these are stripped). A possessive 's (cat's) is also stripped. (can't, won't, hasn't, etc. are in the stop list.)
  • All other punctuation ignored.
  • Four-digit numbers from 1000 to 2999, with an optional trailing s (1960s). Anything else containing a digit is skipped.
  • Skip common words (stop words).
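Taken together, the rules above amount to a small filter. A minimal sketch (the stop list here is a stub, and keep_word is a hypothetical helper name, not part of the code that follows):

```perl
#!/usr/bin/perl
# Sketch of the keyword rules above; %stop is a stub sample.
use strict;
use warnings;

my %stop = map { $_ => 1 } qw(the and can't won't hasn't);

# Returns the normalised keyword, or undef if the word is skipped.
sub keep_word {
    my ($w) = @_;
    $w = lc $w;                     # case ignored
    $w =~ s/^['-]+//;               # no leading apostrophe/hyphen
    $w =~ s/'s$//;                  # strip possessive 's
    $w =~ s/['-]+$//;               # no trailing apostrophe/hyphen
    return undef if length($w) < 3 or length($w) > 20;
    return undef if $w =~ /\d/ and $w !~ /^[12]\d{3}s?$/; # 1000-2999 (+s)
    return undef if $stop{$w};      # skip stop words
    return $w;
}

print keep_word("Cat's"), "\n";    # cat
print keep_word('1960s'), "\n";    # 1960s
```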

HTML

To preserve the apostrophe, &rsquo; and &#39; are replaced with '.

All other HTML entity punctuation is then removed and the HTML decoded. Apart from punctuation it is all Latin1 (I've checked it - at length!).
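A minimal sketch of that sequence, using the same HTML::Entities module as the code below (sample entities only):

```perl
#!/usr/bin/perl
# Sketch of the entity handling described above (sample entities only).
use strict;
use warnings;
use HTML::Entities;

my $text = q|cat&#39;s O&rsquo;Reilly &ldquo;broken&rdquo;|;
$text =~ s/&rsquo;|&#39;/'/g;    # preserve the apostrophe
$text =~ s/&ldquo;|&rdquo;/ /g;  # punctuation entities become spaces
decode_entities($text);          # void context: decodes in place
print "$text\n";                 # cat's O'Reilly  broken
```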

Update 2
I should have mentioned that the text has already been stripped from an HTML file. (!)
Apologies for any confusion

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

my $config = {
    minw => 3,
    maxw => 20,
};
my $stop = get_stop();
my $punc = get_punc();
my $text = q|
    cat&#39;s dogs O&rsquo;Reilly ad-hoc
    &ldquo;broken&rdquo; hyphen- the and
|;
my $words_all = {};

# contrived loop to show usage
for my $t ($text){
    get_word($t);
}
print "$_\n" for keys %{$words_all};

sub get_word{
    my ($text, $file_key) = @_;
    my ($min, $max) = ($config->{minw}, $config->{maxw});
    for ($text){
        s/&rsquo;|&#39;/'/g;
        s/(&#?\w+;)/exists $punc->{$1} ? ' ' : $1/eg;
    }
    decode_entities($text);
    $text =~ s/[^\w'-]/ /g;
    my @words = split ' ', $text;
    for (@words){
        s/^['-]//g;
        s/['-]s?$//;
        next if length() < $min or length() > $max;
        next if exists $stop->{$_};
        next if /\d/ and not /^[12]\d{3}s?$/;
        next if /--/;
        push @{$words_all->{$_}}, $file_key;
    }
}

sub get_stop{
    # sample
    return { qw(
        and  ''
        any  ''
        the  ''
        they ''
    ) };
}

sub get_punc{
    # sample
    return {
        '&rsquo;' => undef,
        '&lsquo;' => undef,
        '&rdquo;' => undef,
        '&ldquo;' => undef,
    };
}
__DATA__
---------- Capture Output ----------
> "C:\Perl\bin\perl.exe" _new.pl
hyphen
cat
ad-hoc
O'Reilly
broken
dogs
> Terminated with exit code 0.

At the moment the full app is run locally on a copy of the web site.

It's generating 42k words but I'm working on the stop file to try and bring it down.

If this turns out to be fairly stable I'm considering compiling the regexes outside of the loop.

What do you reckon?

winxp, activestate 5.8

Update:
Corrected get_stop() sub

Re: Extracting keywords from HTML
by socketdave (Curate) on Aug 21, 2005 at 13:49 UTC
    Is performance a problem with your current code?
      I've only taken a rough measure of the time it takes.

      There are about 2k files (32MB). From scratch it takes less than a minute (about 45secs on subsequent runs). This includes importing the hash into a DBM::Deep db.

      The idea would be to only create it once and then update it.

      That's not a problem for me.

      wintel 2.66GHz, 640MB

Re: Extracting keywords from HTML
by fizbin (Chaplain) on Aug 21, 2005 at 18:11 UTC

    How do you intend to handle accented letters? Should "resumé" be equivalent to "resume"?

    Right now, as your code stands, those words are not equivalent. If they should be equivalent, you'll want to look at this node I just wrote today that squishes accented letters into their non-accented equivalents.
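    For illustration, the squishing can be done with a character map (only a few sample entries here; the linked node has a fuller treatment):

```perl
#!/usr/bin/perl
# Accent-folding sketch; %acc holds only a few sample entries,
# not a full Latin-1 table.
use strict;
use warnings;
use utf8;

my %acc = ( 'é' => 'e', 'è' => 'e', 'ç' => 'c' );
my $w = 'resumé';
$w =~ s/(.)/$acc{$1} ? $acc{$1} : $1/eg;
print "$w\n";   # resume
```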

    Also, I'd suggest some tweaks in your existing code. For example, I'd change get_stop and get_punc as follows:

    sub get_stop{
        # sample
        my %stop = map { $_ => 1 } qw(and any the they);
        \%stop;
    }

    sub get_punc{
        # sample
        my %punc = map { $_ => ' ' } qw(&rsquo; &lsquo; &rdquo; &ldquo;);
        \%punc;
    }

    Not only does this form make it easier to add new entries, it makes it easier to use in the rest of your code - you don't need all those calls to exists any more:

    s/(&#?\w+;)/$punc->{$1}||$1/eg;
    and
    next if $stop->{$_};
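    This works because each punc value is now a true, non-empty string, so unknown entities fall through to $1 unchanged; a quick check of the idiom:

```perl
#!/usr/bin/perl
# Why get_punc values are ' ': the hash value doubles as the
# replacement text, and unknown entities (undef, hence false)
# fall through to $1 unchanged.
use strict;
use warnings;

my $punc = { '&rsquo;' => ' ', '&ldquo;' => ' ' };
my $s = 'a&rsquo;b &amp; c';
$s =~ s/(&#?\w+;)/$punc->{$1} || $1/eg;
print "$s\n";   # a b &amp; c
```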

    Finally, your code as it stands doesn't actually do quite what you described - as a test give it the data:

    ''words in double single quotes''

    The fix of course is to change the regular expressions used to normalize the data:

    for (@words){
        s/^['-]+//;
        s/['-]+s?$//;
        next if length() < $min or length() > $max;
        next if $stop->{$_};
        next if /\d/ and not /^[12]\d{3}s?$/;
        # next if /--/; # not needed anymore
        $words_all->{$_}->{$file_key} += 1;
    }

    Notice that above I also changed the structure of words_all - any given word is likely to appear several times in a file if it appears there once, and there's no need to keep a huge array with many elements repeated. You can just use keys(%{$words_all->{$word}}) to get the list of files a word appears in, and if you need to know the count, you have that too.
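    A sketch of that structure (the file names here are hypothetical):

```perl
#!/usr/bin/perl
# Sketch of the word => { file => count } structure described above;
# file names are made up for illustration.
use strict;
use warnings;

my $words_all = {};
$words_all->{perl}{'a.html'} += 1 for 1 .. 3;   # "perl" seen 3x in a.html
$words_all->{perl}{'b.html'} += 1;              # and once in b.html

my @files = sort keys %{ $words_all->{perl} };  # files containing "perl"
my $count = $words_all->{perl}{'a.html'};       # count within a.html
print "@files : $count\n";   # a.html b.html : 3
```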

    -- @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
      Thanks for your response.

      resumé / resume
      Yes I've wrestled with that. At the moment there are both. Which is correct? What could you expect to be typed in? When it's ready I'll ask the site maintainers what they want! Thanks for the pointer.

      Subs
      The subs shown were for the benefit of the post. There are 752 stop words (I intend to increase this) and 182 html punctuation entities which are read in from separate files. I take your point though.

      Double single quotes
      Good point, thanks. I'd actually gone through the html, found the double single quotes (there were many) and removed them. There were also ` (back tick) quotes. Single and double :-)

      words_all hash ref
      I made a mistake there preparing it for the post. In the app I use a hash (%seen) to keep track of words found in each file and then:

      push @{$words_all->{$_}}, $file_key unless exists $seen{$_};
      I think your method is better. Later, the count could contribute to some form of weighting system.

      Again, many thanks, John

Re: Extracting keywords from HTML
by superfrink (Curate) on Aug 22, 2005 at 06:11 UTC
    I used the following code to get the words from HTML. The javascript removal came from Re: HTML::Strip Problem.
    my $text = "";

    # HTML::Parser is broken on Javascript and styles (it leaves
    # it in the text).
    #
    # my $parser = HTML::Parser->new(
    #     'text_h' => [ sub { $text .= " " . shift; }, "dtext" ]
    # );

    my $parser = HTML::Parser->new(
        start_h => [ sub{ $_[0]->{text} .= ' ';
                          $_[0]->{skip}++ if $_[1] eq 'script'
                                          or $_[1] eq 'style'; },
                     'self,tag' ],
        end_h   => [ sub{ $_[0]->{skip}-- if $_[1] eq '/script'
                                          or $_[1] eq '/style'; },
                     'self,tag' ],
        text_h  => [ sub{ $_[0]->{text} .= $_[1] unless $_[0]->{skip} },
                     'self,dtext' ]
    );
    $parser->parse($html_in);
    $text = $parser->{text};
Re: Extracting keywords from HTML
by wfsp (Abbot) on Sep 13, 2005 at 09:50 UTC
    This is what I ended up with and am using (incorporating fizbin's points).

    I've changed the definition of a word.

    • Hyphenated words are now separated and treated separately
    • Accented words are now ascii-ised
    The other difference is that HTML entities are decoded while extracting text from the HTML.
    All the text (apart from curly punctuation) is Latin 1.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use locale;

    my $config = {
        stop => 'data/stop.txt',
        minw => 3,
        maxw => 20,
    };
    my $words = q(
        and cat's O'Rielly counter-productive crèche
        2000 1980s 777777 x0000
        repeat repeat repeat
    );
    my %keywords = get_word(lc $words);
    for (keys %keywords){
        print "$_\n";
    }

    sub get_word{
        my ($text) = @_;
        my ($min, $max) = ($config->{minw}, $config->{maxw});
        my %stop = get_stop($config->{stop});
        my %acc  = get_accent();
        my %pat  = get_patterns();
        my %keywords;
        for ($text){
            s/$pat{rsqu}/'/g;
            s/$pat{hyph}/ /g;
            s/(.)/$acc{$1} ? $acc{$1} : $1/eg;
        }
        my @words = ($text =~ /$pat{word}/g);
        my %seen;
        for (@words){
            s/$pat{posv}//;
            s/$pat{apos}//g;
            next if length() < $min or length() > $max;
            next if /$pat{numb}/ and not /$pat{date}/;
            s/$pat{tail}// if /$pat{numb}/;
            next if exists $stop{$_};
            unless ($seen{$_}){
                $seen{$_}++;
                $keywords{$_} = undef;
            }
        }
        return %keywords;
    }

    sub get_patterns{
        my %pat = (
            posv => q/'s$/,            # possessive s
            apos => qr/'/,             # apostrophe
            hyph => qr/-/,             # hyphen
            numb => qr/\d/,            # number
            date => qr/^[12]\d{3}s?$/, # date like 1960, 1990s
            rsqu => qr/\x{2019}/,      # right single quote
            tail => qr/s$/,            # trailing s for stripping off numbers
            word => qr/([\w']+)/,      # word that may contain apostrophe
        );
        return %pat;
    }

    sub get_accent{
        # Latin-1 accented character => ASCII equivalent
        return qw(
            À A  Á A  Â A  Ã A  Ä A  Å A  Æ AE  Ç C
            È E  É E  Ê E  Ë E  Ì I  Í I  Î I  Ï I
            Ð TH  Ñ N  Ò O  Ó O  Ô O  Õ O  Ö O  Ø O
            Ù U  Ú U  Û U  Ü U  Ý U  Þ TH  ß ss
            à a  á a  â a  ã a  ä a  å a  æ ae  ç c
            è e  é e  ê e  ë e  ì i  í i  î i  ï i
            ð th  ñ n  ò o  ó o  ô o  õ o  ö o  ø o
            ù u  ú u  û u  ü u  ý y  þ th  ÿ y
        );
    }

    sub get_stop {
        # load stop word file
        # sample
        my %stop = (
            and => undef,
            the => undef,
            any => undef,
        );
        return %stop;
    }
    output:

    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" clean_words.pl
    1980
    orielly
    productive
    creche
    cat
    counter
    2000
    repeat
    > Terminated with exit code 0.

Node Type: perlquestion [id://485522]
Approved by lidden