PerlMonks  

Counting the keywords in the text file

by moviesigh (Initiate)
on Dec 01, 2013 at 03:02 UTC ( [id://1065109]=perlquestion )

moviesigh has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I would like to count the frequency of certain keywords in a text file, sample.txt. For example, I choose "Steve Jobs" and "Executive" as main words, and I would like to count how often "stock option" and "package" occur within 10 words of "Steve Jobs" or "Executive" in the sample text below. The result I expect is 4.

Sample text)

Stock option is the most popular compensation policy in the world these days. Steve Jobs also received huge amount of stock options, and the stock option was exercised before the fiscal year. Different from his compensation package, the other executives received less amount of stock options.

To get the result, I used the code below with this command:

perl code.pl sample.txt "Steve Jobs" "Executive" 10 "stock option" "package"

However, an error message occurs:

Use of uninitialized value $distance in numeric le (<=) at line...

Could you please give me some advice on how to get the result I want? I am attaching the sample text and the code that I used. The sample text contains three different articles, separated by "Document ", so I expect to get results for the three articles.

I am looking forward to your responses. I hope you all have a great weekend! I really appreciate it in advance.

PERL code)

use strict;
use warnings;

my ($filename, @mainword, $distance, @search) = @ARGV;
my $content;
open my $fh, '<', $filename or die $!;
local $/ = undef;
$content = <$fh>;
close $fh;

my @docs = split 'Document ', $content;
foreach my $doc ( @docs ) {
    my $count = 0;
    my $mainword = '(' . (join '|', map { "\Q$_\E" } @mainword) . ')';
    my $search   = '(' . (join '|', map { "\Q$_\E" } @search) . ')';
    for (my $dist = 0; $dist <= $distance; $dist++) {
        while ( $doc =~ / (?:^|\W) $search (?= (?:\W++\w++){$dist} \W++\Q$mainword\E ) /ixsg ) {
            print " found [$1] at ", $-[1], "\n";
            $count++;
        }
        while ( $doc =~ / (?:^|\W) \Q$mainword\E (?= (?:\W++\w++){$dist} \W++$search ) /ixsg ) {
            print "-found [$1] at ", $-[1], "\n";
            $count++;
        }
    }
    print "match: $count\n";
}

Replies are listed 'Best First'.
Re: Counting the keywords in the text file
by kcott (Archbishop) on Dec 01, 2013 at 03:19 UTC

    G'day moviesigh,

    Welcome to the monastery.

    You have a problem with:

    my ($filename, @mainword, $distance, @search) = @ARGV;

    That code will assign values as follows:

    • $filename will be $ARGV[0]
    • @mainword will be @ARGV[1 .. $#ARGV] [Note: all of @ARGV has now been used!]
    • $distance will be undefined
    • @search will be an empty array
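The assignment described in the list above can be reproduced in isolation. This is a minimal sketch; `greedy_parse` is a hypothetical helper that only mirrors the left-hand side of the original line, fed with the arguments from the question's command line.

```perl
use strict;
use warnings;

# Hypothetical helper with the same left-hand side as the original line.
# The first array in a list assignment takes every remaining value.
sub greedy_parse {
    my ($filename, @mainword, $distance, @search) = @_;
    return ($filename, \@mainword, $distance, \@search);
}

my @args = ('sample.txt', 'Steve Jobs', 'Executive', 10, 'stock option', 'package');
my ($filename, $mainword, $distance, $search) = greedy_parse(@args);

print "filename: $filename\n";
print "mainword: @$mainword\n";                                     # everything after the filename
print 'distance: ', (defined($distance) ? $distance : 'undef'), "\n";
print 'search:   ', scalar(@$search), " element(s)\n";              # nothing left for @search
```

Because `$distance` ends up undefined, the later `<=` comparison is what triggers the "uninitialized value" warning.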

    For reading complex arguments from the command line, I'd suggest Getopt::Long.
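A minimal sketch of how Getopt::Long could separate the two lists from the distance. The option names `--main`, `--distance`, and `--search`, the default of 10, and the helper `parse_args` are assumptions for illustration and do not come from the original post.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical option names; repeat --main and --search once per term.
sub parse_args {
    my @argv = @_;
    my (@mainword, @search);
    my $distance = 10;    # assumed default
    GetOptionsFromArray(
        \@argv,
        'main=s'     => \@mainword,    # array destination collects repeated options
        'distance=i' => \$distance,
        'search=s'   => \@search,
    ) or die "Usage: $0 --main WORD ... --distance N --search WORD ... FILE\n";
    my $filename = shift @argv;        # the first remaining non-option argument
    return ($filename, \@mainword, $distance, \@search);
}

# Demonstration with a fixed list; a real script would pass @ARGV instead.
my ($filename, $mainword, $distance, $search) = parse_args(
    '--main', 'Steve Jobs', '--main', 'Executive',
    '--distance', 10,
    '--search', 'stock option', '--search', 'package',
    'sample.txt',
);
print "file=$filename distance=$distance\n";
print "main:   ", join(', ', @$mainword), "\n";
print "search: ", join(', ', @$search), "\n";
```

With these assumed names, the command would look like: perl code.pl --main "Steve Jobs" --main Executive --distance 10 --search "stock option" --search package sample.txt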

    Update: Oops! Typo in my array slice: s/$ARGV[1 .. $#ARGV]/@ARGV[1 .. $#ARGV]/

    -- Ken

      Thank you very much for your comments. I will check the module!

      Sean
Re: Counting the keywords in the text file
by Athanasius (Archbishop) on Dec 01, 2013 at 08:46 UTC

    Hello moviesigh, and welcome to the Monastery!

    In addition to the problem identified by kcott, there are some problems with your matching logic. Here are two:

    First, note that \Q disables pattern metacharacters until the next occurrence of \E (see “Escape Sequences” in Regular Expressions). But the regex in your first while loop uses the variable $mainword which has been initialised to (Steve\ Jobs|Executive), and the pipe symbol | needs to be a metacharacter for the regex logic to work.
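The effect of the second \Q can be checked with a small sketch. The helpers `build_alternation`, `matches_plain`, and `matches_quoted` are hypothetical names introduced here; the first one builds the same string the original code stores in $mainword.

```perl
use strict;
use warnings;

# Builds "(Steve\ Jobs|Executive)" the same way the original code does.
sub build_alternation {
    my @words = @_;
    return '(' . join('|', map { quotemeta($_) } @words) . ')';
}

# Interpolated as-is: ( | ) keep their regex meaning.
sub matches_plain {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/i ? 1 : 0;
}

# Interpolated inside \Q...\E: ( | ) and the backslash are escaped again,
# so only the literal pattern string itself would match.
sub matches_quoted {
    my ($text, $pattern) = @_;
    return $text =~ /\Q$pattern\E/i ? 1 : 0;
}

my $pattern = build_alternation('Steve Jobs', 'Executive');
my $text    = 'Steve Jobs met another executive.';

print "pattern: $pattern\n";
print "plain:   ", matches_plain($text, $pattern), "\n";     # 1
print "quoted:  ", matches_quoted($text, $pattern), "\n";    # 0
```

Since the individual words were already quoted when the alternation was built, interpolating $mainword without a further \Q is enough.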

    Second, I have my doubts about the for loop — which, by the way, would be better written:

    for my $dist (0 .. $distance) {

    Quantifying a regex match using {0} (which is what you get on the first iteration) serves no purpose. But I don’t think you want a loop here at all? Something like {1,$dist} might better capture the intention?
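A minimal sketch of the range-quantifier idea, under some assumptions: it uses {0,$distance} so that adjacent terms still count, it tolerates suffixes such as a plural "s" via \w*, and it counts each left-hand occurrence at most once. The helpers `count_followed_by` and `count_near` are hypothetical names, and this is not claimed to reproduce the expected result of 4 for the sample text.

```perl
use strict;
use warnings;

# Hypothetical helper: counts occurrences of any term in @$left that are
# followed, after at most $distance intervening words, by a term in @$right.
sub count_followed_by {
    my ($text, $left, $right, $distance) = @_;
    my $left_re  = join '|', map { quotemeta($_) } @$left;
    my $right_re = join '|', map { quotemeta($_) } @$right;
    my $count = 0;
    # One range quantifier replaces the loop over exact distances.
    # The lookahead consumes nothing, so later occurrences are still seen.
    $count++
        while $text =~ /\b(?:$left_re)\w*(?=(?:\W+\w+){0,$distance}?\W+(?:$right_re))/gi;
    return $count;
}

# Mirrors the two while loops of the original: search term before a main
# word, plus main word before a search term.
sub count_near {
    my ($text, $main, $search, $distance) = @_;
    return count_followed_by($text, $search, $main, $distance)
         + count_followed_by($text, $main, $search, $distance);
}

my @main   = ('Steve Jobs', 'Executive');
my @search = ('stock option', 'package');
my $text   = 'The package went to the executive. Steve Jobs kept his stock options.';

print "match: ", count_near($text, \@main, \@search, 10), "\n";
```

Counting left-hand occurrences is a design choice carried over from the original loops; counting every search term that has a main word nearby on either side would need a different approach.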

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

      Thank you very much for your comments. I will try to check and fix the problems you mentioned!

      Sean
