Counting the keywords in the text file

moviesigh has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I would like to count the frequency of certain keywords in a text file, sample.txt. For example, I choose "Steve Jobs" and "Executive" as the main words, and I would like to count how often "stock option" and "package" occur within 10 words of "Steve Jobs" or "Executive" in the sample text below. The result I expect is 4.

Sample text)

Stock option is the most popular compensation policy in the world these days. Steve Jobs also received huge amount of stock options, and the stock option was exercised before the fiscal year. Different from his compensation package, the other executives received less amount of stock options.

To get the result, I used the code below with this command:

perl code.pl sample.txt "Steve Jobs" "Executive" 10 "stock option" "package"

However, an error message occurs: "Use of uninitialized value $distance in numeric le (<=) at line..."

Could you please give me some advice on how to get the result I want? I am attaching the sample text and the code that I used. The sample text contains three different articles, separated by "Document ", so I expect to get results for the three articles.

I am looking forward to your responses. I hope you all have a great weekend! I really appreciate it in advance.

PERL code)


use strict;
use warnings;

my ($filename, @mainword, $distance, @search) = @ARGV;

my $content;
open my $fh, '<', $filename or die $!;
local $/ = undef;
$content = <$fh>;
close $fh;

my @docs = split 'Document ', $content;
foreach my $doc ( @docs ) {

    my $count = 0;

    my $mainword = '(' . (join '|', map { "\Q$_\E" } @mainword) . ')';
    my $search = '(' . (join '|', map { "\Q$_\E" } @search) . ')';


    for (my $dist = 0; $dist <= $distance; $dist++) {
        while ( $doc =~ /
            (?:^|\W)                        
            $search                        
            (?=                           
                (?:\W++\w++){$dist}       
                \W++\Q$mainword\E         
            )
            /ixsg
        )
        {
            print " found [$1] at ", $-[1], "\n";

            $count++;
        }

        while ( $doc =~ /
            (?:^|\W)
            \Q$mainword\E
            (?=
                (?:\W++\w++){$dist}
                \W++$search
            )
            /ixsg
        )
        {
            print "-found [$1] at ", $-[1], "\n";
            $count++;
        }
    }

    print "match: $count\n";
}


Re: Counting the keywords in the text file
by kcott (Archbishop) on Dec 01, 2013 at 03:19 UTC

G'day moviesigh,

Welcome to the monastery.

You have a problem with:

my ($filename, @mainword, $distance, @search) = @ARGV;

That code will assign values as follows:

$filename will be $ARGV[0]
@mainword will be @ARGV[1 .. $#ARGV] [Note: all of @ARGV has now been used!]
$distance will be undefined
@search will be an empty array
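The assignment behaviour listed above can be checked with a minimal sketch. The argument values are copied from the command line in the question, and @args is a hypothetical stand-in for @ARGV:

```perl
use strict;
use warnings;

# Stand-in for @ARGV, using the arguments from the question.
my @args = ('sample.txt', 'Steve Jobs', 'Executive', 10, 'stock option', 'package');

# The first array in a list assignment takes every remaining value,
# so nothing is left over for $distance or @search.
my ($filename, @mainword, $distance, @search) = @args;

printf "filename: %s\n", $filename;
printf "mainword: %d element(s)\n", scalar @mainword;
printf "distance: %s\n", defined $distance ? $distance : 'undef';
printf "search:   %d element(s)\n", scalar @search;
```
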

For reading complex arguments from the command line, I'd suggest Getopt::Long.
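As a rough sketch of that suggestion: the option names --main, --search and --distance, the default of 10, and the parse_args helper are all assumptions made for illustration, not anything from the original script. Giving an array reference as the destination lets an option be repeated.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse a list of arguments into a hash reference.
sub parse_args {
    my @argv = @_;
    my (@main, @search);
    my $distance = 10;    # assumed default

    GetOptionsFromArray(
        \@argv,
        'main=s'     => \@main,      # repeatable: --main "Steve Jobs" --main Executive
        'search=s'   => \@search,    # repeatable
        'distance=i' => \$distance,
    ) or die "Bad options\n";

    # Whatever is not an option is left in @argv; take the first as the file.
    my $filename = shift @argv;
    die "No input file given\n" unless defined $filename;

    return {
        filename => $filename,
        main     => \@main,
        search   => \@search,
        distance => $distance,
    };
}

my $opts = parse_args(
    '--main', 'Steve Jobs', '--main', 'Executive',
    '--distance', 10,
    '--search', 'stock option', '--search', 'package',
    'sample.txt',
);
print "file=$opts->{filename} distance=$opts->{distance}\n";
```

In a real script the same option specification could be passed to GetOptions, which reads @ARGV directly.
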

Update: Oops! Typo in my array slice: s/$ARGV[1 .. $#ARGV]/@ARGV[1 .. $#ARGV]/

-- Ken


Re^2: Counting the keywords in the text file

by moviesigh (Initiate) on Dec 05, 2013 at 01:13 UTC

Thank you very much for your comments. I will check the module!

Sean


Re: Counting the keywords in the text file
by Athanasius (Archbishop) on Dec 01, 2013 at 08:46 UTC

Hello moviesigh, and welcome to the Monastery!

In addition to the problem identified by kcott, there are some problems with your matching logic. Here are two:

First, note that \Q disables pattern metacharacters until the next occurrence of \E (see “Escape Sequences” in Regular Expressions). But the regex in your first while loop wraps the variable $mainword in \Q...\E, and $mainword has already been initialised to (Steve\ Jobs|Executive). The parentheses and the pipe symbol | need to remain metacharacters for the regex logic to work.
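A small sketch of the difference, building $mainword the same way as the original script; the one-line $text is a shortened sample chosen for illustration:

```perl
use strict;
use warnings;

my @mainword = ('Steve Jobs', 'Executive');

# Same construction as in the question: each phrase is quoted once here.
my $mainword = '(' . (join '|', map { "\Q$_\E" } @mainword) . ')';

my $text = 'Steve Jobs also received huge amount of stock options.';

# Quoting a second time turns "(", "|" and ")" into literal characters,
# so the pattern only matches the literal string held in $mainword.
my $quoted_again = $text =~ /\Q$mainword\E/i ? 1 : 0;

# Plain interpolation keeps the group and the alternation working.
my $interpolated = $text =~ /$mainword/i ? 1 : 0;

print "pattern text:   $mainword\n";
print "with \\Q...\\E:   $quoted_again\n";
print "without:        $interpolated\n";
```
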

Second, I have my doubts about the for loop — which, by the way, would be better written:

for my $dist (0 .. $distance) {

Quantifying a regex match using {0} (which is what you get on the first iteration) serves no purpose. But I don’t think you want a loop here at all? Something like {1,$dist} might better capture the intention?
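One way to drop the loop is a single range quantifier. This is only a sketch: the helper names alternation and count_followed_by are made up for illustration, and it uses {0,$max_gap} rather than {1,$dist} so that adjacent phrases also count. Note that it counts occurrences of the leading phrase, as the original while loops do, which may not be the exact tally the question expects.

```perl
use strict;
use warnings;

# Build an alternation such as (?:Steve\ Jobs|Executive) from literal phrases.
sub alternation {
    my @phrases = @_;
    return '(?:' . join('|', map { quotemeta($_) } @phrases) . ')';
}

# Count occurrences of any phrase in @$lead that are followed by any phrase
# in @$trail with at most $max_gap words in between.
sub count_followed_by {
    my ($text, $lead, $trail, $max_gap) = @_;
    my $lead_re  = alternation(@$lead);
    my $trail_re = alternation(@$trail);

    my $count = 0;
    $count++ while $text =~ /\b$lead_re(?=(?:\W+\w+){0,$max_gap}?\W+$trail_re)/gi;
    return $count;
}

my $text = 'Steve Jobs also received huge amount of stock options';
my @main   = ('Steve Jobs', 'Executive');
my @search = ('stock option', 'package');

print count_followed_by($text, \@main, \@search, 10), "\n";   # five words in between: counted
print count_followed_by($text, \@main, \@search, 2),  "\n";   # gap too large: not counted
```

Calling it a second time with the two phrase lists swapped covers the case where the search phrase comes before the main word.
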

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Counting the keywords in the text file

by moviesigh (Initiate) on Dec 05, 2013 at 01:15 UTC

Thank you very much for your comments. I will try to check and fix the problems you mentioned!

Sean


