Counting words in headlines

by mooseboy (Pilgrim)
on Feb 04, 2003 at 09:13 UTC ( #232502=perlquestion )

mooseboy has asked for the wisdom of the Perl Monks concerning the following question:

Good morning monks,
Hope the regex experts out there can help out on this one: I have a large file of international news stories and want to count how many stories there are from each country (doesn't have to be exact). The file is formatted like so:

    Headline of story is here

    Text of story is here
    Text of story is here
    Text of story is here

    Headline of story is here

    Text of story is here
    Text of story is here
    Text of story is here

Eyeballing the file shows that most headlines do in fact have the country name in them, so it seems like OWTDI would be just to count the occurrences of country names in the headlines only, ignoring the text. How can I modify the regex in the following loop to do that?

    while (<NEWS>) {
        foreach my $country (@countries) {
            $story_count{$country}++ if m/$country/gi;
        }
    }

Thanks in advance, mooseboy

Replies are listed 'Best First'.
Re: Counting words in headlines
by MarkM (Curate) on Feb 04, 2003 at 09:21 UTC

    For an initial attempt, I would go with the following:

    1. Read the file in "paragraph" mode. Detect headlines by locating "paragraphs" that have only a single line of text.
    2. Store a word count for header lines into a hash. Note: Force lowercase as a canonical representation.
    3. Look up each country in the hash to find the count. Note: Force lowercase. See above.

    Example:

    # Maintain a word count for words found in header lines.
    my %header_words;

    # Read text in paragraph mode.
    $/ = '';

    # Read one paragraph at a time.
    while (<NEWS>) {
        # Only consider paragraphs that contain a single line of text.
        if (/\A\s*\S[^\r\n]*\s*\z/) {
            $header_words{lc $_}++ for /(\w+)/g;
        }
    }

    # For each country, obtain the word count.
    for my $country (@countries) {
        my $count = $header_words{lc $country} || 0;
        print "$count $country\n";
    }

      Thanks, seems to work nicely!

      PS: the regex as originally posted was missing its trailing slash; it should read (/\A\s*[^\r\n]+\s*\z/)
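
For anyone who wants to try the approach without a news file to hand, here is a minimal self-contained sketch. It is not MarkM's code: it keeps his paragraph-mode headline test but, as a swapped-in variation, matches each country name against the headline with a regex instead of counting single words. The sample text, the country list and the sub name count_headline_countries are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data in the format described in the question:
# a one-line headline paragraph followed by a multi-line text paragraph.
my $sample = <<'END';
Floods hit northern France

Text of story mentions Spain here
Text of story is here

South Africa wins final

Text of story mentions France here
Text of story is here

Strike continues in France

Text of story is here
Text of story is here
END

# Hypothetical country list; note the multi-word name.
my @countries = ('France', 'Spain', 'South Africa');

# Count, per country name, the headlines that mention it.
sub count_headline_countries {
    my ($fh, @names) = @_;
    my %story_count = map { $_ => 0 } @names;

    local $/ = '';    # paragraph mode, restored when the sub returns
    while (my $para = <$fh>) {
        # Skip paragraphs that span more than one line (story text).
        next unless $para =~ /\A\s*\S[^\r\n]*\s*\z/;

        for my $name (@names) {
            # \Q...\E quotes the name; \b keeps matches on word boundaries.
            $story_count{$name}++ if $para =~ /\b\Q$name\E\b/i;
        }
    }
    return %story_count;
}

# Read the sample through an in-memory filehandle instead of a real file.
open my $news, '<', \$sample or die "Cannot open in-memory file: $!";
my %story_count = count_headline_countries($news, @countries);
close $news;

print "$story_count{$_} $_\n" for @countries;
```

The reason for matching whole names is that a word count built from /(\w+)/g stores "south" and "africa" as separate keys, so a lookup for "south africa" would find nothing. The trade-off is one regex match per country per headline, which may be slower than a single hash lookup if the country list is long.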
