PerlMonks  

What's the best way to do a pattern search like this?

by supernewbie (Beadle)
on Jul 20, 2001 at 10:00 UTC ( #98338=perlquestion )
supernewbie has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Masters,
if I have a file (talk.txt) like this:
baaba ba abab abab abab baaba baaba babaa. abab aaba ba abab ba. bababab abab abab ba aaba. ba bababab aaba abab babaa baaba ba baaba. aaba ba bababab ba bababab abab ba aaba abab baaba abab. ba abab abab ba.
What's the best way to find out how many times each word appears in the file? I mean I want to print out something like this:
abab 12
aaba 4
ba 11
baaba 3
babaa 2
........
Please enlighten a newbie pre-monk....

Re: What's the best way to do a pattern search like this?
by MeowChow (Vicar) on Jul 20, 2001 at 10:12 UTC
    Neglecting for a moment that the devil is in the details:
    sub word_count {
        my %h;
        $h{$_}++ for pop =~ /\w+/g;
        %h;
    }

    ## Example ##
    use Data::Dumper;
    my $s = 'fee fi fo fo fi fee fo fum fum bar baz';
    my %h = word_count($s);
    print Dumper \%h;

    You may want to replace the regex with something like /[a-z]+(?:'[a-z]+)?/gi, in order to properly count contractions such as don't as single words.
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print
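The apostrophe-aware regex suggested above can be tried out with a small sketch like the following. The sub name word_count_contractions and the sample sentence are hypothetical, and the lc call (folding case so that "Don't" and "don't" share a tally) is an added assumption, not part of the original suggestion.

```perl
use strict;
use warnings;

# Count words, treating an internal apostrophe as part of the word,
# so that "don't" is one word rather than "don" and "t".
# (Hypothetical helper; lc is an added assumption to fold case.)
sub word_count_contractions {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for $text =~ /[a-z]+(?:'[a-z]+)?/gi;
    return %count;
}

my %count = word_count_contractions("Don't stop. I don't know why you can't.");
print "$_\t$count{$_}\n" for sort keys %count;
```

Because the pattern has no capturing groups, the match in list context returns each whole matched word, which is what the for loop then tallies.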

      supernewbie wanted an explanation of MeowChow's sub:

      First declare the sub

      sub word_count {

      Next we declare a lexically scoped hash called %h. The % indicates that this is a hash, and the h is a typical MeowChow explanatory long var name :-)

      my %h;

      This is a bit of very idiomatic perl

      $h{$_}++ for pop =~ /\w+/g;

      It is fairly easy to understand if you read it R->L. The expression:

      pop =~ /\w+/g

      pop, called with no argument inside a sub, removes and returns the last element of @_, which is the array holding the arguments passed to the subroutine. Since we pass a single string, this gets us that string. We then use a regular expression to match \w+, which is a run of word characters (letters, digits and underscore, as many in a row as possible), so whitespace and punctuation are left out. Because the match is evaluated in list context by the for, it returns a list of words, which the for iterates over, assigning each value in turn to the magical $_ variable.

      Finally we use our hash to count the occurrences of each word (code). A hash stores key/value pairs. The key we are using is $_. The ++ part increments the value of $h{$_} by one each time we see the key.

      %h

      In a Perl sub, the return value is the value of the last expression evaluated, so this is shorthand for the more usual return %h

      Hope this helps

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
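The walkthrough above can also be written out longhand, one step per statement. This is only a sketch: the name word_count_longhand and the intermediate @words array are illustrative, and it takes the first argument from @_ rather than popping the last one, which is equivalent when a single string is passed.

```perl
use strict;
use warnings;

sub word_count_longhand {
    my ($text) = @_;              # take the argument from @_
    my %h;                        # word => number of times seen
    my @words = $text =~ /\w+/g;  # list context: every run of word characters
    foreach my $word (@words) {
        $h{$word}++;              # a missing key starts at undef, so ++ makes it 1
    }
    return %h;                    # explicit form of the bare %h on the last line
}

my %h = word_count_longhand('fee fi fo fo fi fee fo fum fum bar baz');
print "$_ => $h{$_}\n" for sort keys %h;
```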

Re: What's the best way to do a pattern search like this?
by tachyon (Chancellor) on Jul 20, 2001 at 10:33 UTC

    Here is an example for you using a hash

    # declare our vars
    my (%codes, @array_codes);
    # undef input record sep to get all data at once
    local $/;
    # make an array of codes by splitting DATA on whitespace
    @array_codes = split /\s+/, <DATA>;
    # map the codes to a hash, counting duplicates
    # using a for loop for efficiency
    foreach my $code_key (@array_codes) {
        $codes{$code_key}++;
    }
    # print it out (print, not printf, so a % in the data is not read as a format)
    print "$_\t$codes{$_}\n" for keys %codes;
    __DATA__
    baaba ba abab abab abab baaba baaba babaa. abab aaba ba abab ba. bababab abab abab ba aaba. ba bababab aaba abab babaa baaba ba baaba. aaba ba bababab ba bababab abab ba aaba abab baaba abab. ba abab abab ba.

    Note that map { ... } @array, when used only for its side effects, is just another way of writing for (@array) { ... }. To run this on a file, all you need to do is something like:

    sub count_codes {
        my $file = shift;
        my %codes;    # keep the tallies local to each call
        open (FILE, "<$file") or die "Oops, perl says $!\n";
        local $/;     # slurp the whole file in one read
        my @array_codes = split /\s+/, <FILE>;
        close FILE;
        foreach my $code_key (@array_codes) {
            $codes{$code_key}++;
        }
        print "$_\t$codes{$_}\n" for keys %codes;
    }

    # call sub
    count_codes("/path/to/myfile.txt");

    You have some full stops in there which I have assumed are part of the codes. If they are not you will need to filter them out using a regex in our for loop like this:

    foreach my $code_key (@array_codes) {
        $code_key =~ s/[.]//g;
        $codes{$code_key}++;
    }

    If you want to filter out more characters, add them to the character class between the [ ]

    cheers

    tachyon

    Update

    Removed lazy and inefficient map and replaced with proper for loop. Even typed foreach to remind me not to be so slack.

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      If I don't mention it, someone else will. Don't suggest the use of map in a void context. You are taking the trouble to build a whole return list, which you just throw away. It is more efficient and idiomatic to use for for such tasks.
         MeowChow                                   
                     s aamecha.s a..a\u$&owag.print

        Good point, I'll update the code. It's too much Golf you know, shaving those two chars by using map instead of for.

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

      I tried your method. Everything works great, except the program returns something like:
      ba.  1
      ba   2
      ........
      Should I do a s/\./ / on file.txt before processing it through your function? What if there are other things like ? ! : ; " ' ( ) ...etc..

        You just need to adjust the regex a little.

        my @array_codes = split /\s+/, <FILE>;

        assumes that you're interested in all non-whitespace characters. Changing it to:

        my @array_codes = split /\W+/, <FILE>;

        means that you're only interested in word characters (where word chars are A-Z, a-z, 0-9 and '_'), because the split now happens on runs of non-word characters.

        --
        <http://www.dave.org.uk>

        Perl Training in the UK <http://www.iterative-software.com>
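The difference between the two split patterns discussed above can be seen on a short sample. This is a minimal sketch: the sample string is a shortened stand-in for talk.txt, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = 'ba abab ba. aaba ba.';

# Splitting on whitespace keeps the full stops attached to the words.
my @by_space = split /\s+/, $text;    # ('ba', 'abab', 'ba.', 'aaba', 'ba.')

# Splitting on runs of non-word characters drops them.
my @by_nonword = split /\W+/, $text;  # ('ba', 'abab', 'ba', 'aaba', 'ba')

my (%space_count, %nonword_count);
$space_count{$_}++   for @by_space;
$nonword_count{$_}++ for @by_nonword;

print "whitespace split: ba=$space_count{'ba'}, ba.=$space_count{'ba.'}\n";
print "non-word split:   ba=$nonword_count{'ba'}\n";
```

One thing to watch with split /\W+/: if the text starts with a non-word character (an opening quote, say), split returns an empty first field, which would then be counted as a word unless it is skipped.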

        Hi, you have two options. If you wish to retain ultimate control, split on whitespace and filter the elements in @array_codes using this (as above):

        $code_key =~ s/[.?!:;"'()]//g;

        This filters out all the stuff in the char class. Alternatively you can just grab alphanumerics in the first place like this:

        @array_codes = <DATA> =~ m/\w+/g;

        cheers

        tachyon

        s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
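The two options above agree on simple input but part ways on anything the character class does not list. The sketch below runs both on an invented sample string (the string and the array names @filtered and @matched are assumptions made for illustration).

```perl
use strict;
use warnings;

my $text = q{ba. (abab) "ba"? pre-paid ba!};

# Option 1: split on whitespace, then strip the listed punctuation.
# The statement-modifier for aliases $_ to each element, so s/// edits in place.
my @filtered = split /\s+/, $text;
s/[.?!:;"'()]//g for @filtered;

# Option 2: keep only runs of word characters.
my @matched = $text =~ m/\w+/g;

print "filtered: @filtered\n";
print "matched:  @matched\n";
```

With option 1 the hyphenated word survives as pre-paid, because the hyphen is not in the character class; with option 2 it becomes the two words pre and paid. Which result is wanted depends on the data.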

Re: What's the best way to do a pattern search like this?
by CharlesClarkson (Curate) on Jul 20, 2001 at 10:58 UTC

    Some things to ponder:

    How should the algorithm handle hyphenated words? Should pre-paid become pre and paid or remain pre-paid? Will any words wrap to the next line using a hyphen?

    Are there any slang or shortcut words in the file? How should b4 be handled?

    Is the file short or long? Should the algorithm read the entire file into memory or would it be better to process each line?

    How might you handle dates: 500 A.D., c. 1500 bc.?

    And what about other abbreviations: Mr. Jr. Ave. etc. e.g.


    HTH,
    Charles K. Clarkson
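On the question of file size raised above, a line-at-a-time variant might look like the sketch below. The sub name count_words_by_line is hypothetical, and an in-memory filehandle stands in for talk.txt so that the example is self-contained; for a real file, open it by name instead.

```perl
use strict;
use warnings;

# Count words from a filehandle one line at a time, so only the current
# line and the running tallies are held in memory.
sub count_words_by_line {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        $count{$_}++ for $line =~ /\w+/g;
    }
    return \%count;
}

# An in-memory filehandle stands in for talk.txt here (assumption for the demo).
my $sample = "ba abab abab ba.\nbaaba ba abab.\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $count = count_words_by_line($fh);
close $fh;

# Most frequent first, ties broken alphabetically.
for my $word ( sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count ) {
    print "$word\t$count->{$word}\n";
}
```

Processing per line means a word hyphenated across a line break is seen as two separate pieces, so the wrapping question above still needs its own answer.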
