PerlMonks  

word boundary match problem

by martymart (Deacon)
on Jun 23, 2003 at 16:10 UTC ( [id://268220]=perlquestion )

martymart has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,
I wrote a nice little script (or so I thought) for performing a wordcount on a text file, except that it does not always give the right answer. It works most of the time, but on some larger text files (and I'm only talking about 4kb here) it gives me a wordcount of 171 (where the actual wordcount is close to 570). The script is below. Everything is pretty standard, so I figure it must be the match itself that is falling down in certain situations. Any advice would be appreciated.
#!/usr/bin/perl
use strict;
open (SOURCE, "test.txt")||die "Can't open test.txt: $!";
my @source=<SOURCE>;
close (SOURCE);
my $size=0;
while(<@source>){
    $size++ while m{\b\w+\b}g;
}
print "wordcount: $size words\n";
Thanks for your help,
Martymart

Replies are listed 'Best First'.
Re: word boundary match problem
by Mr. Muskrat (Canon) on Jun 23, 2003 at 16:21 UTC

    Change the first while to a foreach and remove the angle brackets from around @source and it'll work.

    #!/usr/bin/perl -w
    use strict;
    open (SOURCE, '<', 'test.txt') || die "Can't open test.txt: $!";
    my @source = <SOURCE>;
    close (SOURCE);
    my $size=0;
    foreach(@source){
        $size++ while m{\b\w+\b}g;
    }
    print "wordcount: $size words\n";

Re: word boundary match problem
by BrowserUk (Patriarch) on Jun 23, 2003 at 16:25 UTC

    The problem is that you're not counting the words in the file you're ... um.. To be honest, I'm not quite sure what you are counting. 171 seems too big a value for you to be counting the number of words in the filenames of files in the current directory whose names match one of the lines in your datafile, but I think that is what you are doing.

    The problem is this line: while(<@source>){

    This isn't iterating over the array @source! It is supplying the contents of @source to the diamond operator <>, which in this context performs a glob on each of its arguments against the current working directory. And you are counting the results of this glob? The number still seems too high ... but whatever. To make your program work you could change it to

    #!/usr/bin/perl
    use strict;
    open (SOURCE, "test.txt")||die "Can't open test.txt: $!";
    my @source=<SOURCE>;
    close (SOURCE);
    my $size=0;
    for( @source ){
        $size++ while m{\b\w+\b}g;
    }
    print "wordcount: $size words\n";

    There are several better ways to count words in files, but that should get you going.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


      I'm not quite sure what you are counting. 171 seems too big a value for you to be counting the number of words in the filenames of files in the current directory whose names match one of the lines in your datafile, but I think that is what you are doing.

      glob returns the literal pattern when there isn't a match, which in this case will be the lines of the file.
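
To see the glob behaviour described here in isolation, a minimal sketch follows. The helper name glob_tokens and the sample lines are made up for illustration. It assumes Perl's default glob (File::Glob), which splits its argument on whitespace and hands back any token without wildcard characters unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: interpolate the lines into one
# string and glob it, which is what <@source> does behind the scenes.
sub glob_tokens {
    my @lines = @_;
    return glob("@lines");
}

# Stand-in for the lines slurped from test.txt.
my @source = ( "the quick brown fox\n", "jumps over\n" );

# No token contains *, ? or [, so each comes back as the literal pattern.
my @tokens = glob_tokens(@source);
print scalar(@tokens), " tokens: @tokens\n";

# A token with a wildcard is only returned if a file actually matches it.
# Assumption: nothing in the current directory starts with this prefix.
my @wild = glob("no_such_prefix_zzz_*");
print scalar(@wild), " results for the wildcard pattern\n";
```

So the loop is fed whitespace-separated tokens rather than lines, and any token that looks like a wildcard pattern may expand to file names or vanish.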

      /J\
      
Re: word boundary match problem
by gellyfish (Monsignor) on Jun 23, 2003 at 16:25 UTC

    I'm surprised this works at all - it certainly isn't doing what you think it is. The loop:

    while(<@source>) { ... }
    is actually doing a glob based on the contents of the file and is making $_ the actual line where (not unexpectedly) there isn't a match.

    I think you meant:

    foreach ( @source ) { ... }
    or
    while(<SOURCE>) { ... }
    If you use the latter, you will need to move the close to after the loop.
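
A rough sketch of the second form, reading straight from the handle and closing it only once the loop has finished. The helper name count_words_from_handle is hypothetical, and the demo uses an in-memory filehandle in place of test.txt so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: count runs of word characters read from an open handle.
sub count_words_from_handle {
    my ($fh) = @_;
    my $size = 0;
    while ( my $line = <$fh> ) {
        # Same pattern as the original script.
        $size++ while $line =~ m{\b\w+\b}g;
    }
    return $size;
}

# In-memory file standing in for test.txt (an assumption for the demo).
my $text = "one two three\nfour five\n";
open( my $fh, '<', \$text ) or die "Can't open in-memory file: $!";
my $size = count_words_from_handle($fh);
close($fh);    # the close comes after the loop has consumed the handle

print "wordcount: $size words\n";    # prints "wordcount: 5 words"
```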

    /J\
    
Re: word boundary match problem
by Zaxo (Archbishop) on Jun 23, 2003 at 16:25 UTC
    Your inner while isn't doing what you want, and the outer one is badly formed. You have set up @source so that for is the loop you want. Match in list context will count the words on a line, so this will do:
    my $size = 0;
    for (@source) {
        $size += () = m{\b\w+\b}g;
    }
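
For anyone unfamiliar with the `() =` trick, here is a small sketch of it on its own. The helper name words_on_line and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: number of \w+ matches in one line.
sub words_on_line {
    my ($line) = @_;
    # m//g in list context returns every match. Assigning that list to ()
    # and evaluating the assignment in scalar context yields the number of
    # elements on the right-hand side.
    my $count = () = $line =~ m{\b\w+\b}g;
    return $count;
}

my $size = 0;
$size += words_on_line($_) for ( "alpha beta", "gamma", "" );
print "wordcount: $size words\n";    # prints "wordcount: 3 words"
```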

    After Compline,
    Zaxo

Re: word boundary match problem
by Tomte (Priest) on Jun 23, 2003 at 16:28 UTC

    Funny, with my little test.txt, it counts too many words, due to umlauts... :)

    I'd modify the line inside the loop to

    $size++,print "-".$1."-" while m{\b(\w+)\b}g;
    to see where it fails.
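
A sketch of why an umlaut can split a word, assuming the file is UTF-8 encoded and is read as raw bytes (the thread does not say which encoding was used). The helper name count_words is hypothetical. Without decoding, the bytes of the umlaut do not match \w, so one word is counted as two.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count \w+ runs in a string.
sub count_words {
    my ($str) = @_;
    my $n = () = $str =~ m{\b\w+\b}g;
    return $n;
}

# "Maedchen" spelled with a-umlaut, written as raw UTF-8 bytes via escapes
# so the example does not depend on the encoding of this source file.
my $bytes = "M\xC3\xA4dchen";

# Decoding turns the two bytes into one character that \w recognises.
my $chars = decode( 'UTF-8', $bytes );

printf "raw bytes: %d word(s)\n", count_words($bytes);    # 2: "M" and "dchen"
printf "decoded:   %d word(s)\n", count_words($chars);    # 1
```

When reading from a file, opening it with an encoding layer such as '<:encoding(UTF-8)' should have the same effect as the explicit decode here.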

    regards,
    tomte


    Hlade's Law:

    If you have a difficult task, give it to a lazy person --
    they will find an easier way to do it.

Node Type: perlquestion [id://268220]
Approved by Mr. Muskrat
Front-paged by broquaint