Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

(ichimunki) Re x 2: Excluding words with more than three a

by ichimunki (Priest)
on Dec 20, 2001 at 22:25 UTC ( #133552=note: print w/replies, xml ) Need Help??

in reply to Re: Excluding words with more than three a
in thread Excluding words with more than three a

Aha! Even if you are not our original AnonyMonk querant, we at least from atcroft get code which purports to work but for which the author is still asking questions! Excellent.

While there is good reason to avoid *any* premature optimization at the "tax the machine" level... we have some algorithmic improvements we might make here to enhance readability and future maintainence. To wit: you are setting up two loops, the foreach, then the grep-- as such testing each word as many times as there are tests. In some sense this is necessary since we have more than one test we need to apply to each word, in another sense it probably isn't the best way to go about it.

I also think we might be able to avoid keeping the whole list in memory (although I understand we might hard code the list into our script for testing). We'd really like to be able to adapt our technique to streams of data, so we don't necessarily have to have all the data at once and then we could potentially filter an infinite list.

With this in mind, let's see if we can't find a test that, in as few steps as possible, will evaluate a single word (and for the purposes of this demo, I'm going to assume we're looking for words which have double characters, rather than the multiples shown, we can easily adjust the REs again later to do the larger tests). Working with your code we might come up with something like $my_word =~ m/a.*a/. Which is great, except that .* is a greedy little thing, and even if we feed it 'ababab', it will start at 'a' then find 'aba'. Ah, a match, but it's not done-- it's trying to match as much as possible (which is why we call it greedy). It keeps examining the string until we get to 'ababa'... and even then it has to verify that there are no further 'a's in our string. If we use $my_word =~ m/a.*?a/ we tell the RE engine to stop looking as soon as it finds the first match, 'aba' in this case.

So now we've got the first test... how to do the other tests? We could set up a truth value, like $is_matching, and then do a series of if( $my_word =~ m/a.*?a/ ) { $is_matching = 1; } statements, changing the contents of the regex each time, but then we duplicate our $is_matching blocks. So we might try if ( $my_word =~ m/a.*?a/ || $my_word =~ m/b.*?b/ ) { $is_matching = 1; }. But we can probably skip the truth condition now and just put our action into the block. So for this exercise print "$my_word\n" if (...) (note we're switching the block to the front and putting if at the end). Also, we're separating these matches and using || to short-circuit. Since our condition is true if any part is true, this stops testing as soon as it finds a true part.

Finally we need to get our test into a block worth worrying about. We might make it a portable sub, we might make it an anonymous code reference (almost a sub), and we might just put it right in the block of some other list iterator like foreach, map, or grep-- and yes, we could even call perl with the -n flag and have it feed lines from STDIN to our program.

But I think we might opt for a simple while( <STDIN> ) { ... test ... } approach. This has the same effect on our program as calling it with perl -n but lets us forget to do that. Then we can put our program into a stream easily. On my W2K machine I type type dict | perl > dict2 (on Unix/Linux this would be cat dict | ./ > dict2-- these both take a dict file, filter it, and write it to a dict2 file). And since we've written the program fairly usefully, we might later modify it so that we can call it with perl dict dict2... but I digress.

Here is our program so far:
#!/usr/bin/perl -w use strict; while( <STDIN> ) { print if not( /a.*?a/ || /b.*?b/ ); }

Yes, we got rid of $my_word. We can rely on the fact that the while is assigning each line from STDIN to $_ and giving that to our block and the match is going to default to comparing our expression against $_ if no variable is given-- and print will print $_ if no argument is given, too.

There are further things we might eventually consider to optimize this routine for the CPU (after we're done writing it, though), like adding the /o flag to our regular expressions. If they aren't going to change, there's no reason for perl to recompile them each time it uses them, which it might if we don't tell it not to. If we change the print to a return, we also have a block which is easy to put inside a grep or map to take a list and assign to another list. And we could go on forever... and my apologies if parts of this rather long post were oversimplified.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://133552]
[choroba]: Email::Address DoS

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (8)
As of 2018-06-20 11:52 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (116 votes). Check out past polls.