PerlMonks  

regular expressions

by mbgbioinfo (Novice)
on Jun 06, 2015 at 19:29 UTC ( #1129310=perlquestion )

mbgbioinfo has asked for the wisdom of the Perl Monks concerning the following question:

Dear PerlMonks, I would like to ask for your wisdom once again. I want to create a program which will open a file, read all its lines into an array, and find the words which have 4 or more consonants in a row. I created the following program, but my terminal shows all the words (it's as if grep is not working).
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    open(MYFILE, "fil") || die "$!";
    my @fil = <MYFILE>;
    close(MYFILE);
    chomp(@fil);
    my @outcome = grep(/[bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]{4,}/, @fil);
    print @outcome, "\n";

Replies are listed 'Best First'.
Re: regular expressions
by toolic (Bishop) on Jun 06, 2015 at 20:47 UTC
    Your grep filters out entire lines. Do you have multiple words on each line? If so, all you need is one word on a line to have 4 consecutive consonants to get a match.

    Another way, using a negated character class:

        use warnings;
        use strict;
        use Data::Dumper;

        my @words;
        while (<DATA>) {
            chomp;
            push @words, grep { /[^aeiouy]{4}/i } split;
        }
        print Dumper(\@words);

        __DATA__
        abc def ghi AAAAAA jlkm opqr jhggjyg 123 annn jkjkkj bcdefgh

    Prints:

        $VAR1 = [
                  'jlkm',
                  'jhggjyg',
                  'jkjkkj'
                ];
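    To make the line-versus-word distinction concrete, here is a minimal sketch. The sample lines and the sub names lines_with_cluster and words_with_cluster are made up for illustration; the character class is the one from the original question. It contrasts grepping whole lines with splitting each line into words first:

```perl
use strict;
use warnings;

# Character class copied from the original question (y is not counted as a consonant).
my $cluster = qr/[bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]{4,}/;

# Line-level filter: keeps every line that contains at least one matching word.
sub lines_with_cluster {
    my @lines = @_;
    return grep { $_ =~ $cluster } @lines;
}

# Word-level filter: split each line on whitespace first, then test each word.
sub words_with_cluster {
    my @lines = @_;
    return grep { $_ =~ $cluster } map { split(' ', $_) } @lines;
}

# Hypothetical sample input, not taken from the thread.
my @sample = ('the strengths of rhythm', 'a plain line');

print "lines: $_\n" for lines_with_cluster(@sample);
print "words: $_\n" for words_with_cluster(@sample);
```

    With this sample, the line-level filter returns the whole first line, while the word-level filter returns only "strengths".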
      I do not think that a negated character class is a good idea for looking for groups of consonants, because, for example, it will pick groups of digits, as shown below under the Perl debugger:
        DB<1> $_ ="123 annn jkjkkj bcdefgh 2015 ";

        DB<2> push @words, grep { /[^aeiouy]{4}/i } split;

        DB<3> x \@words;
      0  ARRAY(0x600500b18)
         0  'jkjkkj'
         1  2015
        DB<4>

        I agree that doubly-negated character classes can be very tricky, but with care, they can be managed to good effect.

        I think of it this way: Start with  [^\W] which is the same as  [\w] (or just \w). As you point out, this includes digits and _ (underscore) as well as alphas. "Subtract", as it were, the digits with  [^\W\d] and underscore with  [^\W\d_] and you're left with all alpha characters. Then subtract your chosen vowels  [^\W\d_aeiouyAEIUOY] and you're done!

            c:\@Work\Perl\monks>perl -wMstrict -le
            "my $s = '123 annn xyzzy wwwewww xxx9xxx vvv_vvv eieio p pp ppp 2015 vwxz vwxzpdq';
             ;;
             my $consonant = qr{ [^\W\d_aeiouyAEIUOY] }xms;
             ;;
             printf qq{'$_' } for $s =~ m{ $consonant{4,} }xmsg;
            "
            'vwxz' 'vwxzpdq'

        All this is easier to manage, IMHO, with POSIX character classes or Unicode properties (if you're brave enough to venture out onto the thin, slippery ice of Unicode); both the following definitions work the same in the code above:
            my $consonant = qr{ [^[:^alpha:]aeiouyAEIUOY] }xms;
            my $consonant = qr{ [^\P{PosixAlpha}aeiouyAEIUOY] }xms;
        YMMV. See perlrecharclass, perluniprops.

        (See also the experimental Extended Bracketed Character Classes of version 5.18+; I can't give any examples using these ATM.)
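        For readers curious about the Extended Bracketed Character Classes mentioned above, here is a hedged sketch of how the same "letters minus vowels" idea might look with the (?[ ... ]) construct. It assumes perl 5.18 or later, restricts itself to ASCII letters via [a-zA-Z] (my choice, not something from the thread), and uses a made-up sub name, consonant_runs. Treat it as an illustration rather than a recommendation:

```perl
use strict;
use warnings;
# (?[ ... ]) was flagged experimental from 5.18 until 5.36; silence that warning.
no warnings 'experimental::regex_sets';

# Return every run of four or more consonants found in the string.
# The set is "ASCII letters minus the chosen vowels" (y counted as a vowel,
# as elsewhere in this thread). Call this in list context.
sub consonant_runs {
    my ($text) = @_;
    return $text =~ /(?[ [a-zA-Z] - [aeiouyAEIOUY] ]){4,}/g;
}

my $s = '123 annn xyzzy wwwewww xxx9xxx vvv_vvv eieio p pp ppp 2015 vwxz vwxzpdq';
print "'$_' " for consonant_runs($s);
print "\n";
```

        The set-difference operator reads closer to the "subtract the vowels" description than the doubly-negated class does, at the cost of requiring a newer perl.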


        Give a man a fish:  <%-(-(-(-<

Re: regular expressions
by stevieb (Canon) on Jun 06, 2015 at 19:48 UTC

    Here's one way you could grab out the words. I didn't use grep(), I just did a comparison of each word against the regex directly. I also changed your open() statement to coincide with the recommended way to write them, and used ranges in the regex just so you're aware they are available. Note the 'i' after the regex; that's to make the regex case-insensitive.

        #!/usr/bin/perl
        use strict;
        use warnings;

        open my $fh, '<', "input.txt" or die "Can't open the file: $!";

        my @words;

        for my $line (<$fh>){
            chomp $line;
            for my $word (split(/\s+/, $line)){
                if ($word =~ /[b-df-hj-np-tv-z]{4}/i){
                    push @words, $word;
                }
            }
        }

        print "$_\n" for @words;
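    One thing worth checking when switching from an explicit list to ranges is whether both classes cover the same letters. The sketch below (class_difference is a made-up helper name) compares the range class from the code above with the explicit list from the original question:

```perl
use strict;
use warnings;

# The range-based class from the reply above and the explicit list from the question.
my $ranges   = qr/[b-df-hj-np-tv-z]/i;
my $explicit = qr/[bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]/;

# Return the lowercase letters on which the two classes disagree.
sub class_difference {
    return grep { ($_ =~ $ranges) xor ($_ =~ $explicit) } 'a' .. 'z';
}

print "letters treated differently: ", join(' ', class_difference()), "\n";
```

    The only difference is y: the range v-z includes it, while the list in the question leaves it out. Whether y counts as a consonant is a choice for the asker to make.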

    -stevieb

Re: regular expressions
by Anonymous Monk on Jun 06, 2015 at 20:43 UTC
    perl -lne 'print for /\w*[^\WaeiouyAEIOUY]{4,}\w*/g' /usr/share/dict/words
Re: regular expressions
by Marshall (Canon) on Jun 07, 2015 at 15:36 UTC
    This is actually pretty good. But...

    One flaw is that the regex does not capture multiple tokens that meet the pattern - the parens below do that and the result is an array. This is called "match global" in Perl lingo.

    Another problem is that the regex syntax to match 4 or more is not quite right. {4,} should be {4,}?. The first version would just match 4 at a minimum, but no more. That following ? does matter!

    Also to split on "words", space separated tokens, I used the default "split". There are actually 2 different versions of this "default" split. One without parens and one with parens and they work slightly differently when dealing with the beginning of a line. Here, it makes no difference.
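    If the two "default" splits meant here are the bare split (shorthand for split ' ', $_) and a pattern split such as split /\s+/ (that is my reading, not something the post states), the difference at the beginning of a line can be sketched like this. The sub names and the sample line are hypothetical:

```perl
use strict;
use warnings;

# Bare split is shorthand for split(' ', $_): it skips leading whitespace.
sub awk_style_split {
    my ($text) = @_;
    return split ' ', $text;
}

# A pattern split keeps an empty leading field when the line starts with whitespace.
sub pattern_split {
    my ($text) = @_;
    return split /\s+/, $text;
}

# Hypothetical input line with leading whitespace.
my $line = "  abc  def ghi\n";

my @awk = awk_style_split($line);
my @pat = pattern_split($line);

printf "split ' '   gives %d fields\n", scalar @awk;
printf "split /\\s+/ gives %d fields\n", scalar @pat;
```

    Here the first form yields three fields and the second yields four, because the pattern split keeps an empty string in front of "abc". When the words are fed to a grep that looks for consonant runs, the extra empty field never matches, which is why it makes no difference in this thread.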

    I also used a Perl "trick" that can embed comments within the code. This "trick" can also be used to generate documentation in web format. Here I just used it to put my output/comments into the compilable and runnable code. That way I don't have to send you 2 different files, one with code and one with output.

    Oh, using the -w switch for a single program like this turns on warnings. The "use warnings;" is not necessary. This also works under Windows. Wow!

    I always use strict; and use warnings;. There is a small performance hit for this. But it is almost always worth it. Keep doing that!

        #!/usr/bin/perl -w
        use strict;

        while (<DATA>)
        {
           print "INPUT LINE: $_";
           my @four_constants = grep{/([bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]{4,}?)/g} split;
                                #the ? allows more than a min of 4!
           next unless @four_constants;
           print "output: @four_constants", "\n";
        }

        =EXAMPLE OUTPUT
        INPUT LINE: xyy xyz
        INPUT LINE: bBbB
        output: bBbB
        INPUT LINE: abc bacx
        INPUT LINE: abca xyzz
        INPUT LINE: abCA XXZZ
        output: XXZZ
        INPUT LINE: xxyyzzz
        INPUT LINE: bckz klmx
        output: bckz klmx
        INPUT LINE: BKZXXXXXXXXXXXX
        output: BKZXXXXXXXXXXXX
        =cut

        __DATA__
        xyy xyz
        bBbB
        abc bacx
        abca xyzz
        abCA XXZZ
        xxyyzzz
        bckz klmx
        BKZXXXXXXXXXXXX
      ... the regex syntax to match 4 or more is not quite right. {4,} should be {4,}?. The first version would just match 4 at a minimum, but no more.

      The quantifier {4,} will match as much as possible (while still allowing an overall match), but at least four of the quantified atom. The quantifier {4,}? will match as little as necessary for an overall match, but at least four of the quantified atom.

          c:\@Work\Perl\monks>perl -wMstrict -le
          "my @strings = qw(vw vwx vwxz vwxzp vwxzpd vwxzpdq);
           ;;
           my $consonant = qr{ [bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ] }xms;
           ;;
           for my $s (@strings) {
             print qq{'$s'};
             print qq{{4,} matched; captured '$1'} if $s =~ m{ ($consonant{4,}) }xms;
             print qq{{4,}? matched; captured '$1'} if $s =~ m{ ($consonant{4,}?) }xms;
             print '';
             }
          "
          'vw'

          'vwx'

          'vwxz'
          {4,} matched; captured 'vwxz'
          {4,}? matched; captured 'vwxz'

          'vwxzp'
          {4,} matched; captured 'vwxzp'
          {4,}? matched; captured 'vwxz'

          'vwxzpd'
          {4,} matched; captured 'vwxzpd'
          {4,}? matched; captured 'vwxz'

          'vwxzpdq'
          {4,} matched; captured 'vwxzpdq'
          {4,}? matched; captured 'vwxz'
      See perlre, perlretut, and perlrequick.
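      A related side note: inside grep only the truth value of the match is used, so the greedy, lazy, and fixed-count forms all select the same words. A small sketch of that point (filter_words is a made-up helper, and the word list is the one from the example above, shortened):

```perl
use strict;
use warnings;

# Consonant class from the original question.
my $c = qr/[bBcCdDfFgGhHjJkKlLmMnNpPqQrRsStTvVwWxXzZ]/;

# Keep the words that match the given pattern; grep only asks "did it match?".
sub filter_words {
    my ($pattern, @words) = @_;
    return grep { $_ =~ $pattern } @words;
}

my @words = qw(vw vwx vwxz vwxzp vwxzpdq);

my @greedy = filter_words(qr/(?:$c){4,}/,  @words);
my @lazy   = filter_words(qr/(?:$c){4,}?/, @words);
my @exact  = filter_words(qr/(?:$c){4}/,   @words);

print "greedy: @greedy\n";
print "lazy:   @lazy\n";
print "exact:  @exact\n";
```

      All three lists come out as vwxz, vwxzp, vwxzpdq. The choice of quantifier only matters when the captured text itself is used, as in the $1 demonstration above.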


      Give a man a fish:  <%-(-(-(-<
