match sequences of words based on number of characters

nicemank has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract sequences of words according to how many characters in each word.

So I want to extract for instance a sequence based on the number of characters (here defined as letters of the alphabet - not punctuation, numbers, white space).

For instance: I want sequences of 2, 4 and 3 character words - in that order only (but it could be any numbers of characters in any order I choose).

Say my text is: "xxxx yy zzzzz xxxx qqq"

I should extract the sequence: "yy xxxx qqq"

and keep on doing it. So from "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq"

I should extract

"yy xxxx qqq yy xxxx qqq"

 
my $string = "xxxx yy zzzzz xxxx qqq";  

my @array = ( $string =~  /(\b..?\b) (\b....?\b)  (\b...?\b)/sg );

print @array;


# produces nothing. 

# I have also tried rewriting it without success:  it may 
# produce results, but not the right ones! (not the exact 
# sequence)

# also if the string were longer it should produce
# the sequence repeated: 
# "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq" 
# should produce "yy xxxx qqq  yy xxxx qqq" etc etc 
# until we run out of text.
[download]

I have also tried running adaptions of remiah's code, but without success: http://www.perlmonks.org/?node_id=996670. The problem/task differs and I cannot adapt the code to it. Inability! nicemank thanks in advance!

Comment on match sequences of words based on number of characters Download Code

Replies are listed 'Best First'.
Re: match sequences of words based on number of characters by AnomalousMonk (Archbishop) on Feb 17, 2013 at 19:56 UTC
Another way: >perl -wMstrict -le "my $s = 'aaaa bb ccccc ddd eeeeeee ffff gg hhhhh iii jjjjjjj'; ;; for my $ar ([2, 5, 3], [3, 7, 4], [4, 2],) { my $rx = rxg(@$ar); print $rx; my @groups = $s =~ m{ ($rx) }xmsg; print qq{'$_' } for @groups; } ;; sub rxg { my ($rx) = map qr{ \b $_ \b }xms, join ' \s+ ', map qq{\\w{$_}}, @_ ; ;; return $rx; } " (?^msx: \b \w{2} \s+ \w{5} \s+ \w{3} \b ) 'bb ccccc ddd' 'gg hhhhh iii' (?^msx: \b \w{3} \s+ \w{7} \s+ \w{4} \b ) 'ddd eeeeeee ffff' (?^msx: \b \w{4} \s+ \w{2} \b ) 'aaaa bb' 'ffff gg' [download]	[reply] [d/l]
Re^2: match sequences of words based on number of characters by nicemank (Novice) on Feb 17, 2013 at 21:17 UTC
Thanks for your kind efforts here. But I have tried running this but it produces an error: "Undefined subroutine &main::rxg called at whatnot.pl line 9". You might be assuming something; but have I missed it....? nicemank.	[reply]
Re^3: match sequences of words based on number of characters by AnomalousMonk (Archbishop) on Feb 17, 2013 at 21:47 UTC
... "Undefined subroutine &main::rxg called at whatnot.pl line 9". I don't know what's in whatnot.pl, but somewhere in there must be the subroutine definition for `rxg()` that I included in my post above; please take another look.	[reply] [d/l]
Re^2: match sequences of words based on number of characters by frozenwithjoy (Priest) on Feb 17, 2013 at 23:56 UTC
Based on the examples, I don't believe that nicemank is requiring captured words to be adjacent. Maybe change `\s+` to some non-greedy length of characters.	[reply] [d/l]
Re^3: match sequences of words based on number of characters by AnomalousMonk (Archbishop) on Feb 18, 2013 at 03:25 UTC
Based on the examples, I don't believe that nicemank is requiring captured words to be adjacent. Hmmm... After taking another look at the OP, I think you may be right. In which case: `>perl -wMstrict -le "my $s = 'xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq'; ;; for my $ar ([2, 4, 3], [5, 3]) { my $rx = rxg(@$ar); print $rx; my @groups = $s =~ m{ $rx }xmsg; print qq{'$_'} for @groups; } ;; sub rxg { my ($rx) = map qr{ \b $_ \b }xms, join ' \b .+? \b ', map qq{\\w{$_}}, @_ ; ;; return $rx; } " (?^msx: \b \w{2} \b .+? \b \w{4} \b .+? \b \w{3} \b ) 'yy zzzzz xxxx qqq' 'yy zzzzz xxxx qqq' (?^msx: \b \w{5} \b .+? \b \w{3} \b ) 'zzzzz xxxx qqq' 'zzzzz xxxx qqq'` [download] Update: No, darn it, that's still not right! nicemank seems to want `'yy xxxx qqq'` from `'yy zzzzz xxxx qqq'`. Oh, well...	[reply] [d/l] [select]
Re^4: match sequences of words based on number of characters by frozenwithjoy (Priest) on Feb 18, 2013 at 04:00 UTC
Re: match sequences of words based on number of characters by frozenwithjoy (Priest) on Feb 17, 2013 at 21:37 UTC
Like this? `perl -E ' my $string = "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq"; my @array = $string =~ /\b(\w{2})\b.+?\b(\w{4})\b.+?\b(\w{3})\b/g; say "@array"; ' yy xxxx qqq yy xxxx qqq` [download] *Edit: here is an approach that lets you auto-customize the regex.* `#!/usr/bin/env perl use strict; use warnings; use feature 'say'; my $regex = build_regex( 2, 4, 3 ); say "Regex: $regex"; my $string = "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq"; my @match = $string =~ /$regex/g; say "Match: @match"; sub build_regex { my ( $first, @others ) = @_; my $regex = qr{\b(\w{$first})\b}; $regex .= qr{.+?\b(\w{$_})\b} for @others; return $regex; } __END__ Regex: (?^:\b(\w{2})\b)(?^:.+?\b(\w{4})\b)(?^:.+?\b(\w{3})\b) Match: yy xxxx qqq yy xxxx qqq` [download]	[reply] [d/l] [select]
Re: match sequences of words based on number of characters by Kenosis (Priest) on Feb 18, 2013 at 01:31 UTC
As another option, here's a subroutine to which you send a string and a list of lengths. Matches are returned as an array of arrays (AoA), and the array's empty if nothing's matched: `use strict; use warnings; my $str = 'xxxx yy zzzzz xxxx qqq xxxx vv zzzzz wwww ppp'; my @lengths = qw/2 4 3/; my @seqs = getSequences( $str, @lengths ); print "@$_\n" for @seqs; sub getSequences { my ( $string, @lengths ) = @_; my ( $i, @sequences ) = 0; my $re = join '\b.+?\b', map { $i++; "(?<C$i>[a-z]{$_})" } @length +s; push @sequences, [ map $+{"C$_"}, 1 .. @lengths ] while $string =~ /\b$re\b/ig; return @sequences; }` [download] Output: `yy xxxx qqq vv wwww ppp` [download] You mentioned only letters, so `[a-z]` was used in the regex. However, you may use `\\w` instead, if that works better for you. Of course, sending the subroutine different lists of lengths produces different results, as the regex is dynamically built.	[reply] [d/l] [select]
Re^2: match sequences of words based on number of characters by nicemank (Novice) on Feb 18, 2013 at 09:43 UTC
And Kenosis's code also works (unajacent words if needed, any combination of characters). thanks to you, nicemank.	[reply]
Re: match sequences of words based on number of characters by Anonymous Monk on Feb 17, 2013 at 18:22 UTC
add `use re 'debug';` to see what the regex engine is doing, and why your match fails Then, use my @list = grep $lengthy, split /\W/, $str Where `my $lengthy = do { my @lengths = ( 2, 6, 6 ); my $lix = 0; sub { if( $lix < @lis and $lengths[ $lix ] == length $_ ){ $lix++; return !!1; } return !!0; } };` [download] or whatever counting logic you require	[reply] [d/l]
Re^2: match sequences of words based on number of characters by AnomalousMonk (Archbishop) on Feb 17, 2013 at 20:09 UTC
I tried a few variations and I can't seem to make that work. Can you supply a self-contained, working example?	[reply]
Re^3: match sequences of words based on number of characters by Anonymous Monk on Feb 18, 2013 at 00:17 UTC
I tried a few variations and I can't seem to make that work. Show your efforts :) What do you think is wrong with it? Can you supply a self-contained, working example? Theoretically :) did you try Basic debugging checklist? I think this ought to show what is wrong with my syntax `#!/usr/bin/perl -- use strict; use warnings; use Data::Dump; my $str = "xxxx yy zzzzz xxxx qqq"; my $lengthy = sub { warn 1 }; my @list = grep $lengthy, split /\W/, $str; dd \@list; __END__` [download] Its basically as if I wrote grep 1, ... and here I thought grep knew to take a subroutine reference, it works with grep \&somename, but it has to be grep $lengthy->(),... And on top of that no warnings of any kind, surprising	[reply] [d/l]
Re^4: match sequences of words based on number of characters by Anonymous Monk on Feb 18, 2013 at 00:28 UTC
Re^3: match sequences of words based on number of characters by nicemank (Novice) on Feb 17, 2013 at 21:31 UTC
Hi, when I run your code it produces nothing. Did I miss something? Thanks for your advice, nicemank.	[reply]
Re^4: match sequences of words based on number of characters by AnomalousMonk (Archbishop) on Feb 17, 2013 at 21:56 UTC


Don't ask to ask, just ask
	PerlMonks