nicemank has asked for the wisdom of the Perl Monks concerning the following question:
I want to extract sequences of words according to how many characters in each word.
So I want to extract for instance a sequence based on the number of characters (here defined as letters of the alphabet - not punctuation, numbers, white space).
For instance: I want sequences of 2, 4 and 3 character words - in that order only (but it could be any numbers of characters in any order I choose).
Say my text is: "xxxx yy zzzzz xxxx qqq"
I should extract the sequence: "yy xxxx qqq"
and keep on doing it. So from "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq"
I should extract
"yy xxxx qqq yy xxxx qqq"
my $string = "xxxx yy zzzzz xxxx qqq";
my @array = ( $string =~ /(\b..?\b) (\b....?\b) (\b...?\b)/sg );
print @array;
# produces nothing.
# I have also tried rewriting it without success: it may
# produce results, but not the right ones! (not the exact
# sequence)
# also if the string were longer it should produce
# the sequence repeated:
# "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq"
# should produce "yy xxxx qqq yy xxxx qqq" etc etc
# until we run out of text.
I have also tried running adaptions of remiah's code,
but without success: http://www.perlmonks.org/?node_id=996670.
The problem/task differs and I cannot adapt the code to it. Inability!
nicemank thanks in advance!
Re: match sequences of words based on number of characters
by AnomalousMonk (Archbishop) on Feb 17, 2013 at 19:56 UTC
|
>perl -wMstrict -le
"my $s =
'aaaa bb ccccc ddd eeeeeee ffff gg hhhhh iii jjjjjjj';
;;
for my $ar ([2, 5, 3], [3, 7, 4], [4, 2],) {
my $rx = rxg(@$ar);
print $rx;
my @groups = $s =~ m{ ($rx) }xmsg;
print qq{'$_' } for @groups;
}
;;
sub rxg {
my ($rx) =
map qr{ \b $_ \b }xms,
join ' \s+ ',
map qq{\\w{$_}},
@_
;
;;
return $rx;
}
"
(?^msx: \b \w{2} \s+ \w{5} \s+ \w{3} \b )
'bb ccccc ddd'
'gg hhhhh iii'
(?^msx: \b \w{3} \s+ \w{7} \s+ \w{4} \b )
'ddd eeeeeee ffff'
(?^msx: \b \w{4} \s+ \w{2} \b )
'aaaa bb'
'ffff gg'
| [reply] [d/l] |
|
Thanks for your kind efforts here.
But I have tried running this but it produces an error:
"Undefined subroutine &main::rxg called at whatnot.pl line 9".
You might be assuming something; but have I missed it....?
nicemank.
| [reply] |
|
| [reply] [d/l] |
|
Based on the examples, I don't believe that nicemank is requiring captured words to be adjacent. Maybe change \s+ to some non-greedy length of characters.
| [reply] [d/l] |
|
>perl -wMstrict -le
"my $s = 'xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq';
;;
for my $ar ([2, 4, 3], [5, 3]) {
my $rx = rxg(@$ar);
print $rx;
my @groups = $s =~ m{ $rx }xmsg;
print qq{'$_'} for @groups;
}
;;
sub rxg {
my ($rx) =
map qr{ \b $_ \b }xms,
join ' \b .+? \b ',
map qq{\\w{$_}},
@_
;
;;
return $rx;
}
"
(?^msx: \b \w{2} \b .+? \b \w{4} \b .+? \b \w{3} \b )
'yy zzzzz xxxx qqq'
'yy zzzzz xxxx qqq'
(?^msx: \b \w{5} \b .+? \b \w{3} \b )
'zzzzz xxxx qqq'
'zzzzz xxxx qqq'
Update: No, darn it, that's still not right! nicemank seems to want 'yy xxxx qqq' from 'yy zzzzz xxxx qqq'. Oh, well...
| [reply] [d/l] [select] |
|
Re: match sequences of words based on number of characters
by frozenwithjoy (Priest) on Feb 17, 2013 at 21:37 UTC
|
perl -E '
my $string = "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq";
my @array = $string =~ /\b(\w{2})\b.+?\b(\w{4})\b.+?\b(\w{3})\b/g;
say "@array";
'
yy xxxx qqq yy xxxx qqq
Edit: here is an approach that lets you auto-customize the regex.
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
my $regex = build_regex( 2, 4, 3 );
say "Regex: $regex";
my $string = "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq";
my @match = $string =~ /$regex/g;
say "Match: @match";
sub build_regex {
my ( $first, @others ) = @_;
my $regex = qr{\b(\w{$first})\b};
$regex .= qr{.+?\b(\w{$_})\b} for @others;
return $regex;
}
__END__
Regex: (?^:\b(\w{2})\b)(?^:.+?\b(\w{4})\b)(?^:.+?\b(\w{3})\b)
Match: yy xxxx qqq yy xxxx qqq
| [reply] [d/l] [select] |
Re: match sequences of words based on number of characters
by Kenosis (Priest) on Feb 18, 2013 at 01:31 UTC
|
As another option, here's a subroutine to which you send a string and a list of lengths. Matches are returned as an array of arrays (AoA), and the array's empty if nothing's matched:
use strict;
use warnings;
my $str = 'xxxx yy zzzzz xxxx qqq xxxx vv zzzzz wwww ppp';
my @lengths = qw/2 4 3/;
my @seqs = getSequences( $str, @lengths );
print "@$_\n" for @seqs;
sub getSequences {
my ( $string, @lengths ) = @_;
my ( $i, @sequences ) = 0;
my $re = join '\b.+?\b', map { $i++; "(?<C$i>[a-z]{$_})" } @length
+s;
push @sequences, [ map $+{"C$_"}, 1 .. @lengths ]
while $string =~ /\b$re\b/ig;
return @sequences;
}
Output:
yy xxxx qqq
vv wwww ppp
You mentioned only letters, so [a-z] was used in the regex. However, you may use \\w instead, if that works better for you. Of course, sending the subroutine different lists of lengths produces different results, as the regex is dynamically built. | [reply] [d/l] [select] |
|
And Kenosis's code also works (unajacent words if needed, any combination of characters).
thanks to you,
nicemank.
| [reply] |
Re: match sequences of words based on number of characters
by Anonymous Monk on Feb 17, 2013 at 18:22 UTC
|
add use re 'debug'; to see what the regex engine is doing, and why your match fails
Then, use my @list = grep $lengthy, split /\W/, $str
Where
my $lengthy = do {
my @lengths = ( 2, 6, 6 );
my $lix = 0;
sub {
if( $lix < @lis and $lengths[ $lix ] == length $_ ){
$lix++;
return !!1;
}
return !!0;
}
};
or whatever counting logic you require | [reply] [d/l] |
|
| [reply] |
|
I tried a few variations and I can't seem to make that work. Show your efforts :)
What do you think is wrong with it?
Can you supply a self-contained, working example?
Theoretically :) did you try Basic debugging checklist?
I think this ought to show what is wrong with my syntax
#!/usr/bin/perl --
use strict; use warnings; use Data::Dump;
my $str = "xxxx yy zzzzz xxxx qqq";
my $lengthy = sub { warn 1 };
my @list = grep $lengthy, split /\W/, $str;
dd \@list;
__END__
Its basically as if I wrote grep 1, ...
and here I thought grep knew to take a subroutine reference, it works with grep \&somename, but it has to be grep $lengthy->(),...
And on top of that no warnings of any kind, surprising | [reply] [d/l] |
|
|
Hi, when I run your code it produces nothing. Did I miss something?
Thanks for your advice,
nicemank.
| [reply] |
|
|
|