Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

match sequences of words based on number of characters

by nicemank (Novice)
on Feb 17, 2013 at 18:10 UTC ( #1019179=perlquestion: print w/ replies, xml ) Need Help??
nicemank has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract sequences of words according to how many characters in each word.

So I want to extract for instance a sequence based on the number of characters (here defined as letters of the alphabet - not punctuation, numbers, white space).

For instance: I want sequences of 2, 4 and 3 character words - in that order only (but it could be any numbers of characters in any order I choose).

Say my text is: "xxxx yy zzzzz xxxx qqq"

I should extract the sequence: "yy xxxx qqq"

and keep on doing it. So from "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq"

I should extract

"yy xxxx qqq yy xxxx qqq"

my $string = "xxxx yy zzzzz xxxx qqq"; my @array = ( $string =~ /(\b..?\b) (\b....?\b) (\b...?\b)/sg ); print @array; # produces nothing. # I have also tried rewriting it without success: it may # produce results, but not the right ones! (not the exact # sequence) # also if the string were longer it should produce # the sequence repeated: # "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq" # should produce "yy xxxx qqq yy xxxx qqq" etc etc # until we run out of text.

I have also tried running adaptions of remiah's code, but without success: http://www.perlmonks.org/?node_id=996670. The problem/task differs and I cannot adapt the code to it. Inability! nicemank thanks in advance!

Comment on match sequences of words based on number of characters
Download Code
Re: match sequences of words based on number of characters
by Anonymous Monk on Feb 17, 2013 at 18:22 UTC

    add use re 'debug'; to see what the regex engine is doing, and why your match fails

    Then, use my @list = grep $lengthy, split /\W/, $str

    Where

    my $lengthy = do { my @lengths = ( 2, 6, 6 ); my $lix = 0; sub { if( $lix < @lis and $lengths[ $lix ] == length $_ ){ $lix++; return !!1; } return !!0; } };
    or whatever counting logic you require

      I tried a few variations and I can't seem to make that work. Can you supply a self-contained, working example?

        Hi, when I run your code it produces nothing. Did I miss something? Thanks for your advice, nicemank.

        I tried a few variations and I can't seem to make that work.

        Show your efforts :)

        What do you think is wrong with it?

        Can you supply a self-contained, working example?

        Theoretically :) did you try Basic debugging checklist?

        I think this ought to show what is wrong with my syntax

        #!/usr/bin/perl -- use strict; use warnings; use Data::Dump; my $str = "xxxx yy zzzzz xxxx qqq"; my $lengthy = sub { warn 1 }; my @list = grep $lengthy, split /\W/, $str; dd \@list; __END__

        Its basically as if I wrote grep 1, ...

        and here I thought grep knew to take a subroutine reference, it works with grep \&somename, but it has to be grep $lengthy->(),...

        And on top of that no warnings of any kind, surprising

Re: match sequences of words based on number of characters
by AnomalousMonk (Monsignor) on Feb 17, 2013 at 19:56 UTC

    Another way:

    >perl -wMstrict -le "my $s = 'aaaa bb ccccc ddd eeeeeee ffff gg hhhhh iii jjjjjjj'; ;; for my $ar ([2, 5, 3], [3, 7, 4], [4, 2],) { my $rx = rxg(@$ar); print $rx; my @groups = $s =~ m{ ($rx) }xmsg; print qq{'$_' } for @groups; } ;; sub rxg { my ($rx) = map qr{ \b $_ \b }xms, join ' \s+ ', map qq{\\w{$_}}, @_ ; ;; return $rx; } " (?^msx: \b \w{2} \s+ \w{5} \s+ \w{3} \b ) 'bb ccccc ddd' 'gg hhhhh iii' (?^msx: \b \w{3} \s+ \w{7} \s+ \w{4} \b ) 'ddd eeeeeee ffff' (?^msx: \b \w{4} \s+ \w{2} \b ) 'aaaa bb' 'ffff gg'
      Thanks for your kind efforts here. But I have tried running this but it produces an error: "Undefined subroutine &main::rxg called at whatnot.pl line 9". You might be assuming something; but have I missed it....? nicemank.
        ... "Undefined subroutine &main::rxg called at whatnot.pl line 9".

        I don't know what's in whatnot.pl, but somewhere in there must be the subroutine definition for  rxg() that I included in my post above; please take another look.

      Based on the examples, I don't believe that nicemank is requiring captured words to be adjacent. Maybe change \s+ to some non-greedy length of characters.
        Based on the examples, I don't believe that nicemank is requiring captured words to be adjacent.

        Hmmm... After taking another look at the OP, I think you may be right. In which case:

        >perl -wMstrict -le "my $s = 'xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq'; ;; for my $ar ([2, 4, 3], [5, 3]) { my $rx = rxg(@$ar); print $rx; my @groups = $s =~ m{ $rx }xmsg; print qq{'$_'} for @groups; } ;; sub rxg { my ($rx) = map qr{ \b $_ \b }xms, join ' \b .+? \b ', map qq{\\w{$_}}, @_ ; ;; return $rx; } " (?^msx: \b \w{2} \b .+? \b \w{4} \b .+? \b \w{3} \b ) 'yy zzzzz xxxx qqq' 'yy zzzzz xxxx qqq' (?^msx: \b \w{5} \b .+? \b \w{3} \b ) 'zzzzz xxxx qqq' 'zzzzz xxxx qqq'

        Update: No, darn it, that's still not right! nicemank seems to want  'yy xxxx qqq' from  'yy zzzzz xxxx qqq'. Oh, well...

Re: match sequences of words based on number of characters
by frozenwithjoy (Curate) on Feb 17, 2013 at 21:37 UTC
    Like this?
    perl -E ' my $string = "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq"; my @array = $string =~ /\b(\w{2})\b.+?\b(\w{4})\b.+?\b(\w{3})\b/g; say "@array"; ' yy xxxx qqq yy xxxx qqq
    Edit: here is an approach that lets you auto-customize the regex.
    #!/usr/bin/env perl use strict; use warnings; use feature 'say'; my $regex = build_regex( 2, 4, 3 ); say "Regex: $regex"; my $string = "xxxx yy zzzzz xxxx qqq xxxx yy zzzzz xxxx qqq"; my @match = $string =~ /$regex/g; say "Match: @match"; sub build_regex { my ( $first, @others ) = @_; my $regex = qr{\b(\w{$first})\b}; $regex .= qr{.+?\b(\w{$_})\b} for @others; return $regex; } __END__ Regex: (?^:\b(\w{2})\b)(?^:.+?\b(\w{4})\b)(?^:.+?\b(\w{3})\b) Match: yy xxxx qqq yy xxxx qqq
Re: match sequences of words based on number of characters
by Kenosis (Priest) on Feb 18, 2013 at 01:31 UTC

    As another option, here's a subroutine to which you send a string and a list of lengths. Matches are returned as an array of arrays (AoA), and the array's empty if nothing's matched:

    use strict; use warnings; my $str = 'xxxx yy zzzzz xxxx qqq xxxx vv zzzzz wwww ppp'; my @lengths = qw/2 4 3/; my @seqs = getSequences( $str, @lengths ); print "@$_\n" for @seqs; sub getSequences { my ( $string, @lengths ) = @_; my ( $i, @sequences ) = 0; my $re = join '\b.+?\b', map { $i++; "(?<C$i>[a-z]{$_})" } @length +s; push @sequences, [ map $+{"C$_"}, 1 .. @lengths ] while $string =~ /\b$re\b/ig; return @sequences; }

    Output:

    yy xxxx qqq vv wwww ppp

    You mentioned only letters, so [a-z] was used in the regex. However, you may use \\w instead, if that works better for you. Of course, sending the subroutine different lists of lengths produces different results, as the regex is dynamically built.

      And Kenosis's code also works (unajacent words if needed, any combination of characters). thanks to you,

      nicemank.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1019179]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (5)
As of 2014-07-23 00:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (130 votes), past polls