PerlMonks  

Re: RE question: Sentence with a minimum length

by salva (Abbot)
on Oct 06, 2008 at 09:02 UTC (#715532)


in reply to RE question: Sentence with a minimum length

You can use a look-ahead assertion (see perlre) to ensure that there are at least two words in the sentence:
/\s*(?=\w+\s+\w+)[\w\s]{49,}\w/
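A minimal sketch of how that pattern behaves on a few inputs. The helper name `looks_like_sentence` and the sample strings are made up for illustration and are not part of the original post:

```perl
use strict;
use warnings;

# The pattern from the post: the look-ahead demands two words,
# the rest demands at least 50 word/whitespace characters.
my $re = qr/\s*(?=\w+\s+\w+)[\w\s]{49,}\w/;

# Hypothetical helper: returns 1 if $text contains such a "sentence", else 0.
sub looks_like_sentence {
    my ($text) = @_;
    return $text =~ $re ? 1 : 0;
}

my @samples = (
    [ 'one long word'  => 'x' x 60 ],                 # long enough, but no second word
    [ 'long sentence'  => join(' ', ('word') x 12) ], # 59 characters, twelve words
    [ 'short sentence' => 'two words' ],              # two words, but too short
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;
    printf "%-15s %s\n", $label, looks_like_sentence($text) ? 'match' : 'no match';
}
```

Only the long sentence should match: the single long word fails the look-ahead, and the short sentence fails the length requirement.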

Re^2: RE question: Sentence with a minimum length
by moritz (Cardinal) on Oct 06, 2008 at 09:22 UTC
    lima1 wants minimal matches, so the {49,} should actually be {49,}?, in which case it stops working.

    The reason is that the look-ahead is not limited to what the [\w\s]{49,}? matches. A small demonstration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $re = qr{^\s*(?=\w+\s+\w+)[\w\s]{49,}?\w};
    my $str = ('x' x 65) . ' x';
    if ($str =~ m/$re/) {
        print $&, $/;
    }
    __END__
    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    You can see that the match only includes a long word, not a sentence. There is a sentence present, but it's not matched.
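For comparison, a small sketch that runs the greedy and the non-greedy variant against the same string. The helper `matched_text` is a hypothetical name introduced here:

```perl
use strict;
use warnings;

my $str = ('x' x 65) . ' x';

# Same pattern twice; only the quantifier differs.
my $greedy = qr{^\s*(?=\w+\s+\w+)[\w\s]{49,}\w};
my $lazy   = qr{^\s*(?=\w+\s+\w+)[\w\s]{49,}?\w};

# Hypothetical helper: returns the matched text, or undef if there is no match.
sub matched_text {
    my ($re, $text) = @_;
    return $text =~ /($re)/ ? $1 : undef;
}

for my $case ([ greedy => $greedy ], [ lazy => $lazy ]) {
    my ($name, $re) = @$case;
    my $m = matched_text($re, $str);
    printf "%-6s length %d, contains whitespace: %s\n",
        $name, length($m), ($m =~ /\s/ ? 'yes' : 'no');
}
```

The greedy version consumes all 67 characters, including the space and the second word. The non-greedy version stops after 50 characters, all inside the first word, because the look-ahead has already been satisfied by text further to the right.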

    Update: In Perl 6 you could use & like this:

    regex short_word { \s* [ & \w+ \s+ \w+ & .**?{1..50} ] }
      Well, I am not a native English speaker, but as I understand it...
      I want to match sentences (everything that is not just a long word) of a minimum length
      ...doesn't mean that the length of the sentence has to be minimal, but that it has to be greater than or equal to some $minimum_length.

      Anyway, if a minimal match is what you want, it still can be done with a regexp (without embedded code!), though a complicated one:

      sub make_re {
          my $len = shift;
          $len > 3 or die "len <= 3";
          my $re = "\\b"
                 . join('|',
                        map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                        "\\w+\\s+")
                 . (")" x ($len - 3))
                 . "\\w+";
          warn "re: /$re/\n";
          return qr/($re)/;
      }

      my $re = make_re(5);

      while (<DATA>) {
          print "$1\n" if $_ =~ $re;
      }
      __DATA__
      foo foo foooooooo foooooooo fooo foo foo foo foo foo foo foooooooo foo foo foo f foo fo fo foo f fo foo f fo
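To make the generated pattern easier to read, here is a hand-expanded sketch of what make_re(5) should produce, written with /x and comments. The name `first_sentence` and the sample strings are assumptions added for illustration:

```perl
use strict;
use warnings;

# Hand-expanded equivalent of the pattern built for a minimum length of 5.
# Each branch pairs a first-word length with the minimum it still needs.
my $re5 = qr/
    \b \w
    (?: \s [\s\w]{2,}?            # first word is 1 char: need 2 or more chars after the gap
      | \w (?: \s [\s\w]{1,}?     # first word is 2 chars: need 1 or more chars after the gap
             | \w+ \s+            # first word is 3 or more chars: any second word will do
           )
    )
    \w+
/x;

# Hypothetical helper: returns the first "sentence" found, or undef.
sub first_sentence {
    my ($text) = @_;
    return $text =~ /($re5)/ ? $1 : undef;
}

for my $text ('a bcd', 'a bc', 'foooooooo', 'foo foooooooo') {
    my $m = first_sentence($text);
    printf "%-15s => %s\n", "'$text'", defined $m ? "'$m'" : 'no match';
}
```

Here 'a bcd' and 'foo foooooooo' match in full, while 'a bc' (only four characters) and the single word 'foooooooo' do not match.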
        I was curious about the efficiency of the regular expression generated in my previous post. I ran some benchmarks, and the results were somewhat unexpected, at least to me:
        my $len = 300;
        my @lines;
        push @lines, join('',
                          (map { 'f' . ('o' x rand $len * 1.5),
                                 (rand > .8 ? '. ' : ' ') } 0..rand 20),
                          "\n")
            for 0 .. 1000;

        sub make_re {
            my $len = shift;
            my $re = "\\b"
                   . join('|',
                          map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                          "\\w+\\s+")
                   . (")" x ($len - 3))
                   . "\\w+";
            qr/($re)/;
        }

        # match maximal length sentence
        my $len_minus_two = $len - 2;
        sub max { my @m = grep /\s*\b\w(?=\w*\s+\w+)[\w\s]{$len_minus_two,}\w/o, @lines }

        # match minimal length sentence
        my $re = make_re $len;
        sub min { my @m = grep /$re/, @lines }

        use Benchmark qw(cmpthese);
        cmpthese(-1, { max => \&max, min => \&min } );

        __OUTPUT__
              Rate  max  min
        max 24.3/s   -- -26%
        min 33.0/s  36%   --
        Note that the two regexps used match different things.
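A small sketch of that difference, using a minimum length of 5 instead of 300 so the result is easy to check by eye. The sample string and the names `max_match` and `min_match` are assumptions for illustration; make_re is adapted from the post with the warn removed:

```perl
use strict;
use warnings;

# Adapted from make_re() in the post: builds the minimal-match pattern.
sub make_re {
    my $len = shift;
    my $re = "\\b"
           . join('|',
                  map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                  "\\w+\\s+")
           . (")" x ($len - 3))
           . "\\w+";
    return qr/($re)/;
}

my $len           = 5;
my $len_minus_two = $len - 2;
my $max_re = qr/(\s*\b\w(?=\w*\s+\w+)[\w\s]{$len_minus_two,}\w)/;
my $min_re = make_re($len);

# Hypothetical helpers: return the text each pattern captures, or undef.
sub max_match { my ($text) = @_; return $text =~ $max_re ? $1 : undef }
sub min_match { my ($text) = @_; return $text =~ $min_re ? $1 : undef }

my $text = 'ab cd ef gh';
print "max: '", max_match($text), "'\n";
print "min: '", min_match($text), "'\n";
```

On this input the greedy pattern captures the whole string, while the generated pattern stops at 'ab cd', the shortest run of whole words that reaches five characters. A benchmark over the two therefore compares different amounts of work, not just two ways of doing the same thing.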
Re^2: RE question: Sentence with a minimum length
by lima1 (Curate) on Oct 06, 2008 at 09:08 UTC
    Ah, nice. Seems to work perfectly! Thank you very much...

    Update:

    ++moritz. But that's still ok for me, because it filters my problematic case (just a long word). That's all I need here.
