PerlMonks  

Re^2: RE question: Sentence with a minimum length

by moritz (Cardinal)
on Oct 06, 2008 at 09:22 UTC


in reply to Re: RE question: Sentence with a minimum length
in thread RE question: Sentence with a minimum length

lima1 wants minimal matches, so the {49,} should actually be {49,}?, in which case it stops working.

The reason is that the look-ahead is not limited to what the [\w\s]{49,}? matches. A small demonstration:

#!/usr/bin/perl
use strict;
use warnings;

my $re = qr{^\s*(?=\w+\s+\w+)[\w\s]{49,}?\w};
my $str = ('x' x 65) . ' x';
if ($str =~ m/$re/) {
    print $&, $/;
}
__END__
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

You see that the match only includes a long word, not a sentence. There is a sentence present in the string, but it's not what gets matched.
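
To make the contrast explicit, here is a minimal sketch that runs the greedy and the lazy quantifier against the same string. The capture groups and the variable names ($greedy, $lazy and so on) are additions for illustration and are not part of the original pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same test string as in the demonstration: one 65-character word, then " x".
my $str = ('x' x 65) . ' x';

# Greedy and lazy variants of the pattern; only the quantifier differs.
# The capture group is added here so the matched text can be inspected.
my $greedy = qr{^\s*(?=\w+\s+\w+)([\w\s]{49,}\w)};
my $lazy   = qr{^\s*(?=\w+\s+\w+)([\w\s]{49,}?\w)};

my ($greedy_match) = $str =~ $greedy;
my ($lazy_match)   = $str =~ $lazy;

# Report the length of each match and whether it spans more than one word.
for my $case ([greedy => $greedy_match], [lazy => $lazy_match]) {
    my ($name, $match) = @$case;
    printf "%-6s length %2d, contains whitespace: %s\n",
        $name, length($match), ($match =~ /\s/ ? 'yes' : 'no');
}
```

The look-ahead succeeds in both cases because it inspects the whole rest of the string. The greedy version happens to consume the space as well, while the lazy version stops after 50 word characters, before the second word is ever reached.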

Update: In Perl 6 you could use & like this:

regex short_word { \s* [ & \w+ \s+ \w+ & .**?{1..50} ] }

Replies are listed 'Best First'.
Re^3: RE question: Sentence with a minimum length
by salva (Canon) on Oct 06, 2008 at 11:01 UTC
    Well, I am not a native English speaker, but as I understand it...
    I want to match sentences (everything that is not just a long word) of a minimum length
    ...doesn't mean that the length of the sentence has to be minimal, but that it has to be equal to or greater than some $minimum_length.

    Anyway, if a minimal match is what you want, it can still be done with a regexp (without embedded code!), though a complicated one:

    sub make_re {
        my $len = shift;
        $len > 3 or die "len <= 3";
        my $re = "\\b"
            . join('|',
                   map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                   "\\w+\\s+")
            . (")" x ($len - 3))
            . "\\w+";
        warn "re: /$re/\n";
        return qr/($re)/;
    }

    my $re = make_re(5);
    while(<DATA>) {
        print "$1\n" if $_ =~ $re;
    }
    __DATA__
    foo foo foooooooo foooooooo fooo foo foo foo foo foo foo foooooooo foo foo foo f foo fo fo foo f fo foo f fo
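
The nesting is easier to follow when the generated pattern is printed for a small length. The sketch below uses a hypothetical helper, build_pattern, which repeats the string construction from make_re but returns the pattern text instead of a compiled regexp; the test strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_pattern is a hypothetical helper: the same string construction as
# make_re, but it returns the pattern text so it can be inspected.
sub build_pattern {
    my $len = shift;
    $len > 3 or die "len <= 3";
    return "\\b"
        . join('|',
               map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
               "\\w+\\s+")
        . (")" x ($len - 3))
        . "\\w+";
}

my $pattern = build_pattern(5);
print "$pattern\n";

# Try the compiled pattern on a few illustrative strings.
my $re = qr/($pattern)/;
for my $s ('f foo', 'f fo', 'foooooooo', 'foo foo foooooooo') {
    my ($m) = $s =~ $re;
    printf "%-20s => %s\n", "'$s'", defined($m) ? "'$m'" : 'no match';
}
```

For a length of 5 the pattern comes out as \b\w(?:\s[\s\w]{2,}?|\w(?:\s[\s\w]{1,}?|\w+\s+))\w+. Each nesting level consumes one more character of the first word, and the remaining minimum for the lazy quantifier shrinks by one accordingly, so every branch adds up to at least five characters and always includes whitespace between two words.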
      I was curious about the efficiency of the regular expression generated in my previous post. I ran some benchmarks, and the results were somewhat unexpected, at least for me:
      my $len = 300;
      my @lines;
      push @lines, join('',
                        (map { 'f' . ('o' x rand $len * 1.5),
                               (rand > .8 ? '. ' : ' ') } 0..rand 20),
                        "\n")
          for 0 .. 1000;

      sub make_re {
          my $len = shift;
          my $re = "\\b"
              . join('|',
                     map("\\w(?:\\s[\\s\\w]{$_,}?", reverse 1 .. ($len - 3)),
                     "\\w+\\s+")
              . (")" x ($len - 3))
              . "\\w+";
          qr/($re)/;
      }

      # match maximal length sentence
      my $len_minus_two = $len - 2;
      sub max { my @m = grep /\s*\b\w(?=\w*\s+\w+)[\w\s]{$len_minus_two,}\w/o, @lines }

      # match minimal length sentence
      my $re = make_re $len;
      sub min { my @m = grep /$re/, @lines }

      use Benchmark qw(cmpthese);
      cmpthese(-1, { max => \&max, min => \&min } );

      __OUTPUT__
            Rate  max  min
      max 24.3/s   -- -26%
      min 33.0/s  36%   --
      Note that the two regexps used match different things.
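
A small sketch of that difference, assuming a minimum length of 5 instead of 300 so the patterns stay readable. The minimal pattern is the expansion of make_re(5) written out by hand, and the sample line is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimum length 5 is an arbitrary choice to keep the patterns readable.
# "Maximal" pattern from the benchmark, with {$len - 2,} written out as {3,}.
my $max_re = qr/(\s*\b\w(?=\w*\s+\w+)[\w\s]{3,}\w)/;

# "Minimal" pattern: what make_re(5) expands to, written out by hand.
my $min_re = qr/(\b\w(?:\s[\s\w]{2,}?|\w(?:\s[\s\w]{1,}?|\w+\s+))\w+)/;

# Illustrative input: four words, no punctuation.
my $line = 'f foo bar baz';
my ($max_match) = $line =~ $max_re;
my ($min_match) = $line =~ $min_re;

print "max: '$max_match'\n";
print "min: '$min_match'\n";
```

On this input the maximal pattern captures the whole line, while the minimal one stops at 'f foo', the shortest run of at least five characters that ends on a word boundary.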
