UPDATE: Read "Regexes are slow (or, why I advocate String::Index)" for a detailed explanation of the general problem with regexes here.
There's no need to use a regex:
if (length($str) > 1000) {
    # cut everything after the last '.' found at or before position 1000
    substr($str, 1+rindex($str, '.', 1000)) = "";
}
See the rindex function's docs. It returns the position of the last occurrence of the substring (here, ".") in the string ($str), or -1 if there is none. We're telling it to start looking at position 1000 (counting from zero) and work backwards.
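Here's a minimal sketch of the same idea wrapped in a sub, in case you want the edge cases spelled out. The name truncate_at_period and the $max parameter are made up for illustration. Two choices in it are my assumptions rather than part of the snippet above: it searches from $max - 1 so the result never exceeds $max characters, and it falls back to a hard cut when there's no period in range (the one-liner above would empty the string in that case, because rindex returns -1).

```perl
use strict;
use warnings;

# Hypothetical helper: trim $str to at most $max characters,
# cutting just after the last '.' that falls within the limit.
sub truncate_at_period {
    my ($str, $max) = @_;
    return $str if length($str) <= $max;

    # rindex searches backwards from a 0-based position, so $max - 1
    # is the last position that still fits within $max characters.
    my $pos = rindex($str, '.', $max - 1);

    # No period in range: fall back to a hard cut (an assumption).
    return substr($str, 0, $max) if $pos < 0;

    # Keep everything up to and including the period.
    return substr($str, 0, $pos + 1);
}

print truncate_at_period("One. Two. Three.", 12), "\n";   # One. Two.
```

Returning a new string instead of assigning through substr() keeps the caller's variable untouched, which is only a matter of taste here.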
If you want to allow various punctuation, might I suggest my String::Index module? It's faster than both the typical regex solution and a hybrid regex/substr solution.
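#!/usr/bin/perl
use Benchmark 'cmpthese';
use String::Index 'crindex';

my $str = "alphabet. alphabet! alphabet? " x 100;

# run each candidate for at least 5 CPU seconds and compare rates
cmpthese(-5, {
    # String::Index: last of . ! ? at or before position 1000
    rcindex => sub {
        my $x = $str;
        substr($x, 1+crindex($x, ".!?", 1000)) = "";
    },
    # pure regex substitution
    regex => sub {
        my $x = $str;
        $x =~ s/^(.{1,999}[.!?]).*/$1/;
    },
    # regex to find the cut point, substr to do the cutting
    rxsubstr => sub {
        my $x = $str;
        $x =~ /^.{1,999}[.!?]/ and substr($x, $+[0]) = "";
    },
});
__END__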
             Rate    regex rxsubstr  rcindex
regex     42520/s       --     -43%     -66%
rxsubstr  75202/s      77%       --     -40%
rcindex  125559/s     195%      67%       --
String::Index gives you four functions that are crosses between Perl's index() function and C's strpbrk() function.
(I need to fix the docs or the module a tad. The function is 'crindex', but I have 'rcindex' somewhere.)
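To make the "index() crossed with strpbrk()" idea concrete, here's a rough pure-Perl stand-in for the kind of search used in the benchmark: find the last character, at or before a given position, that belongs to a set. The sub name last_index_of_any is hypothetical, and this says nothing about how String::Index is actually implemented; it only illustrates the behavior I'm assuming from the usage above.

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT the module's code: position of the last
# character of $str, at or before $pos, that appears in $chars.
# Returns -1 if there is none, mirroring rindex's convention.
sub last_index_of_any {
    my ($str, $chars, $pos) = @_;
    my $last = length($str) - 1;
    $pos = $last if !defined $pos || $pos > $last;

    # Walk backwards, testing each character for membership in $chars.
    for (my $i = $pos; $i >= 0; $i--) {
        return $i if index($chars, substr($str, $i, 1)) >= 0;
    }
    return -1;
}

my $s = "Stop! Who goes there? Me.";
print last_index_of_any($s, ".!?", 10), "\n";   # 4  (the '!')
print last_index_of_any($s, ".!?"), "\n";       # 24 (the final '.')
```

A character-by-character loop like this is slow in Perl, which is part of why a dedicated function is worth having.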
_____________________________________________________
Jeff [japhy] Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;