PerlMonks  

Truncating Last Sentence

by FatDog (Beadle)
on May 14, 2004 at 19:28 UTC ( [id://353465]=perlquestion )

FatDog has asked for the wisdom of the Perl Monks concerning the following question:

How do I truncate the last, partial fragment of a sentence from a paragraph (that is, strip all the characters after the last "." in the string)?

I have several million long-text descriptions that I need to truncate to 1000 characters. This often leaves me with a block of text that looks like:

"...one of his best tracks. It's good. His other notew"

I want to identify these fragments and remove the partial sentence or all chars past the last period to get this:

"...one of his best tracks. It's good."

    if Len($lRow) > 1000 {
        $lRow = substr($lRow, 0, 1000);
        $lRow =~ tr/\. xxxx$//; # here is where I have trouble
    }

I have used split to create an array based on "." characters, then dropped the last element and re-joined the rest, but this increases processing time 20-fold.

I also know I have to be careful of greedy pattern matching, but I am unsure how to use the non-greedy "+?" quantifier in regexes.

Replies are listed 'Best First'.
Re: Truncating Last Sentence
by japhy (Canon) on May 14, 2004 at 20:38 UTC
    UPDATE: Read "Regexes are slow (or, why I advocate String::Index)" for a detailed explanation of the general problem with regexes here.
    There's no need to use a regex:
    if (length($str) > 1000) { substr($str, 1+rindex($str, '.', 1000)) = ""; }
    See the rindex function's docs. It returns the last location of the substring (here, ".") in the string ($str). We're telling it to start looking at the 1000th character (and work backwards).
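To make the rindex mechanics concrete, here is a minimal sketch using only core Perl. The helper name truncate_at_period and the sample text are made up for illustration. It differs from the one-liner above in two ways, both of which are assumptions on my part: it starts the backward search at offset $max - 1 so the result never exceeds $max characters, and it guards against rindex returning -1.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keep at most $max characters,
# ending on the last period found inside that window.
sub truncate_at_period {
    my ($str, $max) = @_;
    return $str if length($str) <= $max;

    # rindex searches backwards starting at a 0-based offset, so $max - 1
    # is the $max-th character of the string.
    my $pos = rindex($str, '.', $max - 1);

    # rindex returns -1 when there is no period in the window; without this
    # guard, 1 + (-1) == 0 and the whole string would be thrown away.
    return substr($str, 0, $max) if $pos < 0;

    return substr($str, 0, $pos + 1);
}

my $text = "One of his best tracks. It's good. His other noteworthy songs are long.";
print truncate_at_period($text, 50), "\n";
```

With a limit of 50 the sample text comes back as the first two sentences only.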

    If you want to allow various punctuation, might I suggest my String::Index module? It's faster than the typical regex solution and a hybrid regex/substr solution.

    #!/usr/bin/perl
    use Benchmark 'cmpthese';
    use String::Index 'crindex';

    my $str = "alphabet. alphabet! alphabet? " x 100;

    cmpthese(-5, {
      rcindex => sub {
        my $x = $str;
        substr($x, 1+crindex($str, ".!?", 1000)) = "";
      },
      regex => sub {
        my $x = $str;
        $x =~ s/^(.{1,999}[.!?]).*/$1/;
      },
      rxsubstr => sub {
        my $x = $str;
        $x =~ /^.{1,999}[.!?]/ and substr($x, $+[0]) = "";
      },
    });

    __END__
               Rate    regex rxsubstr  rcindex
    regex     42520/s       --     -43%     -66%
    rxsubstr  75202/s      77%       --     -40%
    rcindex  125559/s     195%      67%       --

    String::Index gives you four functions that are crosses between Perl's index() function and C's strpbrk() function.

    (I need to fix the docs or the module a tad. The function is 'crindex', but I have 'rcindex' somewhere.)

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
Re: Truncating Last Sentence
by Enlil (Parson) on May 14, 2004 at 19:35 UTC
    instead of
    $lRow =~ tr/\. xxxx$//; # here is where I have trouble
    Try:
    $lRow =~ s/\.[^.]+$/./;

    -enlil

Re: Truncating Last Sentence
by sacked (Hermit) on May 14, 2004 at 19:36 UTC
    You'll want to use the substitution operator s///, not the transliteration operator tr///. See perlop for details. I would suggest something like the following:
    if( length($lRow) > 1000 ) {
        $lRow = substr($lRow, 0, 1000);
        $lRow =~ s/(?<=\.)[^.]*$//;
    }

    --sacked
Re: Truncating Last Sentence
by Belgarion (Chaplain) on May 14, 2004 at 19:34 UTC

    The following code seems to work for me:

    my $s = "one of his best tracks. It's good. His other notew";
    $s =~ s/([.?!]).*$/$1/;
    print $s, "\n";

    __OUTPUT__
    one of his best tracks.

    This handles sentences ending with a period, question mark, or exclamation point. It does not handle quotations properly (like, ...around here."), but I'm sure you could extend it to handle those cases.

    Update: Damn, too fast at the keyboard. If I had looked more closely at the OUTPUT I would have noticed that my code doesn't work.

    Update II: Fixed the regex. It should read:

    my $s = "one of his best tracks. It's good. His other notew";
    $s =~ s/([.?!])[^.?!]*$/$1/;
    print $s, "\n";

    __OUTPUT__
    one of his best tracks. It's good.

    Sorry for the earlier mistake.

Re: Truncating Last Sentence
by FatDog (Beadle) on May 14, 2004 at 21:00 UTC
    Fantastic - several ways to solve my problem (I forgot about RIndex)!

    I have decided to use: $lRow =~ s/\.[^.]+$/./;

    Let me try to decode the RegEx:

    Look for a period followed by an OPTIONAL number of non-period characters followed by the EOL char. Substitute all of these with a single period.

    The [^.] solves the greedy problem by making sure there are only non-period characters between things. Thanks!

      Actually, s/\.[^.]+$/./; says to match a period followed by one or more non-periods anchored at the end of a line. See perlre for more information.
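The difference between "optional" and "one or more" is easy to check. A small sketch (the variable names are mine, the sample strings come from the question): it applies the chosen substitution to a fragment, then tests whether the + form and the * form match a string that already ends in a period.

```perl
use strict;
use warnings;

my $fragment = "one of his best tracks. It's good. His other notew";
my $complete = "one of his best tracks. It's good.";

# With +, at least one non-period must follow the last period,
# so the trailing partial sentence is replaced by a single period.
(my $cut = $fragment) =~ s/\.[^.]+$/./;
print "$cut\n";

# A string that already ends in a period has nothing after that period,
# so the + version does not match and would leave the string untouched.
my $plus_matches = $complete =~ /\.[^.]+$/ ? 1 : 0;

# With *, zero characters are allowed, so the final period itself matches.
my $star_matches = $complete =~ /\.[^.]*$/ ? 1 : 0;

print "plus: $plus_matches, star: $star_matches\n";
```

For this job both quantifiers give the same final text. The + form simply skips the substitution when there is nothing to remove.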

      Isn't this going to break horribly if the sentence being discarded is about the band Mr. Mister? Or the Orbital song "Dr. Who"? Maybe Lingua::EN::Sentence would help. From the POD:

      The Lingua::EN::Sentence module contains the function get_sentences, which splits text into its constituent sentences, based on a regular expression and a list of abbreviations (built in and given).

      Certain well-known exceptions, such as abbreviations, may cause incorrect segmentations. But some of them are already integrated into this code and are being taken care of. Still, if you see that there are words causing the get_sentences() to fail, you can add those to the module, so it notices them.

      --
      Spring: Forces, Coiled Again!
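The abbreviation problem can be reproduced with core Perl alone. The sketch below is not how Lingua::EN::Sentence works internally. It is a rough illustration that walks backwards over periods with rindex and skips any period preceded by a word from a tiny, made-up abbreviation list (%abbrev and cut_at_sentence_end are hypothetical names).

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny abbreviation list; real text needs far more.
my %abbrev = map { $_ => 1 } qw(Mr Mrs Ms Dr St);

sub cut_at_sentence_end {
    my ($str) = @_;
    my $pos = length $str;

    # Visit periods from the end of the string towards the start.
    while (($pos = rindex($str, '.', $pos - 1)) >= 0) {
        # The word immediately before this period, if there is one.
        my ($word) = substr($str, 0, $pos) =~ /(\w+)$/;

        # Accept the period unless it closes a known abbreviation.
        return substr($str, 0, $pos + 1)
            unless defined $word && $abbrev{$word};
    }
    return $str;    # no usable period: leave the text alone
}

my $text = "It's good. His other band was Mr. Mister and they pl";

(my $naive = $text) =~ s/\.[^.]+$/./;
print "naive: $naive\n";
print "aware: ", cut_at_sentence_end($text), "\n";
```

The naive substitution stops at "Mr.", while the abbreviation-aware version falls back to "It's good.". A sentence that genuinely ends in an abbreviation would be skipped too, which is one reason a maintained module is the safer choice.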
Re: Truncating Last Sentence
by qq (Hermit) on May 14, 2004 at 23:12 UTC

    Are you sure you want to do this?

    It seems like you should really handle sentences that end with other punctuation, or quoted sentences, etc. But the problem rapidly becomes more complicated when you try.

    A simpler option is just to append ... to the truncated string, or at most to always cut off the last word in case it's partial. People know what's going on when they see a truncated sentence and ..., whereas if you silently cut something off you may change the meaning without giving any clue that you've done so.

    qq

      The use of 'dot dot dot' (aka ellipsis) is also a common convention in many contexts, so this is a good suggestion. Consider using it whenever you truncate text to fit a confined area.
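A minimal sketch of the ellipsis approach, assuming the limit is larger than three characters and that the "..." should count towards it. The helper name truncate_with_ellipsis is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: cut to at most $max characters, always drop the last
# word in case it is partial, and mark the cut with "...".
sub truncate_with_ellipsis {
    my ($str, $max) = @_;
    return $str if length($str) <= $max;

    my $cut = substr($str, 0, $max - 3);   # leave room for "..."
    $cut =~ s/\s+\S*$//;                   # drop the trailing word, if any whitespace exists
    return $cut . '...';
}

my $text = "one of his best tracks. It's good. His other noteworthy";
print truncate_with_ellipsis($text, 50), "\n";
```

If the cut text contains no whitespace at all, the substitution does nothing and the result is a plain hard cut followed by the ellipsis.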

Re: Truncating Last Sentence
by Enlil (Parson) on May 14, 2004 at 20:00 UTC
    Another way to do it.
    if ( length($lRow) > 1000 ) { $lRow =~ s/(.{1,999}\.).*/$1/ };
    Note that this as well as most of the others assume there is at least one period in the string (else the matches will fail).

    -enlil
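One way to cover the no-period case is to test the return value of s///, which is false when nothing matched. The sketch below is an assumption-laden variant rather than the exact regex above: the helper name is made up, the ^ anchor and the /s flag are my additions (without ^ the pattern can match further into a long period-free run and leave the string at full length; without /s the dot stops at newlines), and the limit is interpolated into the quantifier.

```perl
use strict;
use warnings;

# Hypothetical helper: sentence-aware cut with a hard-cut fallback.
sub truncate_or_hard_cut {
    my ($str, $max) = @_;
    return $str if length($str) <= $max;

    # Up to $max - 1 characters followed by a period, anchored at the start,
    # so the kept text is never longer than $max characters.
    my $limit = $max - 1;

    # s/// returns false if no period falls inside the window;
    # in that case fall back to a plain cut at $max characters.
    unless ($str =~ s/^(.{1,$limit}\.).*/$1/s) {
        $str = substr($str, 0, $max);
    }
    return $str;
}

my $text = "One of his best tracks. It's good. His other noteworthy songs are long.";
print truncate_or_hard_cut($text, 50), "\n";
print length(truncate_or_hard_cut("x" x 60, 50)), "\n";
```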

Node Type: perlquestion [id://353465]
Approved by kutsu
Front-paged by calin