Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

stripped punctuation

by thealienz1 (Pilgrim)
on Oct 06, 2005 at 19:30 UTC ( #498023=perlquestion: print w/replies, xml ) Need Help??

thealienz1 has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I have written a basic word count script. It incoporates many options that can be used to determing a word, divied a word, remove punctuation, etc... It works, but I am afraid that its not efficient in one aspect.

As of right now, when a word is found in a document punctuation that appears at the beginning and end of the word can be removed. With the user define what punctuation they want removed in a string divided by a space.

  • "Wilmer!" -> Wilmer

#!/usr/bin/perl my $stripped = '" !'; my $word = '"Wilmer!"'; foreach my $punc (split(/ / , $stripped)) { #print "punc: $punc\n"; $word =~ s/^$punc//; $word =~ s/$punc$//; } print $word;

I want to know what other monks would do. What is the most efficient method that you can think of? Or perhaps lead me into the right direction, so that I can right the code myself because it is my project and not yours.

Replies are listed 'Best First'.
Re: stripped punctuation
by Skeeve (Vicar) on Oct 06, 2005 at 19:45 UTC
    My basic approach would be to remove all punctuation, or better: All non-word characters and start counting then:
    s/[^a-z]/ /g; @words= split ' ';
    Of course this does not take into account:
    1. words that are broken at a line-
      end ;-)
    2. foreign language characters
    3. your definition of a word. Maybe you render ab4711xya word. With this it will be 2.

    $\=~s;s*.*;q^|D9JYJ^^qq^\//\\\///^;ex;print
      This also breaks for simple, commonly used words. Like "don't." That's two words for you right there. Or "co-worker." Or...

      I'd suggest that the simplistic approach would be better off by splitting on white space first, then simply removing non-word characters (and I'd use the [:alpha:] designation as opposed to a-z).

Re: stripped punctuation
by ambrus (Abbot) on Oct 06, 2005 at 20:58 UTC

    Basically, you have to take whitespace-separated words, removing non-alphanumeric (i.e. punctation) characters from the beginning or the end. Punctation is allowed inside a word. What makes this a bit more complicated is that in some languages, the space os omitted around em dashes, so you have to consider em-dashes as whitespace too. Thus, it's easier to define what separates words than what counts as a word. Here's a simple solution.

    #!perl use warnings; use strict; my # I hate to say this, as it's not mine $text = q{ "'My Dear Mr. Sherlock Holmes:--"Lord Backwater tells me that I may place implicit reliance upon your judgement and discretion. I have determined, therefore, to call upon you and to consult you in referenc +e to the very painful event which has occurred in connection with my wedding. Mr. Lestrade, of Scotland Yard, is acting already in the matt +er, but he assures me that he sees no objection to your co-operation, and that he even thinks that it might be of some assistance. I will call at four o'clock in the afternoon, and, should you have any other engagement at that time, I hope that you will postpone it, as this matter is of paramount importance. Yours faithfully, ST. SIMON.' -- Sir Arthur Conan Doyle, The Adventures of Sherlock Holmes. 1892. }; use locale; my $stuff_between_words = qr{ [^[:alnum:][:space:]]* (?: [[:space:]]+ | --+ | ^ ) [^[:alnum:]]* }x; my @words = split $stuff_between_words, $text; shift @words; # remove the empty word from the beginning print join(" ", map "<<$_>>", @words), "\n"; __END__

    However, if I'd really need a fast solution, I'd use flex.

    Update: on second thought, it'd be better to use just -- instead of --+.

      Thanks I will take a look at flex.
Re: stripped punctuation
by GrandFather (Sage) on Oct 06, 2005 at 19:42 UTC
    use strict; use warnings; my $stripped = qr/["!]/; my $word = '"Wilmer!"'; $word =~ s/^(?:$stripped)+(.*?)(?:$stripped)+$/$1/; print $word;

    Perl is Huffman encoded by design.

      After looking at your regexp I took to simplifying my needs with:

      $word =~ s/^[^\w\d]+(.*?)[^\w\d]+$/$1/;

      My intention is remove everything that is not a letter or number up to the first letter, pull everything up till the last non letter or digit. When I look at it it makes sense, but my testing it does not work.

      Update

      It works on the simple example I gave for 'Wilmer!'. I was running word count with a script as the input and the odd results I was seeing were the syntax in the script. I apologize.

        Except you want to strip punctuation from the beginning or end. The above regex only works if there is punctuation at both beginning and end.

        If removing any trailing/leading punctuation is in fact your goal, what about something like:

        use strict; use warnings; my $word = 'Wilmer",'; $word =~ s/^ \W*? # ignore any leading punc ( \w .*? ) # swallow everything lazily (?: \W+ )? $ # ignore any trailing punc /$1/x; print $word;

        Update: Mind you, at that point, a much simpler regex will likely serve you better in terms of speed and readability:

        $word =~ s/(?:^\W+)|(?:\W+$)//g;

        Final update - benchmark:

        Rate capture non_capture capture 16561/s -- -28% non_capture 22861/s 38% --

        The second suggestion is about 30% faster, on average.

        Additionally, \w doesn't mean what you think it means.

Re: stripped punctuation
by SamCG (Hermit) on Oct 06, 2005 at 20:00 UTC
    Update: Sorry, I realized after posting this is not entirely clear. I'd also strip the punctuation and then do the split on spaces, I just would probably think of the transliteration operator before the substitute (because I wouldn't necessarily be thinking regex), and I'd probably stick with it, thinking it might be faster. Does anyone know if it is?

    One possibility:
    #! perl use strict; $_="Wilmer'?"; tr/a-zA-Z//csd; print;
Re: stripped punctuation
by radiantmatrix (Parson) on Oct 06, 2005 at 21:17 UTC

    GNU's wc command exists for any major platform. Simply call it from inside your Perl application and let it do the heavy lifting.

    <-radiant.matrix->
    A collection of thoughts and links from the minds of geeks
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law
      This does a total word count to my understanding. I am looking for individual word counts. I did not mention that though, but thanks anyways.
Re: stripped punctuation
by graff (Chancellor) on Oct 07, 2005 at 00:52 UTC
    You said:
    the user define what punctuation they want removed
    Presumably, you would have a "default" set (e.g. whatever Perl defines as matching "\W", and maybe "_" as well), which would be suitable in most cases. Actually, a better default might be \P{L} which refers to "all non-letters" (see 'perldoc perlunicode').

    If it's important for your application to allow the user to specify a "cusomized" set for some particular case, you face a variety of tricky issues:

    • Is it easier for the user to specify the particular non-alphanumerics that should be kept, rather than specify all the ones that should be removed? (I think maybe so.)
    • Many of the characters involved have special meanings in regexes (period, question-mark, plus, dollar-sign and some others). This isn't a killer, but you need to be mindful of it.
    • If some "non-letter" characters are to be kept when they occur at a word boundary, you might have problems when other "non-letters" (that are to be discarded) co-occur with the ones being kept.

      For example, suppose that hyphen is to be kept at word boundaries, but parens at word boundaries should be removed; in a string like " word)- " it will be hard to remove the paren, because it lies "inside" the hyphen, which is being retained; that paren is "word-internal". Maybe the user just needs to specify which non-letter characters to keep when they occur next to a letter (or maybe it's more complicated than that).

    Anyway, in the default case, it really can be very simple (and this might even be the quickest):

    use strict; # make up some data my $line = "('The text.')-- 5 o'clock! What's cookin' with the text da +ta?"; # split on whitespace, keep only tokens that contain at least on lette +r my @words = grep /\p{L}/i, split ' ', $line; my %wcount; for ( @words ) { # using $_ will modify @words "in-place" s/^\P{L}+//; # remove initial non-letters s/\P{L}+$//; # remove final non-letters $wcount{lc()}++; # normalize to lower-case-only } print "$wcount{$_}\t$_\n" for ( sort keys %wcount ); __OUTPUT__ 1 cookin 1 data 1 o'clock 2 text 2 the 1 what's 1 with
    Note that even though I was using a regex symbol (\P{L}) that is documented as a "unicode" tool for regexes, I can use it on plain old ASCII data. (If you have non-ASCII data, make sure it's in utf8 before processing it -- see Encode if you have non-utf8, non-ASCII text data.)
Re: stripped punctuation
by thundergnat (Deacon) on Oct 07, 2005 at 01:31 UTC

    Some things which often fail to get taken into account:

    Words with internal apostrophes. (don't won't, can't shouldn't, you'll, it's, etc.)

    Words with non ASCII characters. á, ń, ˙, etc.

    This does:

    ########################################################## #! /usr/bin/perl use warnings; use strict; my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/; my %count; while (<DATA>) { my $line = lc $_; while ($line =~ /($word('$word)?)/g){ $count{$1}++; } } printf "%15s %5d\n", $_, $count{$_} for sort keys %count; __DATA__ "Hello World!" "Oh poor Yorick, his world I knew well yes I did" "Words with internal apostrophes. (don't won't, can't shouldn't, you'l +l, it's, etc.)" "Seńor Montóya's resüme isn't ápropos."

      The ultimate case would be to handle a word count in multiple languages, btu I believe that there are too many factors to take into account. Thanks for the suggestion though.

Re: stripped punctuation
by Moron (Curate) on Oct 07, 2005 at 16:54 UTC
    Is it me or is it really as simple as:
    $word =~ s/_//g; $word =~ /(\w+)/ and $word = $1;
    Friday night (=rapidly evaporating) minds vaguely want to know!

    -M

    Free your mind

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://498023]
Approved by Happy-the-monk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (8)
As of 2019-10-21 20:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Notices?