PerlMonks  

Problem with a text-parsing regex

by ibm1620 (Hermit)
on May 07, 2022 at 19:04 UTC ( [id://11143647]=perlquestion )

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'm having difficulty with a regexp to split English text into the sort of elements I need.

Original plan was to chop up lines of text into whitespace-separated chunks, and separate out leading and trailing punctuation into separate variables, producing three values: $pre, $word, and $post. $post's final character would be the whitespace character separating it from the next chunk.

Several complications: I want to allow a "word" to be a hyphenated term (two-fer, Bob's-yer-uncle, will-o'-the-wisp); I want to allow embedded apostrophes (o'clock, it's); and I want to treat two or more hyphens in a row as equivalent to a whitespace character that separates the chunks.

The following almost works the way I want it to. I've noted where it fails. I can generally see what causes a failure, but fixing it always breaks something else.

As always, thanks for your generous help!

#!/usr/bin/env perl
use 5.010;
use warnings;
use strict;

my $n; # line no
while (my $x = <DATA>) {
    chomp $x;
    say $x;
    while (
        $x =~ m/
            ([[:punct:]]*)              # $1: leading punct marks
            (                           # $2: a "word" consisting of
                (?: [[:word:]']+ - )*   #     optional segments with
                                        #     embedded {'}s ending with
                                        #     single {-}
                [[:word:]]+             #     and ending in pure word characters
            )
            ([[:punct:]]* \ ? )         # $3: trailing punct marks ending
                                        #     with space (except at end of
                                        #     line?)
        /xxg
      )
    {
        printf " %3s {%s|%s|%s}\n", ++$n,
            # make whitespace visible
            map {(my $y = $_ // '') =~ tr/ /_/; $y} $1, $2, $3;
    }
}
__DATA__
"'Uncouth' about sums it up."
The word they will use is 'uncouth'.
"It's the old story."
It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
It's two o'clock--time for a nap.
Remember 45's?
What about (this)?
[Editor's note: blah blah] and so on...
A ... and B
I said--"What's the expression?"
Output:

"'Uncouth' about sums it up."
   1 {"'|Uncouth|'_}
   2 {|about|_}
   3 {|sums|_}
   4 {|it|_}
   5 {|up|."}
The word they will use is 'uncouth'.
   6 {|The|_}
   7 {|word|_}
   8 {|they|_}
   9 {|will|_}
  10 {|use|_}
  11 {|is|_}
  12 {'|uncouth|'.}
"It's the old story."
  13 {"|It|'}    <- should be {"|It's|_}
  14 {|s|_}
  15 {|the|_}
  16 {|old|_}
  17 {|story|."}
It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
  18 {|It|'}    <- same problem
  19 {|s|_}
  20 {|a|_}
  21 {|will-o'-the-wisp|--}    <- perfect!
  22 {|a|_}
  23 {|two-fer|--}
  24 {|and|_}
  25 {|Bob's-yer-uncle|_}
  26 {|at|_}
  27 {|four|_}
  28 {|o|'}    <- should be {|o'clock|.}
  29 {|clock|.}
It's two o'clock--time for a nap.
  30 {|It|'}
  31 {|s|_}
  32 {|two|_}
  33 {|o|'}    <- should be {|o'clock|--}
  34 {|clock|--}
  35 {|time|_}
  36 {|for|_}
  37 {|a|_}
  38 {|nap|.}
Remember 45's?
  39 {|Remember|_}
  40 {|45|'}    <-
  41 {|s|?}
What about (this)?
  42 {|What|_}
  43 {|about|_}
  44 {(|this|)?}
[Editor's note: blah blah] and so on...
  45 {[|Editor|'}    <-
  46 {|s|_}
  47 {|note|:_}
  48 {|blah|_}
  49 {|blah|]_}
  50 {|and|_}
  51 {|so|_}
  52 {|on|...}
A ... and B
  53 {|A|_}    <- correct to omit detached ellipsis
  54 {|and|_}
  55 {|B|}
I said--"What's the expression?"
  56 {|I|_}
  57 {|said|--"}    <- should be {|said|--}
  58 {|What|'}    <- should be {"|What's|_}
  59 {|s|_}
  60 {|the|_}
  61 {|expression|?"}

Replies are listed 'Best First'.
Re: Problem with a text-parsing regex
by hv (Prior) on May 07, 2022 at 20:13 UTC

    Here's one approach to solving the first problem: handling both "it's" and "will-o'-the-wisp":

    (                       # $2: a "word" consisting of one or more of
        (?:
            [[:word:]]      # a word character
        |
            # or hyphen, quote, or both
            # with word characters before and after
            (?<= [[:word:]] )
            (?: ' | - | '- | -' )
            (?= [[:word:]] )
        )+
    )

    For the double-hyphen, the easy solution is to replace it with space before parsing. The harder solution is to disallow it within the [[:punct:]]*, something like:

    # any punctuation excluding "-"
    # or "-" that is neither preceded nor followed by itself
    (?:
        (?!-) [[:punct:]]
    |
        (?<!-) - (?!-)
    )*

    With those two changes, I _think_ it passes all your test cases.
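For anyone who wants to try the two fragments above together, here is a minimal sketch that drops them into the original loop. The helper name `tokenize` and the array-ref return shape are my own assumptions, not something from the thread. Note that with this approach a run of two or more hyphens is skipped entirely, so it never shows up in the third field.

```perl
#!/usr/bin/env perl
use 5.026;    # say, and the xx regex modifier (Perl 5.26+)
use strict;
use warnings;

# tokenize() is a hypothetical helper name, not something from the thread.
# It returns a list of [pre, word, post] array refs for one line of text.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while (
        $line =~ m/
            # capture 1: leading punctuation, but never part of a "--" run
            ( (?: (?!-) [[:punct:]] | (?<!-) - (?!-) )* )

            # capture 2: word characters, with a lone apostrophe or hyphen
            # (or one of each) allowed between two word characters
            (
                (?:
                    [[:word:]]
                |
                    (?<= [[:word:]] )
                    (?: ' | - | '- | -' )
                    (?= [[:word:]] )
                )+
            )

            # capture 3: trailing punctuation (same rule) plus optional space
            ( (?: (?!-) [[:punct:]] | (?<!-) - (?!-) )* \ ? )
        /xxg
      )
    {
        push @tokens, [ $1, $2, $3 ];
    }
    return @tokens;
}

for my $line ( q{"It's the old story."}, q{It's two o'clock--time for a nap.} ) {
    say $line;
    for my $t ( tokenize($line) ) {
        # show spaces as underscores, as in the original post
        say '  {', join( '|', map { tr/ /_/r } @$t ), '}';
    }
}
```

If the "--" should be reported in the trailing field instead of being dropped, the third capture would need an extra alternative for it.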

    With a sufficiently recent perl, the experimental regex_sets feature should let you construct "any punctuation except hyphen" directly as a character class, which would be more efficient than /(?!-) [[:punct:]]/. I haven't yet worked out how to do that though - it's made harder by the special nature of '-' in character classes, doubly-special in char class arithmetic.
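One possible spelling with extended bracketed character classes, offered as a sketch and not as a tested recommendation: subtract a class holding only the hyphen from the punctuation class. This assumes Perl 5.18 or later; the feature was experimental before 5.36, hence the `no warnings` line. The helper name `is_punct_not_hyphen` is hypothetical.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';    # only needed before Perl 5.36

# Hypothetical helper: true if the single character is punctuation
# other than a hyphen. Inside (?[ ... ]) the "-" between the two
# bracketed classes is the set-subtraction operator; the hyphen that
# is being removed is escaped inside its own class.
sub is_punct_not_hyphen {
    my ($char) = @_;
    return $char =~ m/\A (?[ [[:punct:]] - [\-] ]) \z/x ? 1 : 0;
}

for my $char ( split //, '#%-&*' ) {
    printf "'%s' %s\n", $char, is_punct_not_hyphen($char) ? 'match' : 'NO match';
}
```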

      ... "any punctuation except hyphen" ...

      This can be expressed without experimental features by a "double-negative" character class trick:

      class of all characters that are
      [^-[:^punct:]]
        ^  ^
        |  |
        |  +--- and also not a not-punct (i.e., or is a [:punct:])
        |
        +--- not a hyphen

      Win8 Strawberry 5.8.9.5 (32)  Sat 05/07/2022 18:36:51
      C:\@Work\Perl\monks
      >perl
      use strict;
      use warnings;

      for my $char (split '', '#%-&*') {
          printf "'%s' %smatch \n",
            $char,
            $char =~ m{ \A [^-[:^punct:]] \z }xms ? '' : 'NO '
            ;
          }
      ^Z
      '#' match
      '%' match
      '-' NO match
      '&' match
      '*' match

      See perlrecharclass.

      Update: The double-negative trick also works with "traditional" \s \d \w etc. character classes that have complements. E.g., the pattern "any word (\w) character except an underscore" can be defined as [^_\W].
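The same kind of check for the `[^_\W]` variant, as a small sketch. The helper name `is_word_not_underscore` is made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character is a word character
# other than the underscore. [^_\W] reads as "not an underscore and not
# a non-word character".
sub is_word_not_underscore {
    my ($char) = @_;
    return $char =~ m{ \A [^_\W] \z }xms ? 1 : 0;
}

for my $char ( split //, 'a_9-Z' ) {
    printf "'%s' %smatch\n", $char, is_word_not_underscore($char) ? '' : 'NO ';
}
```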


      Give a man a fish:  <%-{-{-{-<

      Thank you -- I think you've nailed it.

      I'd never thought about using a character-at-a-time approach as you did to handle the first problem. I just assumed it would be much less efficient than trying to use [[:word:]]+, for example. But there's probably no basis for that assumption. (Premature optimization!) That could make it easier in the future for me to tackle these complicated scenarios.

      I use v5.34.1, and will take a look at regex_sets.

        I'd never thought about using a character-at-a-time approach as you did to handle the first problem. I just assumed it would be much less efficient than trying to use [[:word:]]+, for example.

        It will be less efficient - but I would always recommend solving the problem first, and worrying about optimization second.

        In the general case, a regular expression that has to invoke more regops (regexp operations) will usually be slower than one that invokes fewer; but the cost will be less than invoking more ops at the perl level.

Re: Problem with a text-parsing regex
by tybalt89 (Monsignor) on May 08, 2022 at 18:12 UTC

    Different way to handle --

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11143647
    use warnings;

    while (my $x = <DATA>) {
        chomp $x;
        print "\n$x\n";
        my $out = '';
        while (
            $x =~ m/
                ([[:punct:]]*)          # $1: leading punct marks
                (                       # $2: a "word" consisting of
                    [[:word:]]+         # word
                    (?:
                        (?: '-? | - )
                        [[:word:]]+     # and ending in pure word characters
                    )*
                )
                ( (?: --+ | [[:punct:]]* ) \ ? )    # $3: trailing punct marks ending
                                                    # or multi-dashs
                                                    # with space (except at end of
                                                    # line?)
            /xxg
          )
        {
            $out .= sprintf "{%s|%s|%s} ",
                # make whitespace visible
                map {(my $y = $_ // '') =~ tr/ /_/; $y} $1, $2, $3;
        }
        print "$out\n" =~ s/ $//r =~ s/.{65}\K /\n/gr;
    }
    __DATA__
    "'Uncouth' about sums it up."
    The word they will use is 'uncouth'.
    "It's the old story."
    It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
    It's two o'clock--time for a nap.
    Remember 45's?
    What about (this)?
    [Editor's note: blah blah] and so on...
    A ... and B
    I said--"What's the expression?"

    Outputs (changed to be able to see all output without scrolling):

    "'Uncouth' about sums it up."
    {"'|Uncouth|'_} {|about|_} {|sums|_} {|it|_} {|up|."}

    The word they will use is 'uncouth'.
    {|The|_} {|word|_} {|they|_} {|will|_} {|use|_} {|is|_} {'|uncouth|'.}

    "It's the old story."
    {"|It's|_} {|the|_} {|old|_} {|story|."}

    It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
    {|It's|_} {|a|_} {|will-o'-the-wisp|--} {|a|_} {|two-fer|--} {|and|_}
    {|Bob's-yer-uncle|_} {|at|_} {|four|_} {|o'clock|.}

    It's two o'clock--time for a nap.
    {|It's|_} {|two|_} {|o'clock|--} {|time|_} {|for|_} {|a|_} {|nap|.}

    Remember 45's?
    {|Remember|_} {|45's|?}

    What about (this)?
    {|What|_} {|about|_} {(|this|)?}

    [Editor's note: blah blah] and so on...
    {[|Editor's|_} {|note|:_} {|blah|_} {|blah|]_} {|and|_} {|so|_} {|on|...}

    A ... and B
    {|A|_} {|and|_} {|B|}

    I said--"What's the expression?"
    {|I|_} {|said|--} {"|What's|_} {|the|_} {|expression|?"}

    Did I get everything right?

      The instructions were that "--" was to be treated like a space, so presumably should not be part of the punctuation runs - I think it should print {|o'clock|}, for example, not {|o'clock|--}.

      I do prefer your version of the word parsing to mine, but I suspect (?: '-? | -'? ) is what's intended. (There aren't any examples of "word-'word" in the test cases though - I could probably come up with one in Dutch, but I imagine they're pretty rare in English.)

        I grepped my collection of text files (all English-language downloads from gutenberg.org) for -' and only found forty-'leven and fellow-'prentice. I've updated tybalt89's solution with your improvement.
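A quick way to see what the `-'?` change buys, as a sketch: compare the word sub-pattern as posted with the revised one on the two Gutenberg finds. The variable names `$posted` and `$revised` and the helper `first_word` are assumptions made for this illustration.

```perl
use strict;
use warnings;
use 5.026;    # for the xx modifier

# Word pattern as originally posted: an apostrophe (optionally followed by
# a hyphen) or a bare hyphen between runs of word characters.
my $posted = qr/ ( [[:word:]]+ (?: (?: '-? | -   ) [[:word:]]+ )* ) /xx;

# With the suggested change: a hyphen may also be followed by an apostrophe.
my $revised = qr/ ( [[:word:]]+ (?: (?: '-? | -'? ) [[:word:]]+ )* ) /xx;

# Hypothetical helper: return the first "word" the pattern finds in $text.
sub first_word {
    my ( $pattern, $text ) = @_;
    my ($word) = $text =~ $pattern;
    return $word;
}

for my $text ( "forty-'leven", "fellow-'prentice", "will-o'-the-wisp--a" ) {
    printf "%-20s posted: %-18s revised: %s\n",
        $text, first_word( $posted, $text ), first_word( $revised, $text );
}
```

The posted pattern stops at the hyphen in forty-'leven because the character after it is not a word character. The revised one carries on through the apostrophe.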

        The contents of $3 will contain a final space if one is present, so {|o'clock|--} is consistent with the instructions.

      Yes! Beautiful. Makes perfect sense. Thank you.

      Your final print is a real head-scratcher, but I like what it does and I'll noodle on it some more....
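For what it's worth, here is my reading of that final print, unrolled into separate steps as a sketch. The helper name `wrap_tokens` is hypothetical. Both substitutions use the /r flag, which returns the modified copy instead of changing the string in place, and that is what lets them be chained with `=~`.

```perl
use strict;
use warnings;
use 5.014;    # the r (non-destructive) substitution flag needs Perl 5.14

# Hypothetical helper: the same two substitutions as the final print,
# spelled out one step at a time.
sub wrap_tokens {
    my ($out) = @_;
    my $text = "$out\n";

    # Step 1: drop the single space left behind after the last token.
    # "$" matches just before the trailing newline.
    my $trimmed = $text =~ s/ $//r;

    # Step 2: ".{65}" consumes 65 characters and "\K" then discards them
    # from the match, so only the space that follows is replaced by a
    # newline. The effect is a line break at the first space at or after
    # column 65, repeated for the rest of the string.
    my $wrapped = $trimmed =~ s/.{65}\K /\n/gr;

    return $wrapped;
}

print wrap_tokens( '{|aaaa|_} ' x 10 );
```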

Re: Problem with a text-parsing regex
by Fletch (Bishop) on May 07, 2022 at 20:05 UTC

    Rather than trying to roll your own regex (and depending on what you're trying to do with this next) you probably want to look at CPAN and search for NLP modules (Natural Language Processing) instead. Those are likely going to do what you want WRT removing not-words as well as being able to give you more info about the words/tokens they extract.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Oddly enough, I found almost nothing of use in CPAN related to NLP! That's very surprising. (See, for example, https://metacpan.org/pod/Text::NLP)

      In any event, there's more idiosyncratic processing of the words and surrounding punctuation, so it's doubtful that any CPAN module would exactly fit my needs.

Re: Problem with a text-parsing regex
by AnomalousMonk (Archbishop) on May 07, 2022 at 22:54 UTC

    I would have thought your first step would have been to write a unit test (see Test::More and friends) specifying exactly what you want to parse from things like "'Uncouth' about sums it up.", "It's the old story.", etc.
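A minimal sketch of what such a test file might look like, assuming a hypothetical `tokenize` sub wrapped around the regex tybalt89 posted in this thread. The expected values are copied from the output shown there, with underscores turned back into spaces.

```perl
use strict;
use warnings;
use 5.026;    # for the xx modifier
use Test::More;

# Hypothetical wrapper around the regex posted elsewhere in this thread.
# Returns a list of [pre, word, post] array refs.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while (
        $line =~ m/
            ( [[:punct:]]* )                            # leading punctuation
            ( [[:word:]]+ (?: (?: '-? | - ) [[:word:]]+ )* )   # the word
            ( (?: --+ | [[:punct:]]* ) \ ? )            # trailing part
        /xxg
      )
    {
        push @tokens, [ $1, $2, $3 ];
    }
    return @tokens;
}

# Each case: input line, then the expected list of triples.
my @cases = (
    [
        q{"It's the old story."},
        [ [ q{"}, "It's", ' ' ], [ '', 'the', ' ' ], [ '', 'old', ' ' ], [ '', 'story', q{."} ] ],
    ],
    [
        q{It's two o'clock--time for a nap.},
        [
            [ '', "It's", ' ' ], [ '', 'two', ' ' ], [ '', "o'clock", '--' ],
            [ '', 'time', ' ' ], [ '', 'for', ' ' ], [ '', 'a', ' ' ], [ '', 'nap', '.' ],
        ],
    ],
    [
        q{A ... and B},
        [ [ '', 'A', ' ' ], [ '', 'and', ' ' ], [ '', 'B', '' ] ],
    ],
);

is_deeply( [ tokenize( $_->[0] ) ], $_->[1], $_->[0] ) for @cases;

done_testing;
```

With the expectations written down like this, any later change to the regex can be checked against every sample line at once.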


    Give a man a fish:  <%-{-{-{-<

Node Type: perlquestion [id://11143647]
Approved by davies
Front-paged by kcott