http://www.perlmonks.org?node_id=11141609


in reply to Splitting multiline string into words, the stuff between words, and newlines

You can also use split for that, so you don't need a separate regular expression to match the non-words:
my @fragments = grep length, split /(\b{wb}.+?\b{wb}|\n+)/, $book;
So you get words, runs of newlines, and then everything else.

Re^2: Splitting multiline string into words, the stuff between words, and newlines
by ibm1620 (Hermit) on Feb 24, 2022 at 12:50 UTC
    This looks to me like it should work, but it splits the strings of non-words into separate characters!

    "For example ...\n" -> {For}{_}{example}{_}{.}{.}{.}{$}
      That is because \b{wb} also matches between each of those characters.
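A minimal sketch that reproduces the output reported above, assuming Perl 5.22 or later (the first release with \b{wb}). The sample string is the one from the parent post; the display loop and the $shown variable are only there for illustration.

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} is available from Perl 5.22 on

my $book = "For example ...\n";

# The capturing group makes split return the matched pieces as well.
my @fragments = grep length, split /(\b{wb}.+?\b{wb}|\n+)/, $book;

# Show spaces as "_" and newlines as "$", as in the post above.
for my $fragment (@fragments) {
    (my $shown = $fragment) =~ tr{ \n}{_$};
    print "{$shown}";
}
print "\n";    # {For}{_}{example}{_}{.}{.}{.}{$}
```

Since .+? stops at the first boundary it finds, and there is a boundary on both sides of every space and every dot, each of them becomes its own fragment.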

      This seems to solve the issue:

      my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

      But my knowledge of Unicode and of the \b{wb} semantics is rather limited, so that may have other issues.
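A small sketch of how the revised pattern behaves, again assuming Perl 5.22 or later. The sample sentence is made up for illustration; it is not from the original question.

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} is available from Perl 5.22 on

# Hypothetical sample text with a contraction and a possessive.
my $book = "Don't touch Bob's book ...\n";

# A "word" now has to start with a \w character, so runs of
# non-word characters are left over as the fields between matches.
my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

for my $fragment (@fragments) {
    (my $shown = $fragment) =~ tr{ \n}{_$};
    print "{$shown}";
}
print "\n";    # {Don't}{_}{touch}{_}{Bob's}{_}{book}{_...}{$}
```

The contraction and the possessive stay in one piece because \b{wb} does not report a boundary around an apostrophe that sits between two letters, and " ..." now comes back as a single fragment.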

        Not sure 'cause that's 'bout words also including non \w characters.

        And some of 'em even start on apostrophe ;)
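To illustrate the point about leading apostrophes, here is a sketch using the \w-anchored pattern from the parent post (Perl 5.22 or later assumed; the sample string is just an illustration):

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} is available from Perl 5.22 on

my $book = "some of 'em";

# Same pattern as in the parent post: a word must begin with \w.
my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

print join('', map { "{$_}" } @fragments), "\n";
# {some}{ }{of}{ '}{em}
```

The leading apostrophe cannot start a match, so it ends up in the preceding non-word fragment instead of in the word.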

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        For my purposes, this is fine. I'm mainly interested in capturing possessives and contractions.