PerlMonks  

Please read How do I post a question effectively? In particular, note that you should be providing desired output as well as some code that didn't work for you. I honestly have no idea what you mean by "splitting based on number of words, sentences or letters". If you can't write it in code, write it in pseudo-code and be explicit about your algorithm. The more specificity you can provide, the more inclined people will be to help and the better the help will be.

The general challenge you describe is not easily solved, since English is chock full of idioms and peculiarities. Given the assigned spec, I would probably split on one or more whitespace characters that are preceded by a period, question mark or exclamation point, but not preceded by a title (Mr., Dr., Mrs., Ms., esq., ...). This is by no means comprehensive, but it should get you through this task. Read perlretut and see if you can translate the above spec into a regular expression. Of particular interest should be Looking ahead and looking behind. Alternatively, you could simply split with /\.\s+/ and then stitch entries back together wherever an entry ends in a title.
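For what it's worth, here is a minimal sketch of both ideas, not a definitive sentence splitter. The title list, the sub names (split_lookbehind, split_and_stitch) and the sample text are all made up for illustration. Two assumptions to note: older perls reject variable-length lookbehind, so each title gets its own fixed-length negative lookbehind; and the stitching variant splits on /(?<=\.)\s+/ rather than /\.\s+/ so the periods survive the split.

```perl
use strict;
use warnings;

# Hypothetical title list; extend it to suit your text.
my @titles = qw(Mr Mrs Ms Dr esq);

# One fixed-length negative lookbehind per title, e.g. (?<!\bMr\.)
my $not_title = join '', map { my $t = quotemeta($_); "(?<!\\b$t\\.)" } @titles;

# Whitespace preceded by . ? or ! but not by a title.
my $boundary = qr/(?<=[.?!])$not_title\s+/;

# Approach 1: let the lookbehinds decide where a sentence ends.
sub split_lookbehind {
    my ($text) = @_;
    return split $boundary, $text;
}

# Approach 2: split after every period, then glue a piece onto
# the next one whenever it ends in a title. Pieces are rejoined
# with a single space, so the original whitespace is not preserved.
sub split_and_stitch {
    my ($text) = @_;
    my $title_re = join '|', map { quotemeta($_) } @titles;
    my @sentences;
    my $pending = '';
    for my $piece (split /(?<=\.)\s+/, $text) {
        $pending = length($pending) ? "$pending $piece" : $piece;
        next if $pending =~ /\b(?:$title_re)\.\z/;   # trailing title: keep going
        push @sentences, $pending;
        $pending = '';
    }
    push @sentences, $pending if length($pending);
    return @sentences;
}

my $text = "Dr. Smith met Mrs. Jones at noon. Was she late? No! They left early.";
print "[$_]\n" for split_lookbehind($text);
print "--\n";
print "[$_]\n" for split_and_stitch($text);
```

Both routines keep "Dr. Smith met Mrs. Jones at noon." together as one sentence. The second only looks at periods, so "Was she late? No! They left early." comes back as a single entry there. Neither copes with abbreviations outside the title list (e.g., "etc.", initials, decimals), which is the sort of peculiarity mentioned above.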

"How do you take paragraph or large amount of text and break it into sentences (preferably using Ruby)..."
I think perhaps you've come to the wrong community. You should stay anyway, though, since we're pretty cool and generally helpful.

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.


In reply to Re: Split a paragraph based on the number of letters by kennethk
in thread Split a paragraph based on the number of letters by Anonymous Monk
