PerlMonks  

Re: Extracting paragraphs from html

by fraktalisman (Hermit)
on Sep 11, 2005 at 12:55 UTC

in reply to Extracting paragraphs from html

If you can't rely on particular tags (and I agree that you can't), the question becomes: what is the definition of a paragraph?

Where does it stop? Certainly not at a newline: we are dealing with HTML, and the source code may contain many newlines that never show up in the rendered page.
So what could terminate a paragraph?

  • A closing tag of a block element, like </div>, </p> etc.
  • Two or more consecutive breaks, i.e. <br> <br> with no words or images between them
  • The start of another paragraph or block element, like <div>, <p>, <iframe>, <hr> etc.
  • An image <img>
  • The end of the page or document
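
The rules above can be sketched with a small event-driven parser. This is only a rough illustration in Python using the standard html.parser module (Perl's HTML::Parser offers the same kind of start/end/text callbacks). The names BLOCK_TAGS, ParagraphExtractor and extract_paragraphs are made up for this example, and the tag set is an assumption you would have to extend for real pages.

```python
from html.parser import HTMLParser

# Assumed set of tags that end the current paragraph when they open or close.
# Only the tags named in the list above are included; extend as needed.
BLOCK_TAGS = {"p", "div", "iframe", "hr", "img"}


class ParagraphExtractor(HTMLParser):
    """Collects runs of text separated by the terminators listed above."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buf = []
        self._pending_br = False  # True after a <br> with no text since

    def _flush(self):
        # Collapse all whitespace, since source newlines are not visible.
        text = " ".join("".join(self._buf).split())
        if text:
            self.paragraphs.append(text)
        self._buf = []
        self._pending_br = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "br":
            if self._pending_br:
                self._flush()          # second break in a row
            else:
                self._pending_br = True
                self._buf.append(" ")  # a single break is just whitespace

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._pending_br = False   # real text between breaks
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()                  # end of document ends the paragraph


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

For example, extract_paragraphs("<div>First\nline<br>still first<br> <br>Second<img src='x.png'>Third</div>Fourth") gives four paragraphs: the newline and the single break stay inside the first one, while the double break, the image, the closing </div> and the end of the input each terminate one. The sketch does not skip the contents of <script> or <style> elements; a real extractor would need to.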

And for a pragmatic approach, you might want to specify a maximum length beyond which the text is truncated. There are people who don't use paragraphs at all: they just type or copy hundreds or thousands of words onto a page, as if they were writing a novel or had never understood the need for formatting.
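
A minimal sketch of such a length limit, again in Python for illustration. The function name truncate_words, the default of 500 characters and the trailing "..." are all assumptions made for this example, not anything prescribed above.

```python
def truncate_words(text, max_chars=500, ellipsis="..."):
    """Cut text to at most max_chars characters, preferring a word boundary.

    The ellipsis is appended after the cut, so the result can be a few
    characters longer than max_chars.
    """
    if len(text) <= max_chars:
        return text
    # Look for the last space at or before the limit.
    cut = text.rfind(" ", 0, max_chars + 1)
    if cut <= 0:
        cut = max_chars  # no usable space: hard cut in the middle of a word
    return text[:cut].rstrip() + ellipsis
```

Cutting at the last space before the limit avoids chopping a word in half; only a single very long word falls back to a hard cut.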
