Re: Extracting paragraphs from html

by fraktalisman (Hermit)
on Sep 11, 2005 at 12:55 UTC

in reply to Extracting paragraphs from html

If you can't rely on certain tags (and I agree that you can't), the question is, what is the definition of a paragraph?

Where does it stop? Certainly not at a newline: we are dealing with HTML, and the source may contain many newlines that have no visible effect on the page as it is actually displayed.
So what could terminate a paragraph?

  • A closing tag of a block element, such as </div> or </p>
  • More than one break, i.e. <br> <br> without words or images between them
  • The start of another paragraph or block element, like <div> <p> <iframe> <hr> etc.
  • An image <img>
  • The end of the page or document
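To make the list above concrete, here is a minimal sketch in Python using only the standard library's html.parser. The tag set BLOCK_TAGS, the class name ParagraphExtractor and the helper extract_paragraphs are my own assumptions for illustration, not anything from the original question. It ignores details such as skipping <script> and <style> contents, so treat it as a starting point rather than a finished solution.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive set of tags treated as paragraph boundaries.
# <img> is included because the list above counts an image as a terminator.
BLOCK_TAGS = {"p", "div", "iframe", "hr", "img", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6", "ul", "ol", "li", "table"}


class ParagraphExtractor(HTMLParser):
    """Collects text runs, cutting at the terminators listed above."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []      # text pieces of the paragraph being built
        self._pending_br = 0   # consecutive <br> tags with no text between

    def _flush(self):
        # Collapse all whitespace, newlines included, since newlines in
        # the source are not paragraph boundaries.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self._pending_br = 0

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._pending_br += 1
            if self._pending_br >= 2:
                self._flush()             # more than one break in a row
            else:
                self._buffer.append(" ")  # a single break is only a space
        elif tag in BLOCK_TAGS:
            self._flush()                 # start of another block, or an image

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()                 # closing tag of a block element

    def handle_data(self, data):
        if data.strip():
            self._pending_br = 0          # real text interrupts a run of breaks
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()                     # end of the document


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

For example, extract_paragraphs("<div>First\nparagraph</div>Second<br>line<br><br>Third") treats the newline and the single <br> as plain whitespace and cuts only at </div> and at the double break.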

And for a pragmatic approach, you might want to specify a maximum length beyond which the text is truncated. Some people don't use paragraphs at all; they just type or paste hundreds or thousands of words onto a page, as if they were writing a novel or had never understood the need for formatting.
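A rough sketch of such a length cap, in Python: the function name truncate, the default of 500 characters and the "..." suffix are arbitrary assumptions of mine, chosen only to illustrate the idea. It cuts at the last space inside the limit so that words are not split, and falls back to a hard cut if there is no space at all.

```python
def truncate(text, max_chars=500):
    """Cut text to at most max_chars characters, preferring a word boundary."""
    if len(text) <= max_chars:
        return text
    # Look for the last space at or before the limit.
    cut = text.rfind(" ", 0, max_chars + 1)
    if cut <= 0:
        cut = max_chars  # no usable space: hard cut in the middle of the word
    return text[:cut].rstrip() + "..."
```

Note that the suffix is appended after the cut, so the result can be up to three characters longer than max_chars.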

