PerlMonks  

Re: Extracting paragraphs from html

by fraktalisman (Hermit)
on Sep 11, 2005 at 12:55 UTC (#491073)


in reply to Extracting paragraphs from html

If you can't rely on certain tags (and I agree that you can't), the question is, what is the definition of a paragraph?

Where does it stop? Certainly not at a newline: we are dealing with HTML, and the source may contain many newlines that are not visible in the page as it is actually displayed.
So what could terminate a paragraph?

  • A closing tag of a block element, like </div> </p> etc.
  • More than one break, i.e. <br> <br> without words or images between them
  • The start of another paragraph or block element, like <div> <p> <iframe> <hr> etc.
  • An image <img>
  • The end of the page or document
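
The rules above can be sketched in a few lines. This is only a rough illustration under some assumptions, not a definitive implementation: it is written in Python with the standard `html.parser` module (the same idea carries over to a Perl parser), the names `BREAKING_TAGS`, `ParagraphExtractor` and `extract_paragraphs` are made up for this example, and the tag set contains only the tags mentioned in the list.

```python
from html.parser import HTMLParser

# Assumption: only the tags named in the list above end a paragraph.
# A real page would need a longer list (headings, tables, lists, ...).
BREAKING_TAGS = {"p", "div", "iframe", "hr", "img"}


class ParagraphExtractor(HTMLParser):
    """Collects text and cuts it wherever one of the rules above applies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._pending_br = 0  # consecutive <br> tags with no text between them

    def _flush(self):
        # Collapse all whitespace, since source newlines are not visible breaks.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self._pending_br = 0

    def handle_starttag(self, tag, attrs):
        if tag in BREAKING_TAGS:
            # Start of another block element, or an image.
            self._flush()
        elif tag == "br":
            self._pending_br += 1
            if self._pending_br >= 2:
                # More than one break in a row ends the paragraph.
                self._flush()
            else:
                # A single break is treated as ordinary whitespace.
                self._buffer.append(" ")

    def handle_endtag(self, tag):
        if tag in BREAKING_TAGS:
            # Closing tag of a block element.
            self._flush()

    def handle_data(self, data):
        if data.strip():
            # Real words between breaks reset the <br> counter.
            self._pending_br = 0
        self._buffer.append(data)

    def close(self):
        # End of the document terminates the last paragraph.
        super().close()
        self._flush()


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Calling `extract_paragraphs(page_source)` returns a list of whitespace-normalized strings, one per paragraph. The sketch does not skip the contents of `<script>` or `<style>` elements, so those would have to be filtered out separately.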

And for a pragmatic approach, you might want to specify a maximum length at which the extracted text is truncated. Some people don't use paragraphs at all; they just type or paste hundreds or thousands of words onto a page, as if they were writing a novel or had never understood the need for formatting.
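
The length limit could look something like this. Again a minimal sketch in Python: the function name `truncate_paragraph`, the default of 500 characters and the trailing `...` are arbitrary choices for illustration. It cuts at the last space before the limit so that words are not chopped in half.

```python
def truncate_paragraph(text, max_chars=500, ellipsis="..."):
    """Cut text to at most max_chars characters, preferably at a word boundary."""
    if len(text) <= max_chars:
        return text
    # Look for the last space at or before the limit.
    cut = text.rfind(" ", 0, max_chars + 1)
    if cut <= 0:
        # No usable space found: fall back to a hard cut.
        cut = max_chars
    return text[:cut].rstrip() + ellipsis
```

Note that the ellipsis is appended after the cut, so the result can be a few characters longer than `max_chars`.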
