Truncating HTML early
by nop (Hermit)
on Mar 17, 2002 at 09:44 UTC
nop has asked for the wisdom of the Perl Monks concerning the following question:
I am working with a certain field that is sometimes quite long. It is stored as TEXT in SQL Server, and may go on for 20 pages in extreme cases.
I need to repurpose this information for a different use, and it has to be shorter. My goal is to chop the HTML somewhere around the first 1000 words or so, and if the original text was longer (i.e., if my truncation removed content), append a message, a "Click here for rest of article" sort of deal.
My question is, how do I truncate HTML cleanly? By "cleanly," I mean so that, after my truncation, my chopped-and-patched HTML is well-formed.
Clearly I can use HTML::Parser to avoid chopping a tag in half, but how do I know I'm not in the middle of a table or inside the label of a link when I chop?
Since it is possible I'm always inside a tag (say the entire field is wrapped in open and close SPAN tags), probably my best bet is to close all open tags when I truncate.
I could keep track of my open tags using a stack, pushing on opens and popping off closes (hmmm... that would also let me check for badly-nested tags at the same time, which I know will reveal problems), but then how do I know when a tag doesn't need a closing tag? That is, if I blindly push tags when they open, my stack will be loaded with IMG tags, HR tags, P tags, etc.
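To make the stack idea concrete, here is a minimal sketch using HTML::Parser's version 3 handler API. It is an illustration under stated assumptions, not a finished solution: truncate_html and its three arguments are hypothetical names, and the %EMPTY list is assumed from the elements that HTML 4.01 declares as EMPTY. It makes no attempt to handle elements whose end tag is merely optional (P, LI, TD and friends), and it drops comments and declarations because no handler is registered for them.

```perl
use strict;
use warnings;
use HTML::Parser;

# Assumption: the elements HTML 4.01 declares EMPTY (they never take a close tag).
my %EMPTY = map { $_ => 1 }
    qw(area base basefont br col frame hr img input isindex link meta param);

# Hypothetical helper: keep about $max_words words of text, close every
# element still open at that point, and append $more if anything was cut.
sub truncate_html {
    my ($html, $max_words, $more) = @_;
    my ($out, $words, $done) = ('', 0, 0);
    my @open;    # stack of currently open element names

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            return if $done;
            $tag =~ s{/\z}{};                  # tolerate a name reported as "br/"
            $out .= $text;
            push @open, $tag unless $EMPTY{$tag};
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            return if $done;
            # Nearest matching open element, searching from the top of the stack.
            my ($i) = grep { $open[$_] eq $tag } reverse 0 .. $#open;
            return unless defined $i;          # stray close tag: drop it
            splice @open, $i;                  # forget it and anything opened after it
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            return if $done;
            # Split into words and the whitespace runs between them.
            for my $tok (split /(\s+)/, $text) {
                if ($tok =~ /\S/) {
                    if ($words >= $max_words) { $done = 1; last; }
                    $words++;
                }
                $out .= $tok;
            }
        }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;

    $out =~ s/\s+\z// if $done;                # tidy whitespace left at the cut
    $out .= "</$_>" for reverse @open;         # close whatever is still open
    $out .= $more if $done;
    return $out;
}
```

Called as truncate_html($html, 1000, $link_html), where $link_html is whatever "Click here for rest of article" markup is wanted, this returns the input unchanged when it is short and properly closed, and otherwise cuts at a word boundary in the text. Because only the text handler counts words, markup of any length never counts toward the limit.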
Can someone point me to a list of tags that don't need to be closed, or, better, offer a better way to approach this problem? I didn't find anything here or on CPAN.