Truncate HTML String

amacks has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to extract the beginning of an HTML string in a way that returns legal HTML, basically to generate a summary page of blog entries for the index. For example, given this entry:

First a technological update - The code that drives this site is available for free on <a href="https://github.com/amacks/vatican_mss">GitHub</a>.  I've just merged in a rather complex change to create proper shelfmark sorting, fixing things like numbers-stored-as-strings and handling roman numerals.  Two problems yet unfixed are Fonds <strong>P.I.O</strong>, with the middle "I" reading as a roman numeral, and <strong>Arch.Cap.S.Pietro</strong> where sub-set "I" is read as roman 1 and everything gets confused.</p>...

A naive Perl substr($_,0,100) returns text ending in "free on <a href="htt", which cuts off in the middle of a tag and so is not legal HTML.



Re: Truncate HTML String
by Your Mother (Archbishop) on Dec 04, 2019 at 15:26 UTC

use strict;
use warnings;
use HTML::Truncate;

# The here-doc below ends at the first empty line.
my $snippet = <<"";
First a technological update - The code that drives this site is available for free on <a href="https://github.com/amacks/vatican_mss">GitHub</a>.  I've just merged in a rather complex change to create proper shelfmark sorting, fixing things like numbers-stored-as-strings and handling roman numerals.  Two problems yet unfixed are Fonds <strong>P.I.O</strong>, with the middle "I" reading as a roman numeral, and <strong>Arch.Cap.S.Pietro</strong> where sub-set "I" is read as roman 1 and everything gets confused.</p>

my $ht = HTML::Truncate->new();
$ht->chars(100);    # the limit applies to displayed characters, not markup
print $ht->truncate($snippet), $/;

__END__
First a technological update - The code that drives this site is available for free on <a href="https://github.com/amacks/vatican_mss">GitHub</a>. I've&#8230;

HTML::Truncate. It's been a long time since I used it for anything, but it helped me out with similar needs back then. There are lots of low-level tools for this kind of thing, but with those you end up doing a lot of #text character counting and such yourself.

And that’s why this one breaks much later than your raw substr; it’s only counting displayed characters, not the HTML markup.
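To make the displayed-versus-raw distinction concrete, here is a minimal sketch in core Perl. The helper visible_length is hypothetical (it is not part of HTML::Truncate), and it strips tags with a crude regex, so it assumes simple markup with no ">" inside attribute values or comments.

```perl
use strict;
use warnings;

# Hypothetical helper: count characters outside anything that looks
# like a tag. Entities such as &amp; are counted as raw characters.
sub visible_length {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;    # crude tag strip, assumption
    return length $text;
}

my $raw = 'available for free on '
        . '<a href="https://github.com/amacks/vatican_mss">GitHub</a>.';

# The link markup inflates the raw length but adds nothing on screen.
printf "raw: %d, visible: %d\n", length($raw), visible_length($raw);
```

A substr on the raw string spends most of its budget on the href, which is why it runs out so much earlier than a count of displayed characters does.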

Update: if we correct your omission of the opening paragraph tag in the input, this is the output:

<p>First a technological update - The code that drives this site is available for free on <a href="https://github.com/amacks/vatican_mss">GitHub</a>. I've&#8230;</p>
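The output above closes the <p> that was still open at the cut. For anyone curious about the low-level route, the following is a rough sketch of that idea in core Perl: count only text, remember which tags are open, and close them after truncating. The function truncate_html_sketch and its list of void elements are my own assumptions for illustration; it is not how HTML::Truncate is implemented, it does not try to match that module's exact output, and it only copes with simple, well-nested markup.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Truncate. Keeps at most $max
# text characters, then closes whatever tags are still open.
sub truncate_html_sketch {
    my ($html, $max) = @_;
    my %void = map { $_ => 1 } qw(br hr img input meta link);  # assumed list
    my ($out, $seen, @open) = ('', 0);

    # Walk the string as alternating tags and runs of text.
    while ($html =~ /\G(<[^>]*>|[^<]+)/gc) {
        my $tok = $1;
        if ($tok =~ m{^<(/?)([A-Za-z][A-Za-z0-9]*)}) {
            my ($closing, $name) = ($1, lc $2);
            if ($closing) {
                pop @open if @open && $open[-1] eq $name;
            }
            elsif (!$void{$name} && $tok !~ m{/>\z}) {
                push @open, $name;          # remember tags we must close
            }
            $out .= $tok;                   # markup is never counted
        }
        elsif ($tok =~ /^</) {
            $out .= $tok;                   # comments etc. pass through
        }
        else {
            my $room = $max - $seen;
            if (length($tok) > $room) {
                $out .= substr($tok, 0, $room) . '&#8230;';
                last;                       # stop at the cut
            }
            $out .= $tok;
            $seen += length $tok;
        }
    }
    $out .= "</$_>" for reverse @open;      # innermost tag first
    return $out;
}

print truncate_html_sketch(
    '<p>Code is on <a href="https://example.com/">GitHub</a> now.</p>', 15
), "\n";
```

Even this toy version needs a tag stack, a void-element list and special cases, and it still ignores entities, whitespace collapsing and word boundaries, which is a fair argument for reaching for the module instead.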