Re^2: Converting HTML to txt with HTML::Strip

by elef (Friar)
on Oct 04, 2010 at 16:08 UTC

in reply to Re: Converting HTML to txt with HTML::Strip
in thread Converting HTML to txt with HTML::Strip

Well, yes, the BRK tags should be preserved, with the lt and gt character references converted to < and >. Everything that's "in the text", i.e. everything that isn't part of the HTML markup, should stay in.
Frankly, most of your actual code went right over my head. I'm pretty new to perl and programming in general.
I'm not sure what you mean about the numerical entities not being in the file. They are in the original HTML file and should be converted to the appropriate characters, e.g. 336 is the accented letter Ő.
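To make the entity handling concrete, here is a minimal sketch using only core Perl. The helper name decode_basic_entities is made up for illustration, and it only knows numeric references plus four named ones (lt, gt, amp, quot); a real converter would need the full entity table, which I understand HTML::Entities from CPAN provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from any module): decodes numeric character
# references and a handful of named ones in a single pass, so that text
# produced by one replacement is never decoded a second time.
my %named = (lt => '<', gt => '>', amp => '&', quot => '"');

sub decode_basic_entities {
    my ($text) = @_;
    $text =~ s{&(?:\#(\d+)|\#[xX]([0-9A-Fa-f]+)|(lt|gt|amp|quot));}
              { defined $1 ? chr($1)        # decimal, e.g. &#336;
              : defined $2 ? chr(hex $2)    # hexadecimal, e.g. &#x150;
              :              $named{$3} }gex;
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print decode_basic_entities('&lt;BRK&gt; &#336;'), "\n";   # prints "<BRK> Ő"
```

Doing everything in one substitution is a deliberate choice: decoding amp in a separate earlier step would turn "&amp;lt;" into "<" instead of the literal text "&lt;".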
Either way, now I have a solution I'm happy with (the workaround I posted). It's not elegant, but it does everything I want it to so I think I'll stick with it.
By the way, it's pretty surprising that there seems to be no foolproof HTML->txt converter module that would let you just provide a path to an HTML file and spit out a UTF-8 txt with the right line breaks, all the character entities decoded etc.
I.e. instead of the 20 or so lines you and I posted, it should be
#!/usr/bin/perl
use warnings;
use strict;
use HTML::Convert;
HTML::Convert('file.html');
... and you'd get file.txt created in the same folder.
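For what it's worth, a rough sketch of such a wrapper in core Perl could look like the following. Everything here is an assumption for illustration: html_file_to_txt is a made-up name, the input is assumed to be UTF-8, and the tag stripping is naive regex matching that will trip over messy real-world HTML, so this is not a replacement for a proper parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical convenience function, not an existing CPAN API.
# Assumes UTF-8 input and markup simple enough for regex-based stripping.
sub html_file_to_txt {
    my ($html_path) = @_;
    (my $txt_path = $html_path) =~ s/\.html?$/.txt/i
        or die "Expected a .html or .htm file: $html_path\n";

    open my $in, '<:encoding(UTF-8)', $html_path
        or die "Cannot read $html_path: $!\n";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;

    $text =~ s/<!--.*?-->//gs;                           # drop comments
    $text =~ s/<(script|style)\b.*?<\/\1\s*>//gis;       # drop script/style bodies
    $text =~ s/<br\s*\/?>/\n/gi;                         # <br> becomes a line break
    $text =~ s/<\/(?:p|div|h[1-6]|li|tr)\s*>/\n/gi;      # block ends become line breaks
    $text =~ s/<[^>]*>//g;                               # strip remaining tags

    # Decode entities only after stripping, so "&lt;BRK&gt;" survives as text.
    my %named = (lt => '<', gt => '>', amp => '&', quot => '"');
    $text =~ s{&(?:\#(\d+)|\#[xX]([0-9A-Fa-f]+)|(lt|gt|amp|quot));}
              { defined $1 ? chr($1) : defined $2 ? chr(hex $2) : $named{$3} }gex;

    open my $out, '>:encoding(UTF-8)', $txt_path
        or die "Cannot write $txt_path: $!\n";
    print {$out} $text;
    close $out or die "Cannot close $txt_path: $!\n";
    return $txt_path;
}

# Usage: perl html2txt.pl file.html   (writes file.txt next to the input)
html_file_to_txt($ARGV[0]) if @ARGV;
```

The order matters: tags are stripped first and entities decoded afterwards, otherwise a decoded "<BRK>" would be mistaken for markup and thrown away.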

