Re^2: Converting HTML to txt with HTML::Strip
by elef (Friar) on Oct 04, 2010 at 16:08 UTC
Well, yes, the BRK tags should be conserved, with the lt and gt character references converted to < and >. Everything that's "in the text", i.e. everything that isn't part of the HTML markup, should stay in.
Frankly, most of your actual code went right over my head. I'm pretty new to Perl and programming in general.
I'm not sure what you mean about the numerical entities not being in the file. They are in the original HTML file and should be converted to the appropriate characters, e.g. 336 is the accented letter Ő.
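To make the intended behaviour concrete, here is a minimal sketch in core Perl. It is not the workaround from this thread: html_to_text is a hypothetical helper name, the tag removal is a deliberately naive regex, and only three named references are handled. It strips the markup first and decodes character references afterwards, so an escaped BRK tag survives as literal text and 336 becomes Ő.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): strip markup first, then decode
# character references, so an escaped "&lt;BRK&gt;" survives as "<BRK>".
sub html_to_text {
    my ($html) = @_;

    # Naive tag removal; fine for a demo, but not robust against comments,
    # scripts, or ">" inside attribute values.
    (my $text = $html) =~ s/<[^>]*>//g;

    # Numeric references: decimal (&#336;) and hexadecimal (&#x150;).
    $text =~ s/&#(\d+);/chr($1)/ge;
    $text =~ s/&#[xX]([0-9a-fA-F]+);/chr(hex($1))/ge;

    # A few named references; "&amp;" goes last so "&amp;lt;" is decoded once only.
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&amp;/&/g;

    return $text;
}

# Emit UTF-8 so characters above U+00FF print without a "Wide character" warning.
binmode(STDOUT, ':encoding(UTF-8)');
print html_to_text('<p>&lt;BRK&gt; &#336;</p>'), "\n";   # prints "<BRK> Ő"
```

For real input, decode_entities from HTML::Entities covers the full set of named references and is a better choice than the hand-written substitutions above.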
Either way, now I have a solution I'm happy with (the workaround I posted). It's not elegant, but it does everything I want it to so I think I'll stick with it.
By the way, it's pretty surprising that there seems to be no foolproof HTML->txt converter module that would let you just provide a path to an HTML file and spit out a UTF-8 txt file with the right line breaks, all the character entities decoded, etc.
I.e. instead of the 20 or so lines you and I posted, it should be
... and you'd get file.txt created in the same folder.