
Re^2: Converting HTML to txt with HTML::Strip

by elef (Friar)
on Oct 04, 2010 at 16:08 UTC (#863367)

in reply to Re: Converting HTML to txt with HTML::Strip
in thread Converting HTML to txt with HTML::Strip

Well, yes, the BRK tags should be preserved, with the &lt; and &gt; character references converted to < and > (everything that's "in the text", i.e. everything that isn't part of the HTML markup, should stay in).
Frankly, most of your actual code went right over my head. I'm pretty new to perl and programming in general.
I'm not sure what you mean about the numerical entities not being in the file. They are in the original HTML file and should be converted to the appropriate characters, e.g. &#336; is the accented letter Ő.
Either way, now I have a solution I'm happy with (the workaround I posted). It's not elegant, but it does everything I want it to so I think I'll stick with it.
By the way, it's pretty surprising that there seems to be no foolproof HTML->txt converter module that would just let you provide a path to an HTML file and spit out a UTF-8 txt file with the right line breaks, all the character entities decoded, etc.
I.e. instead of the 20 or so lines you and I posted, it should be
#! /usr/bin/perl
use warnings;
use strict;
use HTML::Convert;
HTML::Convert('file.html');
... and you'd get file.txt created in the same folder.
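For what it's worth, something fairly close to that wished-for script can be sketched with HTML::TreeBuilder and HTML::FormatText from CPAN. This is only a rough sketch, not the workaround mentioned above: the helper names html_to_text and convert_file are made up for illustration, the input file is assumed to be UTF-8 encoded, and the 78-column right margin is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: takes a string of HTML, returns plain text.
# The parser decodes character references (e.g. &#336; or &lt;),
# and the formatter turns <br> and <p> into line breaks.
sub html_to_text {
    my ($html) = @_;
    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 78);
    my $text      = $formatter->format($tree);
    $tree->delete;    # free the parse tree
    return $text;
}

# Hypothetical helper: reads file.html and writes file.txt next to it.
# Assumes the input is UTF-8; other encodings would need a different layer.
sub convert_file {
    my ($in) = @_;
    my $out = $in;
    $out =~ s/\.html?\z/.txt/i or $out .= '.txt';

    open my $fh_in, '<:encoding(UTF-8)', $in or die "Cannot read $in: $!";
    my $html = do { local $/; <$fh_in> };    # slurp the whole file
    close $fh_in;

    open my $fh_out, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!";
    print {$fh_out} html_to_text($html);
    close $fh_out or die "Cannot close $out: $!";
    return $out;
}

# Convert every file named on the command line; does nothing without arguments.
convert_file($_) for @ARGV;
```

Saved as, say, html2txt.pl (a made-up name), it would be run as `perl html2txt.pl file.html` and should leave file.txt in the same folder. Note that HTML::FormatText re-wraps paragraphs to the margin, so the line breaks will follow the formatter's wrapping rather than the source file's layout.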
