
Re^2: Converting HTML to txt with HTML::Strip

by elef (Friar)
on Oct 04, 2010 at 16:08 UTC


in reply to Re: Converting HTML to txt with HTML::Strip
in thread Converting HTML to txt with HTML::Strip

Well, yes, the BRK tags should be preserved, with the &lt; and &gt; character references converted to < and > (everything that's "in the text", i.e. everything that isn't part of the HTML markup, should stay in).
Frankly, most of your actual code went right over my head. I'm pretty new to Perl and programming in general.
I'm not sure what you mean about the numerical entities not being in the file. They are in the original HTML file and should be converted to the appropriate characters, e.g. &#336; is the accented letter Ő.
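To make the expected conversion concrete, here is a minimal sketch using `decode_entities` from HTML::Entities. The sample string is made up for illustration; it only shows that a named reference such as &lt; and a numeric one such as &#336; both come out as plain characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Print wide characters such as U+0150 as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

# Made-up sample: an escaped BRK tag plus a numeric entity.
my $raw  = '&lt;BRK&gt; &#336;';

# In scalar context decode_entities returns the decoded copy
# and leaves $raw untouched.
my $text = decode_entities($raw);

print "$text\n";
```

This only decodes entities. It does not remove real markup, so it would have to run on text that has already had its tags stripped.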
Either way, I now have a solution I'm happy with (the workaround I posted). It's not elegant, but it does everything I want it to, so I think I'll stick with it.
By the way, it's pretty surprising that there seems to be no foolproof HTML->txt converter module that would let you just provide a path to an HTML file and spit out a UTF-8 txt with the right line breaks, all the character entities decoded, etc.
I.e. instead of the 20 or so lines you and I posted, it should be as simple as:

#!/usr/bin/perl
use warnings;
use strict;
use HTML::Convert;    # the module I wish existed
HTML::Convert('file.html');
... and you'd get file.txt created in the same folder.
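For what it's worth, a wrapper along those lines could be sketched on top of HTML::Parser. Everything named here is an assumption for illustration: `html_to_text` and `convert_file` are hypothetical helpers, the list of tags that trigger a line break is an arbitrary choice, and the input file is assumed to be UTF-8. It is a rough sketch, not the foolproof converter described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Tags that start a new line in the output (illustrative choice only).
my %BREAK = map { $_ => 1 } qw(br p div li tr h1 h2 h3 h4 h5 h6);

# Hypothetical helper: takes an HTML string, returns plain text.
sub html_to_text {
    my ($html) = @_;
    my $out = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands over text with character entities already decoded.
        text_h  => [ sub { $out .= $_[0] }, 'dtext' ],
        # Emit a newline whenever one of the chosen tags opens.
        start_h => [ sub { $out .= "\n" if $BREAK{ $_[0] } }, 'tagname' ],
    );
    $p->ignore_elements(qw(script style));    # skip script and style bodies
    $p->parse($html);
    $p->eof;
    return $out;
}

# Hypothetical helper: file.html in, file.txt out, same folder.
sub convert_file {
    my ($path) = @_;
    (my $txt_path = $path) =~ s/\.html?\z//i;
    $txt_path .= '.txt';

    # Assumes the input is UTF-8 encoded.
    open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
    my $html = do { local $/; <$in> };
    close $in;

    open my $out, '>:encoding(UTF-8)', $txt_path
        or die "Cannot write $txt_path: $!";
    print {$out} html_to_text($html);
    close $out or die "Cannot close $txt_path: $!";
    return $txt_path;
}

# Usage: perl convert.pl file.html
convert_file( $ARGV[0] ) if @ARGV;
```

Escaped markup such as &lt;BRK&gt; comes through as literal text because the parser treats it as character data, not as a tag. Whitespace handling and the choice of line-breaking tags would still need tuning for real files.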

