
Re: Converting HTML to txt with HTML::Strip

by wfsp (Abbot)
on Oct 03, 2010 at 13:43 UTC

in reply to Converting HTML to txt with HTML::Strip

This uses HTML::TokeParser::Simple (there are many other parsers) and may help get you started. It preserves your <BRK> 'tags'. Is that what you were after?
#! /usr/bin/perl
use warnings;
use strict;

use HTML::Entities;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(q{monk.html})
  or die qq{can't parse HTML};

open my $fh_out, q{>:utf8}, q{out.txt}
  or die qq{can't open file to write: $!};

while (my $t = $p->get_token){
  # a closing </p> or any <br> ends the current output line
  if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})){
    print $fh_out qq{\n};
  }
  elsif ($t->is_text){
    my $out = $t->as_is;
    # trim leading and trailing whitespace
    for ($out){
      s/^\s+//;
      s/\s+$//;
    }
    next unless length $out;  # skip empty text, but keep a literal "0"
    # decode entities, so &lt;BRK&gt; comes out as <BRK>
    print $fh_out decode_entities($out);
  }
}
output (long lines snipped)
JACOBS F&#336;TANÁCSNOK INDÍTVÁNYA<BRK> Az ismertetés napja: 2005. november 17.1(1) C&#8209;371/03. sz. ügy Siegfried Aulinger<BRK> kontra<this should be left in> Bundesrepublik Deutschland 1.<BRK>        Ebben az ügyben az... Európai Gazdasági Közösség közötti... az embargóról szóló rendelet)(2)...
Some numeric entities appear here (in the browser), e.g. &#336;. These aren't in the file.
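
For anyone following along, here is a minimal sketch of the entity decoding step on its own, assuming HTML::Entities is installed. The sample string and the sub name decode_sample are made up for illustration; decode_entities is the same call the script above uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# decode_sample is a hypothetical helper name used only for this illustration.
sub decode_sample {
    my ($raw) = @_;
    # In non-void context decode_entities returns a decoded copy
    # and leaves its argument untouched.
    my $decoded = decode_entities($raw);
    return $decoded;
}

# Illustrative input modelled on the data in this thread (ASCII-only source).
my $raw  = 'F&#336;TAN&Aacute;CSNOK &lt;BRK&gt; C&#8209;371/03';
my $text = decode_sample($raw);

# The decoded string holds characters above 0xFF, so encode on output.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
printf "second character is U+%04X\n", ord substr($text, 1, 1);
```

Numeric references such as &#336; become the character with that code point, and &lt;BRK&gt; comes out as a literal <BRK>, which is why the 'tags' survive in the text output.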

Re^2: Converting HTML to txt with HTML::Strip
by elef (Friar) on Oct 04, 2010 at 16:08 UTC
    Well, yes, the BRK tags should be conserved with the lt and gt character references converted to < and > (everything that's "in the text", i.e. everything that isn't part of the HTML markup should stay in).
    Frankly, most of your actual code went right over my head. I'm pretty new to perl and programming in general.
    I'm not sure what you mean about the numerical entities not being in the file. They are in the original HTML file and should be converted to the appropriate characters, e.g. 336 is the accented letter Ő.
    Either way, now I have a solution I'm happy with (the workaround I posted). It's not elegant, but it does everything I want it to so I think I'll stick with it.
    By the way, it's pretty surprising that there seems to be no foolproof HTML->txt converter module that would let you just provide a path to an HTML file and spit out a UTF-8 txt with the right line breaks, all the character entities decoded etc.
    I.e. instead of the 20 or so lines you and I posted, it should be
    #! /usr/bin/perl
    use warnings;
    use strict;
    use HTML::Convert;
    HTML::Convert(file.html);
    ... and you'd get file.txt created in the same folder.
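
The one-call interface wished for above can be roughly approximated by wrapping the token loop from the parent post in a sub. This is only a sketch under stated assumptions: html_to_text is a hypothetical name (not an existing CPAN function), the input is assumed to be UTF-8 encoded, and HTML::TokeParser::Simple and HTML::Entities must be installed. It is nowhere near foolproof; text inside <script> or <style> elements, for example, would be written out as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use HTML::TokeParser::Simple;

# html_to_text is a hypothetical helper: give it file.html, get file.txt
# written next to it. Returns the path of the text file.
sub html_to_text {
    my ($html_path) = @_;

    # file.html -> file.txt (falls back to appending .txt)
    my $txt_path = $html_path;
    $txt_path =~ s/\.html?\z/.txt/i or $txt_path .= '.txt';

    # Assumption: the input file is UTF-8 encoded.
    open my $in, '<:encoding(UTF-8)', $html_path
        or die "can't read $html_path: $!";
    open my $out, '>:encoding(UTF-8)', $txt_path
        or die "can't write $txt_path: $!";

    my $p = HTML::TokeParser::Simple->new(handle => $in)
        or die "can't parse $html_path";

    while (my $t = $p->get_token) {
        # a closing </p> or any <br> ends the current output line
        if ($t->is_end_tag('p') or $t->is_tag('br')) {
            print {$out} "\n";
        }
        elsif ($t->is_text) {
            my $text = $t->as_is;
            $text =~ s/^\s+//;
            $text =~ s/\s+$//;
            next unless length $text;
            # entities such as &lt; or &#336; become real characters
            print {$out} decode_entities($text);
        }
    }

    close $out or die "can't close $txt_path: $!";
    close $in;
    return $txt_path;
}

# Usage: perl this_script.pl file.html
html_to_text($ARGV[0]) if @ARGV;
```

With that sub saved in a script, calling html_to_text('file.html') would write file.txt to the same folder, which is about as close to the two-line version as the existing modules get.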
