Re: Converting HTML to txt with HTML::Strip

in reply to Converting HTML to txt with HTML::Strip

This uses HTML::TokeParser::Simple (there are many other parsers) and may help get you started. It preserves your <BRK> 'tags'; is that what you were after?

#! /usr/bin/perl

use warnings;
use strict;

use HTML::Entities;
use HTML::TokeParser::Simple;

my $p = HTML::TokeParser::Simple->new(
  q{monk.html},
) or die qq{can't parse HTML: $!};

open my $fh_out, q{>:utf8}, q{out.txt}
  or die qq{can't open file to write: $!};

while (my $t = $p->get_token){
  
  if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})){
    print $fh_out qq{\n};
  }
  elsif ($t->is_text){
    my $out = $t->as_is;
    for ($out){
      s/^\s+//;
      s/\s+$//;
    }
    next unless length $out;  # skip empty chunks, but keep a literal "0"
    print $fh_out decode_entities($out);
  }
  
}

close $fh_out or die qq{can't close file: $!};

output (long lines snipped)

JACOBS
F&#336;TANÁCSNOK INDÍTVÁNYA<BRK>
Az ismertetés napja: 2005. november 17.1(1)
C&#8209;371/03. sz. ügy
Siegfried Aulinger<BRK>
kontra<this should be left in>
Bundesrepublik Deutschland






1.<BRK>        Ebben az ügyben az...
         Európai Gazdasági Közösség közötti...
         az embargóról szóló rendelet)(2)...

Some numeric entities (e.g. &#336;) appear here in the browser; they aren't in the file.
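
If you want to try the loop without monk.html on disk, here is a minimal sketch of the same idea that parses a string and returns the text. The sample HTML and the sub name html_to_text are made up for illustration, and I'm assuming your <BRK> markers are stored as &lt;BRK&gt; in the source (which would explain why they survive as text). It also shows decode_entities turning &#336; into the real character.

```perl
#!/usr/bin/perl

use warnings;
use strict;

use HTML::Entities;
use HTML::TokeParser::Simple;

# Hypothetical sample input. The &lt;BRK&gt; encoding is an assumption
# about how the markers appear in the real file.
my $html = q{<p>F&#336;TAN&Aacute;CSNOK&lt;BRK&gt;</p><p>kontra<br>Bundesrepublik</p>};

# Same token loop as above, but it collects the text in a string
# instead of printing to a file handle.
sub html_to_text {
  my ($source) = @_;

  my $p = HTML::TokeParser::Simple->new(string => $source)
    or die qq{can't parse HTML};

  my $text = q{};
  while (my $t = $p->get_token){
    if ($t->is_end_tag(q{p}) or $t->is_tag(q{br})){
      # paragraph ends and line breaks become newlines
      $text .= qq{\n};
    }
    elsif ($t->is_text){
      my $out = $t->as_is;
      for ($out){
        s/^\s+//;
        s/\s+$//;
      }
      next unless length $out;
      # entities are decoded only after trimming, so &lt;BRK&gt;
      # comes out as a literal <BRK>
      $text .= decode_entities($out);
    }
  }
  return $text;
}

# decoded text can contain wide characters, hence the utf8 layer
binmode STDOUT, q{:utf8};
print html_to_text($html);
```

With that sample it should print three lines: FŐTANÁCSNOK<BRK>, kontra and Bundesrepublik. Returning a string rather than printing inside the loop is just to make it easy to check the result; writing to a :utf8 file handle as in the script above works the same way.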
