PerlMonks  

HTML Tag Remover

by dduncan (Initiate)
on Aug 06, 2000 at 10:05 UTC ( #26410=perlquestion )
dduncan has asked for the wisdom of the Perl Monks concerning the following question:

I am new to Perl and still learning. I am trying to develop a script that will strip HTML tags from an HTML file, leaving raw text. I have gotten to the point where I can open / read / write, but I am not sure which method to pursue for removing the HTML tags. Just looking for a push in the right direction. Thks/Dave

Re: HTML Tag Remover
by btrott (Parson) on Aug 06, 2000 at 10:19 UTC
    Why don't you take a look at HTML::FormatText? It'll strip the HTML formatting, plus it will format the text as it would be formatted in HTML. Pretty nice. From the docs:
        require HTML::TreeBuilder;
        $tree = HTML::TreeBuilder->new->parse_file("test.html");
        require HTML::FormatText;
        $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
        print $formatter->format($tree);
    If you don't want the formatting bits, you might try using HTML::Parser.

    In short, parsing HTML is a tricky thing, and it's best to make use of the already-existing code that was written for this purpose.

(jjhorner)HTML Tag Remover
by jjhorner (Hermit) on Aug 06, 2000 at 10:22 UTC

    Have you ever looked at HTML::Parser? Try 'perldoc HTML::Parser' or look on CPAN for it.

    Objects of the HTML::Parser class will recognize markup and separate it from plain text (alias data content) in HTML documents. As different kinds of markup and text are recognized, the corresponding event handlers are invoked.
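To make the event-handler idea concrete, here is a minimal sketch of a tag stripper built on HTML::Parser. It assumes version 3 of the module is installed from CPAN; the helper name `strip_tags` and the sample input are made up for illustration and are not from the module's docs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not from the thread): return the text content of an
# HTML string, discarding all markup.
sub strip_tags {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # The text handler fires for each run of plain text; "dtext"
        # passes it with entities such as &amp; already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    # Skip the contents of <script> and <style> elements entirely.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;    # flush any text the parser is still buffering
    return $text;
}

print strip_tags('<p>Hello, <b>world</b> &amp; friends</p>'), "\n";
```

Only a text handler is registered, so tags and comments never reach the output. To read from a file instead of a string, `$parser->parse_file($path)` can replace the `parse`/`eof` pair.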

    I would recommend against rolling your own, if you are new.

    J. J. Horner
    Linux, Perl, Apache, Stronghold, Unix
    jhorner@knoxlug.org http://www.knoxlug.org/
    
Re: HTML Tag Remover
by lolindrath (Scribe) on Aug 06, 2000 at 20:12 UTC
    This is how I did it without a module, I think it will work for what you need to do.
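        #!/usr/bin/perl -w
        # "or" binds looser than "||", so die actually fires when open fails
        open FILE, "c:\\html\\vb\\index.html" or die "can't open file: $!";
        @text = <FILE>;
        $text = join( "", @text );   # slurp the whole file into one string
        close FILE;
        #print $text;
        $text =~ s/(\<(.*?)\>)//sg;  # remove everything from "<" to the next ">"
        print $text;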

    I tried this on several of my HTML files. You need the s option at the end of the substitution so that "." also matches newlines; otherwise it will not remove tags that span multiple lines, such as the comments wrapped around JavaScript.
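A small sketch of what the s option changes, using a made-up string with a comment that spans several lines (the variable names are only for illustration):

```perl
use strict;
use warnings;

# Made-up input: an HTML comment that spans several lines.
my $html = "before<!--\n  a comment spanning\n  two lines\n-->after\n";

(my $without_s = $html) =~ s/<.*?>//g;   # "." cannot cross a newline
(my $with_s    = $html) =~ s/<.*?>//sg;  # "." may cross newlines

print "without /s: $without_s";
print "with /s:    $with_s";
```

Without the s option the pattern never finds a ">" on the same line as the "<", so the comment is left in place; with it, the whole comment is removed.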

    --=Lolindrath=--
      That wouldn't work for html such as
      <img src="whatever.gif" alt=">>>Click Here<<<">
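A short sketch of why that tag trips up the regex: the non-greedy match ends at the first ">" it sees, even when that ">" sits inside a quoted attribute value.

```perl
use strict;
use warnings;

my $html = '<img src="whatever.gif" alt=">>>Click Here<<<">';

# Same substitution as in the post above, applied to a copy of the string.
(my $stripped = $html) =~ s/<.*?>//sg;

print "$stripped\n";   # prints ">>Click Here"
```

The first match stops at the first ">" of the alt text, and the second match swallows the "<<<" through the closing ">", so part of the attribute value leaks into the output.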
        Ok, I added this line before the other regex and it seemed to work, though it is a little specific to that problem. It simply removes any run of more than one pointy bracket. If you want to keep these in, you can always replace them with some placeholder character and swap the pointy brackets back in after the HTML tag stripping is done. This is the revised code:
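        #!/usr/bin/perl -w
        # "or" binds looser than "||", so die actually fires when open fails
        open FILE, "c:\\html\\test.html" or die "can't open file: $!";
        @text = <FILE>;
        $text = join( "", @text );
        close FILE;
        #print $text;
        # <-- Added this line: drop runs of two or more ">".
        # (It was written as />[>+]/, where [>+] is a character class
        # matching a single ">" or "+", not "one or more".)
        $text =~ s/>>+//g;
        $text =~ s/\<(.*?)\>//sg;
        print $text;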


        --=Lolindrath=--
