PerlMonks  

Search and replace everything except html tags

by thatguy (Parson)
on Jun 06, 2000 at 09:13 UTC ( [id://16557]=perlquestion )

thatguy has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, I'm playing around with trying to replace text in html while not breaking the html.
My problem is that I know how to replace text in all of the html file, but I can't figure out how to exclude the html tags (like <font>).

I've tried the following on a local file and was hoping someone could point me in the right direction. (temp.html has each tag and each text phrase on its own line.)

    @open = `cat temp.html`;
    foreach (@open) {
        $_ =~ s/\n//ig;
        if ( "$_" eq "<(.*)>" ) {
            print "html: $_\n";
        } else {
            print "text: $_\n";
        }
    }
    exit;

Yup, that ignores the <(.*)> and also <*>. Any ideas where I am going wrong?

Replies are listed 'Best First'.
Re: Search and replace everything except html tags
by Corion (Patriarch) on Jun 06, 2000 at 11:34 UTC

    eq does not work with regular expressions; it only tests for an exact string match:

        my $a = "foo";
        my $b = "bar";
        my $c = ".*";
        print "eq" if ($a eq $a);     # prints "eq"
        print "ne" if ($a ne $b);     # prints "ne"
        print "ne" if ($a ne $a);     # prints nothing
        print "RE" if $a =~ /$c/;     # prints "RE"
        print "RE" if $a =~ /f.*/;    # prints "RE"

    What you probably wanted was something along these lines (tested :) ):

        #!/usr/bin/perl -w
        use strict;

        my $filename = $ARGV[0] || "temp.html";
        my $open;
        undef $/;    # undefine the line separator
        open( FILE, $filename ) or die "Couldn't open $filename : $!\n";
        $open = <FILE>;    # This slurps the whole file into one scalar (instead of an array)
        close FILE;

        # I'll take a simplistic approach that assumes that
        # the only place where a ">" occurs is at the end of
        # a tag. This does fail when you have for example :
        #   <IMG src="less.png" alt="a > b">
        # which is valid HTML from what I know.
        # I also ignore scripts and comment handling.
        while ($open) {
            # Match text into $1 and, if a tag follows, the tag into $2:
            $open =~ s/^([^<]+)?(<[^>]+>)?//;
            # If neither part matched (e.g. a stray "<"), stop rather than loop forever
            last unless defined $1 or defined $2;
            print "Text : $1\n" if $1;
            print "HTML: $2\n" if $2;
        };

        # the real meat of the code is the "s///;" line
        # it works as follows :
        # The two parenthesized parts capture stuff,
        # the first parentheses capture non-tagged text
        # the second parentheses capture text that is
        # within "<" and ">"
        # one or both of the parentheses are allowed to be empty
        # Everything that is found is deleted from the start of
        # the string.
        # repeat as long as there is stuff in the slurped line
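
    To go from listing tags and text to actually replacing text, one option is to split the document on tags and run the substitution only on the pieces in between. The following is a minimal sketch under the same simplifying assumption as the code above (no ">" inside a tag). The helper name replace_outside_tags and its arguments are hypothetical, chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace the literal string $from with $to in text,
# but leave anything that looks like <...> untouched.
sub replace_outside_tags {
    my ($html, $from, $to) = @_;

    # The capturing group makes split return the tags as well, so the
    # list alternates: text, tag, text, tag, ...
    my @chunks = split /(<[^>]*>)/, $html;

    for my $i (0 .. $#chunks) {
        next if $i % 2;                       # odd positions hold the tags
        $chunks[$i] =~ s/\Q$from\E/$to/g;     # \Q...\E: treat $from literally
    }
    return join '', @chunks;
}

# The tag name "font" survives; only the visible text changes.
print replace_outside_tags('<font color="red">red font</font>', 'font', 'text'), "\n";
# prints: <font color="red">red text</font>
```

    Like any regex-only approach, this will trip over comments, scripts and quoted ">" characters inside attributes.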

    Of course, everything above could probably be done more correctly with one of the HTML modules, such as HTML::Parser, so you may want to take a look at them. takshaka has mentioned a previous discussion of this topic where he posted a working example that uses HTML::Parser (a direct link is here).
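
    A minimal sketch of what the HTML::Parser route might look like, assuming the module is installed from CPAN and using its handler-based (version 3) interface. The helper name replace_text_only is hypothetical. Text events get the substitution; every other event (tags, comments, declarations) is copied through as it appeared in the source.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;    # assumption: installed from CPAN, not a core module

# Hypothetical helper: replace the literal $from with $to in text only.
sub replace_text_only {
    my ($html, $from, $to) = @_;
    my $out = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Text between tags: substitute, unless it is the body of
        # <script>, <style> and similar literal elements.
        text_h => [
            sub {
                my ($text, $is_cdata) = @_;
                $text =~ s/\Q$from\E/$to/g unless $is_cdata;
                $out .= $text;
            },
            'text, is_cdata',
        ],
        # Everything else: append the original source text unchanged.
        default_h => [ sub { $out .= $_[0] if defined $_[0] }, 'text' ],
    );
    $p->unbroken_text(1);    # deliver each run of text in one piece
    $p->parse($html);
    $p->eof;
    return $out;
}

print replace_text_only('<font color="red">red font</font>', 'font', 'text'), "\n";
```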

    For more information about regular expressions read the perlre manpage.

Re: Search and replace everything except html tags
by takshaka (Friar) on Jun 06, 2000 at 11:46 UTC
    See Substitution outside of HTML TAGS for a previous answer to this question.

        @open = `cat temp.html`;

    Perl has perfectly good functions for opening and reading files that don't require you to fork another shell.

        foreach (@open) { $_ =~ s/\n//ig;

    You don't need the '$_ =~' here. Substitution operates on $_ by default. Also, this is an occasion where tr/// would be a better choice of operator.

        if ( "$_" eq "<(.*)>" ) {

    You need a regex here instead of stringwise equality, but even then it won't do what you expect. Also, don't get into the bad habit of quoting scalars. If you want $_, just say $_, not "$_".
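
    As an illustration of the tr/// suggestion: with the /d modifier, tr deletes the listed characters outright and works on $_ by default, so no regex is involved. The sample lines below are made up and stand in for the contents of temp.html.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample lines standing in for the contents of temp.html.
my @lines = ("<font>\n", "some text\n");

for (@lines) {
    tr/\n//d;    # delete every newline in $_; tr returns how many it removed
}

print join('|', @lines), "\n";    # prints "<font>|some text"
```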

    Here's one way to do this with regular expressions. Regexes, however, invariably fail on "real world" HTML. The preferred method is to use HTML::Parser or its derivatives.

        #!/usr/bin/perl -w
        use strict;

        open HTML, "temp.html" or die "Can't open file: $!\n";
        {
            local $/;       # slurp the whole file
            $_ = <HTML>;
        }
        close HTML;
        tr/\n / /s;         # squeeze runs of newlines and spaces into one space
        while ( /([^<>]*)(<[^>]*>)?/g ) {
            print "TEXT: $1\n" if length $1;     # skip empty text matches
            print "HTML: $2\n" if defined $2;
        }
Re: Search and replace everything except html tags
by mojotoad (Monsignor) on Jun 06, 2000 at 14:39 UTC

    Okay, there have been references to HTML::Parser here, but they have all avoided the "easy" solution with that route. Most fall into traps and do not fully address your original question. The original question, as I understand it, involves targeting specific text segments within a certain zone in an HTML document, ignoring embedded HTML tags within that zone, and then ceasing extraction beyond that target zone in the HTML document.

    HTML::Parser has recently evolved, and I invite you to check out the new syntax, but for the time being I will stick to the "old" syntax, which is still quite compatible with the current release.

    Use the start() and end() callback methods to keep track of your context. Use the text() method to collect text segments into a cache until end() sees the tag that closes the zone. At that point end() also dumps the cache, because the scan of that zone is complete.

    This simultaneously keeps track of your context, as well as neatly extracting your text from further embedded tags. Regexp be damned -- they only need to be applied to the cache result, not the HTML.
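
    The idea can be sketched as follows. Note that this sketch uses the newer handler-based (version 3) interface rather than the older subclassing syntax, and assumes HTML::Parser is installed from CPAN. The start, text and end handlers play the roles of the callback methods described above. The helper name text_in_zone and the sample markup are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;    # assumption: installed from CPAN, not a core module

# Hypothetical helper: return the text found inside each <$zone>...</$zone>,
# with any tags nested inside the zone dropped.
sub text_in_zone {
    my ($html, $zone) = @_;
    my ($depth, $cache, @found) = (0, '');

    my $p = HTML::Parser->new(
        api_version => 3,
        # Entering the zone (nested zones just raise the depth).
        start_h => [ sub { $depth++ if $_[0] eq $zone }, 'tagname' ],
        # Collect text only while inside the zone.
        text_h  => [ sub { $cache .= $_[0] if $depth }, 'dtext' ],
        # Leaving the outermost zone: dump the cache.
        end_h   => [
            sub {
                return unless $_[0] eq $zone and $depth;
                if ( --$depth == 0 ) {
                    push @found, $cache;
                    $cache = '';
                }
            },
            'tagname',
        ],
    );
    $p->parse($html);
    $p->eof;
    return @found;
}

my @cells = text_in_zone('<p>skip</p><td>Hello <b>big</b> world</td>', 'td');
print "$_\n" for @cells;    # prints "Hello big world"
```

    Any regex work can then be done on the returned strings rather than on the HTML itself.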

    I am not involved with the development of HTML::Parser, but I do use the module extensively.

    Mojotoad

Re: Search and replace everything except html tags
by Anonymous Monk on Jun 06, 2000 at 11:53 UTC
    1. You cannot do this line by line as there may be some html on the same line as some text.
    2. eq does not use perl Regexp
    That being said, here is a script that will do what you want. It is completely horrible as it is 2am, but it will start you down the correct path. I'm sure others around here will have a more concise way of doing it.

        open IN_FILE, "<temp.html" or die "Can't open temp.html: $!\n";
        while (<IN_FILE>) {
            chomp;    # avoid printing a doubled newline for text-only lines
            my $text = $_;
            my ($before, $tag, $after);
            # Test length rather than truth so a leftover "0" is not dropped
            while ( defined $text and length $text ) {
                if ( ($before, $tag, $after) = ( $text =~ /(.*?)<(.*?)>(.*)/ ) ) {
                    print "text: $before\n" unless $before eq "";
                    print "html: $tag\n";
                    $text = $after;
                } else {
                    print "text: $text\n";
                    $text = undef;
                }
            }
        }
        close IN_FILE;
    Cheers!
    ---
    crulx
    crulx@iaxs.net
Re: Search and replace everything except html tags
by athomason (Curate) on Jun 06, 2000 at 11:40 UTC
    Well, "$_" eq "<(.*)>" will only match if $_ equals that string, i.e. you're not doing a regexp match there. If you don't want to reinvent the wheel, take a look at HTML::Parser. If you do, and you're positive that each line contains exclusively a tag or line of text, you could do something like this to grab just the tags:
        use strict;
        open FILE, "<temp.html";
        my @tags = grep /^<.*>$/, <FILE>;
        print @tags;
    If you actually want to separate into tags and text:
        use strict;
        open FILE, "<temp.html";
        my (@tags, @text);
        while (<FILE>) {
            chomp;
            if (/^<.*>$/) {
                push @tags, $_;
            } else {
                push @text, $_;
            }
        }
        print "Tags:\n", join "\n", @tags;
        print "\nText:\n", join "\n", @text;
