Re: Search and replacing across 500,000 HTML documents

looks like I'm on the right path, they don't have a way to manipulate plain text in an HTML file -- while still preserving the HTML structure...

I don't know what you've been doing, but you most certainly can. There is an example in the node "(crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?".
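For readers who can't follow the link, here is a minimal sketch of the text-node approach. It is not the code from that node: the helper name link_name_in_text is made up for illustration, and it assumes HTML::TreeBuilder is installed from CPAN. It only links names that sit wholly inside one text node.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module; also loads HTML::Element

# Hypothetical helper (not from the linked node): link every whole-word
# occurrence of $name found in a text node, leaving the markup alone.
sub link_name_in_text {
    my ( $html, $name, $url ) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Turn plain text segments into "~text" pseudo-elements so that
    # look_down can find them like any other node.
    $tree->objectify_text;

    for my $node ( $tree->look_down( _tag => '~text' ) ) {
        # Leave text inside these containers untouched.
        next if grep { $_->tag =~ /^(?:title|style|script)$/ } $node->lineage;

        my $text = $node->attr('text');
        next unless defined $text and $text =~ /\b\Q$name\E\b/;

        # split with a capture group puts the matched name at odd indexes.
        my @parts = split /\b(\Q$name\E)\b/, $text;
        my @new;
        for my $i ( 0 .. $#parts ) {
            next unless length $parts[$i];
            if ( $i % 2 ) {
                my $link = HTML::Element->new( 'a', href => $url );
                $link->push_content( $parts[$i] );
                push @new, $link;
            }
            else {
                push @new, $parts[$i];
            }
        }
        $node->replace_with(@new)->delete;
    }

    $tree->deobjectify_text;
    my $out = $tree->as_HTML;
    $tree->delete;
    return $out;
}

print link_name_in_text(
    '<html><title>PodMaster</title><body>Hi there PodMaster</body></html>',
    'PodMaster',
    'http://perlmonks.org/?node=PodMaster',
), "\n";
```

The parser normalizes the document (it adds an implied head, for instance), so the output will not be byte-for-byte identical to the input outside the linked text.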

Also, a regex is not completely out of the question, something like:

use strict;
use warnings;
my $name = 'PodMaster';
my $url  = 'http://perlmonks.org/?node=PodMaster';
my $html = q~
<html>
<title> PodMaster </title>
<style>
PodMaster { }
</style>
<body>
<h1>PodMaster
</h1>
Hi there PodMaster blah blah blah <b>Pod</b><i>Master</i>
</body>
</html>
~;


print $/, untag_MOD( $html, $name, $url ), $/;

#http://perlmonks.org/?node_id=161281  modified for our purposes
sub untag_MOD {
  local $_ = $_[0] || $_;
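# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags that requires a corresponding
#           end tag, plus everything up to that end tag
#       or if \s or = is followed by a quote
#           then skip to the matching closing quote
#           else [^>]
#   >
# Capture 1 now wraps the entire "tag", so add 1 to every group number
# quoted in the comments below (comment "(4)" is capture 5, and so on).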
  s{
  ( # podmaster
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of start tags
          SCRIPT |  #     for which
          APPLET |  #     must be skipped
          OBJECT |  #     all content
          STYLE     #     to correspond
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(5)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(2)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(3)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(4)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\4)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   )
   ([^<]*) # 6, text
   }
   '
       my $ret = $1;
       if( length $6 ){   # length test, so a text run of "0" is kept
           my $text = $6;
           $text =~ s~\b(\Q$_[1]\E)\b~<a href="$_[2]">$1</a>~g;  # add link
           $ret .= $text;
       }
       $ret;
   'gsxe;
  return $_ ? $_ : "";
}


__END__

Note the caveats in the node "strip HTML tags". Another potential caveat (I wouldn't consider it one) is that neither of these turns <b>Pod</b><i>Master</i> into a link, because the name is split across two elements. If you want to do that, you should use HTML::TreeBuilder.
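To see why the split name is missed, here is a rough sketch using only core Perl. The tokenizer is deliberately cruder than the regex above (it assumes no ">" inside attribute values or comments), and the helper name link_text_segments is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split the input into tags and text runs, then
# substitute only inside the text runs. Simplified on purpose; it does
# not skip TITLE/STYLE/SCRIPT content the way untag_MOD does.
sub link_text_segments {
    my ( $html, $name, $url ) = @_;
    my $out = '';
    for my $piece ( split /(<[^>]*>)/, $html ) {
        if ( $piece !~ /^</ ) {
            # Each text run is searched on its own, so "Pod" and
            # "Master" in adjacent elements are never seen together.
            $piece =~ s{\b(\Q$name\E)\b}{<a href="$url">$1</a>}g;
        }
        $out .= $piece;
    }
    return $out;
}

print link_text_segments(
    'Hi PodMaster, <b>Pod</b><i>Master</i>',
    'PodMaster',
    'http://perlmonks.org/?node=PodMaster',
), "\n";
```

The plain "PodMaster" gets wrapped in a link, while <b>Pod</b><i>Master</i> comes back unchanged.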

