PerlMonks  

Re: Search and replacing across 500,000 HTML documents

by PodMaster (Abbot)
on Apr 22, 2004 at 10:25 UTC ( [id://347293]=note )


in reply to Search and replacing across 500,000 HTML documents

looks like I'm on the right path, they don't have a way to manipulate plain text in an HTML file -- while still preserving the HTML structure...
I don't know what you've been doing, but you most certainly can. There is an example at (crazyinsomniac) Re: Is this the best way to use HTML::TreeBuilder to bold text in an HTML document?.

Also, a regex is not completely out of the question, something like this:

use strict;
use warnings;

my $name = 'PodMaster';
my $url  = 'http://perlmonks.org/?node=PodMaster';
my $html = q~
<html>
<title> PodMaster </title>
<style> PodMaster { } </style>
<body>
<h1>PodMaster </h1>
Hi there PodMaster blah blah blah
<b>Pod</b><i>Master</i>
</body>
</html>
~;

print $/, untag_MOD( $html, $name, $url ), $/;

#http://perlmonks.org/?node_id=161281 modified for our purposes
sub untag_MOD {
    local $_ = $_[0] || $_;
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags which require correspond
#           end tag plus all to end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
# 1 is the entire "tag", add +1 to all numbers in comments
    s{
        (                         # podmaster
        <                         # open tag
        (?:                       # open group (A)
          (!--) |                 #   comment (1) or
          (\?) |                  #   another comment (2) or
          (?i:                    #   open group (B) for /i
            ( TITLE  |            #     one of start tags
              SCRIPT |            #     for which
              APPLET |            #     must be skipped
              OBJECT |            #     all content
              STYLE               #     to correspond
            )                     #     end tag (3)
          ) |                     #   close group (B), or
          ([!/A-Za-z])            #   one of these chars, remember in (4)
        )                         # close group (A)
        (?(5)                     # if previous case is (4)
          (?:                     #   open group (C)
            (?!                   #     and next is not : (D)
              [\s=]               #       \s or "="
              ["`']               #       with open quotes
            )                     #     close (D)
            [^>] |                #     and not close tag or
            [\s=]                 #     \s or "=" with
            `[^`]*` |             #     something in quotes ` or
            [\s=]                 #     \s or "=" with
            '[^']*' |             #     something in quotes ' or
            [\s=]                 #     \s or "=" with
            "[^"]*"               #     something in quotes "
          )*                      #   repeat (C) 0 or more times
        |                         # else (if previous case is not (4))
          .*?                     #   minimum of any chars
        )                         # end if previous char is (4)
        (?(2)                     # if comment (1)
          (?<=--)                 #   wait for "--"
        )                         # end if comment (1)
        (?(3)                     # if another comment (2)
          (?<=\?)                 #   wait for "?"
        )                         # end if another comment (2)
        (?(4)                     # if one of tags-containers (3)
          </                      #   wait for end
          (?i:\4)                 #   of this tag
          (?:\s[^>]*)?            #   skip junk to ">"
        )                         # end if (3)
        >                         # tag closed
        )
        ([^<]*)                   # 6, text
    }
    '
        my $ret = $1;             # save the tag before the inner s/// resets $1
        if( length $6 ){          # length, not truth, so a text run of "0" is kept
            my $text = $6;
            $text =~ s~\b(\Q$_[1]\E)\b~<a href="$_[2]">$1</a>~g; # add + link
            $ret .= $text;
        }
        $ret;
    'gsxe;
    return $_ ? $_ : "";
}
__END__
Note the caveats in the node "strip HTML tags". Another potential caveat (I wouldn't consider it one) is that neither of these translates <b>Pod</b><i>Master</i> into a link. If you want to do that, you should use HTML::TreeBuilder.
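The core idea in untag_MOD (match a tag, then rewrite only the text that follows it) can be seen in a much smaller sketch. This is not the code above and not a replacement for it: link_text_only is a hypothetical helper that treats anything matching <[^>]*> as a tag, so it does not handle comments, a ">" inside a quoted attribute, or the contents of TITLE/SCRIPT/STYLE the way untag_MOD does. It uses only core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: link whole-word occurrences
# of $name in the text between tags and leave the tags themselves alone.
# Assumption: every tag matches <[^>]*>, which is NOT true of real-world
# HTML (comments, quoted ">" in attributes, script/style bodies).
sub link_text_only {
    my ( $html, $name, $url ) = @_;

    # The capturing group makes split return the tags as well as the
    # text between them, so the list holds both kinds of chunk.
    my @chunks = split /(<[^>]*>)/, $html;

    for my $chunk (@chunks) {
        next if $chunk =~ /^</;    # a tag: leave it untouched
        $chunk =~ s{\b(\Q$name\E)\b}{<a href="$url">$1</a>}g;
    }
    return join '', @chunks;
}

my $out = link_text_only(
    '<h1>PodMaster</h1> Hi PodMaster <b>Pod</b><i>Master</i>',
    'PodMaster',
    'http://perlmonks.org/?node=PodMaster',
);
print "$out\n";
```

The name inside the h1 and the one in the running text get wrapped in a link, while <b>Pod</b><i>Master</i> comes through unchanged, because the name never appears whole inside a single text chunk. That is the same caveat noted above.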

