Editing HTML files

spivey49 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Editing HTML files by Tanktalus (Canon) on Jul 08, 2008 at 21:15 UTC
From a quick perusal of HTML::Manipulator, it seems to presuppose that your HTML is in a certain format: either with id attributes (which are unique by definition) or with comment markers. If you have that, great. If not, and you can't get there, then you'll have to look for another way. Generally, I can get away with making all my HTML actually XHTML-compliant, which means I can whip out my favourite Swiss-army knife: XML::Twig. If your HTML is already XHTML, then this becomes really easy. The complicated part will be coming up with the XPath, but that shouldn't be too hard ... something like `//div[string()="nothing"]` is my guess. The get_xpath method will return all the matching element objects at once, and you can just loop through them, change the text of each one (set_text), and print it all back out. Otherwise, you'll probably have to roll your own with HTML::Parser...
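A minimal sketch of the XML::Twig approach described above, assuming the input is well-formed XHTML. The sample markup and the replacement text are made up for illustration; only `get_xpath`, `set_text` and the XPath guess come from the post.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical XHTML input; in practice this would be read from a file.
my $xhtml = <<'END';
<html>
<body>
<div align="center">nothing</div>
<div align="center">keep me</div>
</body>
</html>
END

my $replacement = 'something';    # assumed replacement text

my $twig = XML::Twig->new;
$twig->parse($xhtml);

# Find every div whose text is exactly "nothing" and overwrite it.
for my $div ($twig->get_xpath('//div[string()="nothing"]')) {
    $div->set_text($replacement);
}

my $result = $twig->sprint;
print $result;
```

This only works if the parser accepts the file, so plain HTML with unclosed tags would make `parse` die before any editing happens.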
Re^2: Editing HTML files by spivey49 (Monk) on Jul 08, 2008 at 21:33 UTC
Thanks Tanktalus. The problem with XHTML is that the files are generated as HTML and get overwritten by another process. The problem with regular expressions is that the tag values are unpredictable. I saw the same issue when glancing over the HTML::Manipulator docs. I was hoping I had missed something, or that someone might have seen this tag format before. I'll give HTML::Parser a shot.
Re^3: Editing HTML files by pc88mxer (Vicar) on Jul 08, 2008 at 21:49 UTC
"The problem with regular expressions is the tag values are unpredictable."

Well, in that case a proper parser is a better approach.
Re^3: Editing HTML files by Lawliet (Curate) on Jul 08, 2008 at 21:56 UTC
If the tag values are unpredictable, use metacharacters. Then again, depending on the HTML files you're editing, it may not work.

`<(^.^-<) <(-^.^<) <(-^.^-)> (>^.^-)> (>-^.^)>`
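A rough sketch of what "use metacharacters" could look like, assuming each target div carries a known class attribute and contains only plain text. The sample markup, the class name and the replacement text are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: the text inside the div is not known in advance.
my $html = <<'END';
<div align="center" class="class1">some unpredictable text</div>
<div align="center" class="class2">other text</div>
END

my $replacement = 'something';    # assumed replacement text

# [^>]* skips any other attributes in the opening tag, and [^<]* matches
# whatever text (possibly none) sits between the tags.
(my $edited = $html) =~ s{(<div\b[^>]*\bclass="class1"[^>]*>)[^<]*(</div>)}
                         {$1$replacement$2}g;

print $edited;
```

As the caveat above suggests, this breaks as soon as the div contains nested tags or the attribute is quoted differently, which is where a parser becomes the safer choice.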
Re: Editing HTML files by pc88mxer (Vicar) on Jul 08, 2008 at 21:06 UTC
It sounds like you could do this with a regular expression.

```perl
use strict;
use warnings;
use File::Slurp;

sub fix_file {
    my ($path, $replacement) = @_;    # replacement string supplied by the caller
    my $html = read_file($path);

    # Bind the substitution to $html with =~.
    # Add the g modifier if the div can occur more than once.
    if ($html =~ s{<div align="center">nothing</div>}
                  {<div align="center">$replacement</div>}) {
        write_file($path, $html);
    }
}
```
Re^2: Editing HTML files by GertMT (Hermit) on Jul 08, 2008 at 21:09 UTC
Or make a backup just to be sure and then, while in the directory with the HTML files, maybe: `perl -pi -e 's/find/replace/g' *.html`
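The backup can also be left to Perl itself: giving `-i` a suffix, as in `perl -pi.bak -e 's/find/replace/g' *.html`, keeps each original next to the edited file. Below is a sketch of the same idea written as a script, using a throwaway file with made-up content so nothing real is touched.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Set up a throwaway directory with one sample file (hypothetical content).
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/index.html";
open my $out, '>', $file or die "Can't write $file: $!";
print {$out} qq{<div align="center">find</div>\n};
close $out;

{
    # Equivalent of: perl -pi.bak -e 's/find/replace/g' index.html
    local $^I   = '.bak';     # keep the original as index.html.bak
    local @ARGV = ($file);
    while (<>) {
        s/find/replace/g;
        print;                # goes to the edited file, not the terminal
    }
}
```

After the loop, `index.html` holds the edited text and `index.html.bak` holds the original.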
Re: Editing HTML files by wfsp (Abbot) on Jul 09, 2008 at 06:21 UTC
Here's my go with HTML::TreeBuilder. The docs discuss inline editing under `$h->content_refs_list`. You may have to adjust depending on how "unpredictable" the HTML is.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_file(*DATA)
  or die qq{cant build tree\n};

# Locate the div to edit by tag name and class.
my $noclass = $root->look_down(
  _tag  => q{div},
  class => q{noclass},
);
die qq{noclass not found\n} unless $noclass;

my $replaced;
for my $item_r ($noclass->content_refs_list) {
  next if ref ${$item_r};    # skip child elements, edit text nodes only
  ${$item_r} = lookup_replacement(${$item_r});
  $replaced++;
}
die qq{no replace\n} unless $replaced;

my $html = $root->as_HTML(undef, qq{ }, {});
print qq{$html\n};

sub lookup_replacement {
  my $lookup = shift;
  # find replacement
  return q{something};
}

__DATA__
<html>
<head>
<title>tb_test</title>
</head>
<body>
<div align="center" class="noclass">nothing</div>
</body>
</html>
```

Output:

```
<html>
<head>
<title>tb_test</title>
</head>
<body>
<div align="center" class="noclass">something</div>
</body>
</html>
```
Re^2: Editing HTML files by spivey49 (Monk) on Jul 10, 2008 at 21:41 UTC
Thanks wfsp! Works perfectly.
Re^2: Editing HTML files by spivey49 (Monk) on Jul 14, 2008 at 16:32 UTC
So this code works perfectly until I come across a blank tag. Any suggestions on how to handle a situation where the tag exists but has no text? Currently, when the script comes across this situation, it dies instead of inserting the new text. Here's the code with sample HTML:

Code:

```perl
# Find the index.txt config files
# and edit the index.htm files one dir up from the config file
use strict;
use warnings;
use HTML::TreeBuilder;
use File::Find;

my $dir = $ARGV[0];
my $html_file;
my $cfg_path;
my %index_pref;

find(\&file_finds, $dir);

sub file_finds {
    if ($_ =~ /index\.htm/) {
        $html_file = $File::Find::name;
        $cfg_path  = $File::Find::dir . "/index/index.txt";
        config();
        edit();
    }
}

sub config {
    open(my $cf, '<', $cfg_path) or die "Can't open $cfg_path $!";
    while (<$cf>) {
        chomp;                  # no newline
        s/#.*//;                # no comments
        s/^\s+//;               # no leading white
        s/\s+$//;               # no trailing white
        next unless length;     # anything left?
        my ($var, $value) = split(/\s*=\s*/, $_, 2);
        $index_pref{$var} = $value;
    }
    close $cf;
}

sub edit {
    my $root = HTML::TreeBuilder->new_from_file($html_file)
        or die qq{cant build tree\n};

    my $class1 = $root->look_down(
        _tag  => q{div},
        class => q{class1},
    );
    die qq{class1 not found\n$html_file\n} unless $class1;

    my $class2 = $root->look_down(
        _tag  => q{div},
        class => q{class2},
    );
    die qq{class2 not found\n$html_file\n} unless $class2;

    my $class3 = $root->look_down(
        _tag  => q{div},
        class => q{class3},
    );
    die qq{class3 not found\n$html_file\n} unless $class3;

    my $rep_class1;
    for my $item_r ($class1->content_refs_list) {
        next if ref ${$item_r};
        ${$item_r} = $index_pref{"class1"};
        $rep_class1++;
    }
    die qq{Class1 not replaced\n$html_file\n} unless $rep_class1;

    my $rep_class2;
    for my $item_r ($class2->content_refs_list) {
        next if ref ${$item_r};
        ${$item_r} = $index_pref{"class2"};
        $rep_class2++;
    }
    die qq{Class2 not replaced\n$html_file\n} unless $rep_class2;

    my $rep_class3;
    for my $item_r ($class3->content_refs_list) {
        next if ref ${$item_r};
        ${$item_r} = $index_pref{"class3"};
        $rep_class3++;
    }
    die qq{Class3 not replaced\n$html_file\n} unless $rep_class3;

    my $html = $root->as_HTML(undef, qq{ }, {});

    open(my $fh, '>', $html_file) or die $!;
    print {$fh} $html;
    close $fh;
}
```

HTML:

```html
<!--Class 3 has no text-->
<div align="center" class="class1">something</div>
<div align="center" class="class2">something else</div>
<div align="center" class="class3"></div>
```
Re^3: Editing HTML files by wfsp (Abbot) on Jul 15, 2008 at 07:47 UTC
Have a look at `$h->splice_content(...)`. The cut-down example below inserts new text immediately after the opening div tag (if a text element is not found).

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = do { local $/; <DATA> };
my $replace = q{replaced};

my $edited = edit($html, $replace);
print $edited;

sub edit {
  my $html    = shift;
  my $replace = shift;

  # my $root = HTML::TreeBuilder->new_from_file($html_file)
  #   or die qq{cant build tree\n};
  my $root = HTML::TreeBuilder->new_from_content($html)
    or die qq{cant build tree\n};

  my $class3 = $root->look_down(
    _tag  => q{div},
    class => q{class3},
  );
  die qq{class3 not found\n} unless $class3;

  my $rep_class3;
  for my $item_r ($class3->content_refs_list) {
    next if ref ${$item_r};
    ${$item_r} = $replace;
    $rep_class3++;
  }
  if (not $rep_class3) {
    # No text node found: insert at offset 0, the start of the content list.
    $class3->splice_content(0, 0, $replace);
  }
  #die qq{Class3 not replaced\n} unless $rep_class3;

  my $edited_html = $root->as_HTML(undef, qq{ }, {});
  return $edited_html;
}

__DATA__
<div align="center" class="class1">something</div>
<div align="center" class="class2">something else</div>
<div align="center" class="class3"></div>
```

Output:

```
<html>
<head>
</head>
<body>
<div align="center" class="class1">something</div>
<div align="center" class="class2">something else</div>
<div align="center" class="class3">replaced</div>
</body>
</html>
```
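If the divs only ever hold plain text, another option is to skip the two-branch logic and simply clear the element before adding the new text. This is a sketch under that assumption, not a drop-in replacement for the code above: `set_div_text` is a hypothetical helper, and unlike the `content_refs_list` loop it also discards any child elements inside the div.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Replace the text of the first div with the given class, whether or not
# the div currently holds any text. set_div_text is a hypothetical helper.
sub set_div_text {
    my ($root, $class, $text) = @_;
    my $div = $root->look_down(_tag => q{div}, class => $class)
        or return 0;
    $div->delete_content;       # drop existing children, if any
    $div->push_content($text);  # add the new text as the only child
    return 1;
}

my $root = HTML::TreeBuilder->new_from_content(
    q{<div class="class1">something</div><div class="class3"></div>}
);
set_div_text($root, $_, q{replaced}) for qw(class1 class3);

my $html = $root->as_HTML(undef, q{ }, {});
$root->delete;
print $html;
```

The same helper could be called once per class in a loop, which would also remove the three near-identical blocks in the original script.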
Re^4: Editing HTML files by spivey49 (Monk) on Jul 15, 2008 at 14:10 UTC

