Enlightened Ones and other Seekers of Wisdom,
For my wife, who is a software translator, I am trying to achieve the following:
I have a glossary in HTML, implemented as a definition list. After translation, the glossary naturally needs to be re-sorted.
I already wrote a solution with regular expressions, but since HTML is hard to parse that way, it is not very efficient so far.
Thus I'd like to use HTML::TreeBuilder.
It was quite easy when the glossary was a two-column table (check my scratchpad if you are interested:
svenXY's scratchpad), but with a definition list,
the problem is that the <dt> and <dd> tags are siblings, independent of each other. I can sort the dt tags well enough, but how do I sort the matching dd tags along with them?
I have a solution here, but I don't really like it. I'm sure there are better ways to do it:
#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder;
use HTML::PrettyPrinter;
use Data::Dumper;
my $html_code = '
<html>
<head>
<title>Glossary</title>
</head>
<body>
<h1>Glossary</h1>
<dl>
<dt><b>E Definition</b></dt>
<dd>E - data</dd>
<p></p>
<dt><b>B Definition</b></dt>
<dd>B - data</dd>
<p></p>
<dt><b>A_definition</b></dt>
<dd>A data.</dd>
<p></p>
<dt><b>C definition</b></dt>
<dd>C - data</dd>
<p></p>
</dl>
</body>
</html>
';
my %glossar;
my $tree = HTML::TreeBuilder->new;
$tree->parse($html_code);
$tree->eof;    # signal end of input so the parser flushes any buffered text
my ($dl) = $tree->look_down('_tag', 'dl');
my %data;
# looping through the dt tags,
# spawning a hash with the text of dt as key
# and the HTML of dt and dd as values
for my $dt ($dl->look_down("_tag", "dt")) {
my $key = lc($dt->as_text);
$data{$key}{'dt'} = $dt->as_HTML;
my $dd = $dt->right;
$data{$key}{'dd'} = $dd->as_HTML;
}
# create a string
my $output;
foreach (sort {lc($a) cmp lc($b)} keys %data) {
$output .= $data{$_}{'dt'} . $data{$_}{'dd'} . "<p></p>";
}
# feed the string to a new Parser Object
my $new_dl = HTML::TreeBuilder->new;
$new_dl->parse($output);
$new_dl->eof;    # finish parsing before asking for the guts
# remove unnecessary wrapper tags
my $nu_aber = $new_dl->guts();
# replace old dl with new dl
$dl->delete_content();
$dl->push_content($nu_aber);
my $hpp = HTML::PrettyPrinter->new(
'linelength' => 130,
'quote_attr' => 1,
'allow_forced_nl' => 1,
'entities' => "&<>äöüßÄÖÜ");
$hpp->set_force_nl(1,qw(body head table tr td));
$hpp->nl_before(2,qw(tr td p));
my $linearray_ref = $hpp->format($tree);
print @{$linearray_ref};
$tree = $tree->destroy;
My main problem is how to properly dereference the tree and replace the content of the DL element with a sorted array of HTML::Element objects, without having to generate and re-parse HTML code first.
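To make clearer what I am after, here is a rough sketch of the direction I have in mind. It is not a finished solution: it assumes that every <dt> is immediately followed by its <dd>, it drops anything else inside the list, and the sub name sort_glossary is made up purely for illustration. The idea is to keep each dt/dd pair together as one sortable unit, detach the existing elements from the <dl>, and push the very same objects back in sorted order.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parses $html and re-sorts the first <dl> by the
# lowercased text of each <dt>, moving the existing elements around
# instead of building and re-parsing an HTML string.
sub sort_glossary {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;

    # Scalar context: look_down returns the first match only.
    my $dl = $tree->look_down(_tag => 'dl') or return $tree;

    # Pair every <dt> with the <dd> that directly follows it.
    my @pairs;
    for my $dt ($dl->look_down(_tag => 'dt')) {
        my $dd = $dt->right;    # scalar context: immediate right sibling
        next unless ref $dd && $dd->tag eq 'dd';   # assumption: dt then dd
        push @pairs, [ lc $dt->as_text, $dt, $dd ];
    }

    # Empty the list without destroying its elements, then re-attach
    # the pairs in sorted order, each followed by a fresh <p></p> spacer.
    $dl->detach_content;
    for my $pair (sort { $a->[0] cmp $b->[0] } @pairs) {
        $dl->push_content($pair->[1], $pair->[2], HTML::Element->new('p'));
    }
    return $tree;
}
```

Calling sort_glossary($html_code) with the string from my script above should give back a tree that can be handed to the pretty printer in place of $tree. Whether this is the cleanest way to do it is exactly what I am unsure about.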
Any hints greatly appreciated,
svenXY