PerlMonks  

HTML::TreeBuilder: sort a Definition List (<dl>)

by svenXY (Deacon)
on Sep 12, 2005 at 08:48 UTC ( [id://491185]=perlquestion )

svenXY has asked for the wisdom of the Perl Monks concerning the following question:

Enlightened Ones and other Seekers of Wisdom,

For my wife, who is a software translator, I am trying to achieve the following:

I have a glossary in HTML, implemented as a definition list. After translation, the glossary naturally needs to be re-sorted.
I already wrote a solution with regular expressions, but since HTML is hard to parse that way, it is not very efficient so far. Thus I'd like to use HTML::TreeBuilder.

It was quite easy when the glossary was a two-column table (check my scratchpad if you are interested: svenXY's scratchpad), but with a definition list, the problem is that the <dt> and <dd> tags are independent of each other. I can sort the dt tags well enough, but how do I sort the matching dd tags along with them?


I have a solution here, but I don't really like it. I'm sure there are better ways to do it:

    #!/usr/bin/perl -w
    use strict;
    use HTML::TreeBuilder;
    use HTML::PrettyPrinter;
    use Data::Dumper;

    my $html_code = '
    <html>
    <head>
    <title>Glossary</title>
    <h1>Glossary</h1>
    <dl>
    <dt><b>E Definition</b></dt>
    <dd>E - data</dd>
    <p></p>
    <dt><b>B Definition</b></dt>
    <dd>B - data</dd>
    <p></p>
    <dt><b>A_definition</b></dt>
    <dd>A data.</dd>
    <p></p>
    <dt><b>C definition</b></dt>
    <dd>C - data</dd>
    <p></p>
    </dl>
    </body>
    </html>
    ';

    my %glossar;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html_code);
    my ($dl) = $tree->look_down('_tag', 'dl');
    my %data;

    # looping through the dt tags,
    # spawning a hash with the text of dt as key
    # and the HTML of dt and dd as values
    for my $dt ($dl->look_down("_tag", "dt")) {
        my $key = lc($dt->as_text);
        $data{$key}{'dt'} = $dt->as_HTML;
        my $dd = $dt->right;
        $data{$key}{'dd'} = $dd->as_HTML;
    }

    # create a string
    my $output;
    foreach (sort {lc($a) cmp lc($b)} keys %data) {
        $output .= $data{$_}{'dt'} . $data{$_}{'dd'} . "<p></p>";
    }

    # feed the string to a new Parser Object
    my $new_dl = HTML::TreeBuilder->new;
    $new_dl->parse($output);
    my $nu_aber = ();

    # remove unnecessary tags
    $nu_aber = $new_dl->guts();

    # replace old dl with new dl
    $dl->delete_content();
    $dl->push_content($nu_aber);

    my $hpp = new HTML::PrettyPrinter (
        'linelength'      => 130,
        'quote_attr'      => 1,
        'allow_forced_nl' => 1,
        'entities'        => "&<>äöüßÄÖÜ");
    $hpp->set_force_nl(1,qw(body head table tr td));
    $hpp->nl_before(2,qw(tr td p));
    my $linearray_ref = $hpp->format($tree);
    print @{$linearray_ref};
    $tree = $tree->destroy;

My main problem is to properly dereference the tree and to replace the DL part of the tree with a sorted array of HTML::Element Objects without having to create and parse code first.
Any hints greatly appreciated,
svenXY

Replies are listed 'Best First'.
Re: HTML::TreeBuilder: sort a Definition List (<dl>)
by Tanktalus (Canon) on Sep 12, 2005 at 18:59 UTC

    For playing with HTML, I prefer enforcing XHTML and working with that instead, using XML::Twig. For example:

        use strict;
        use warnings;
        use XML::Twig;

        my $html_code = '
        <html>
        <head>
        <title>Glossary</title>
        </head><body> <!-- had to close head, open body -->
        <h1>Glossary</h1>
        <dl>
        <dt><b>E Definition</b></dt>
        <dd>E - data</dd>
        <p></p>
        <dt><b>B Definition</b></dt>
        <dd>B - data</dd>
        <p></p>
        <dt><b>A_definition</b></dt>
        <dd>A data.</dd>
        <p></p>
        <dt><b>C definition</b></dt>
        <dd>C - data</dd>
        <p></p>
        </dl>
        </body>
        </html>
        ';

        my $twig = XML::Twig->new(pretty_print => 'indented');
        $twig->parse($html_code);

        for my $dl ($twig->root()->get_xpath('//dl')) {
            my @entries;
            for my $el ($dl->children()) {
                $el->cut();
                if ($el->gi() eq 'dt') {
                    push @entries, [ $el ];
                }
                else {
                    push @{$entries[-1]}, $el;
                }
            }
            @entries = sort { $a->[0]->text() cmp $b->[0]->text() } @entries;
            for my $entry (@entries) {
                $_->paste(last_child => $dl) for @$entry;
            }
            print $dl->sprint(),"\n";
        }

    Tested. The only caveat is if you start getting funky characters that aren't part of standard XML, e.g., "&copy;". Then it's a bit more work. Still doable, but more work.
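
    One possible shape for that extra work, as a rough sketch only: XML predefines just &amp;, &lt;, &gt;, &quot; and &apos;, so named HTML entities could be rewritten as numeric character references before the text reaches XML::Twig. The %html_entity table and the xmlify_entities helper are illustrative names, not part of XML::Twig, and the table is deliberately tiny; a real glossary would need a fuller list.

    ```perl
    use strict;
    use warnings;

    # Hypothetical, deliberately small table: HTML entity name => code point.
    my %html_entity = (
        copy  => 169,
        nbsp  => 160,
        auml  => 228,
        ouml  => 246,
        uuml  => 252,
        szlig => 223,
    );

    # Rewrite known named entities as numeric references. Anything not in
    # the table (including XML built-ins such as &amp;) is left untouched.
    sub xmlify_entities {
        my ($text) = @_;
        $text =~ s/&([A-Za-z][A-Za-z0-9]*);/
            exists $html_entity{$1} ? "&#$html_entity{$1};" : "&$1;"/ge;
        return $text;
    }

    print xmlify_entities('<dd>&copy; 2005 M&uuml;ller &amp; Co.</dd>'), "\n";
    ```

    The cleaned string would then be passed to $twig->parse() in place of the raw HTML.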

Re: HTML::TreeBuilder: sort a Definition List (<dl>)
by skillet-thief (Friar) on Sep 12, 2005 at 19:25 UTC

    I'm not sure I understand your question. The code you already have seems to do most of the tricky stuff, i.e. getting the data out of the HTML.

    If I were doing this (but I'm not fast enough to just whip out code right now), I think I would delete the <dt> and <dd> objects as I read them (there are a couple of methods for doing this, IIRC). Then I would sort them as HTML::Element objects, using a big Schwartzian Transform. Once you get an array of sorted HTML::Element objects, you can reattach the whole thing into the dl.

    Assuming that is what you wanted to do... ;-)

    Good luck.

    sub jf { print substr($_[0], -1); jf( substr($_[0], 0, length($_[0])-1)) if length $_[0] > 1; } jf('gro.alubaf@yehaf');

      ++skillet-thief, I agree with your design; the code at the bottom implements it. It is slightly more complex, to handle the tags other than DT and DD that can exist in the DL.

      Notable issues in the OP code:

      • $tree->destroy should be $tree->delete.
      • You use $tree->parse without using $tree->eof! From the HTML::TreeBuilder docs:
        $root->eof()
        This signals that you're finished parsing content into this tree; this runs various kinds of crucial cleanup on the tree. This is called for you when you call $root->parse_file(...), but not when you call $root->parse(...). So if you call $root->parse(...), then you must call $root->eof() once you've finished feeding all the chunks to parse(...), and before you actually start doing anything else with the tree in $root.
        Using new_from_content or new_from_file would also prevent the problem.
      • You say:
        my ($dl) = $tree->look_down('_tag', 'dl');
        This means "scan *everywhere* in $tree to find all the DL tags, and put the first DL tag found into $dl". Why ask for them all and take the first? Instead, ask for *only* the first DL, by calling look_down in scalar context.
        my $dl = $tree->look_down('_tag', 'dl');
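
      The three points above can be seen together in a short sketch. The sample markup is made up for illustration; the method calls (parse, eof, look_down, as_text, delete, new_from_content) are the ones named in this thread and in the HTML::TreeBuilder docs.

      ```perl
      use strict;
      use warnings;
      use HTML::TreeBuilder;

      # Made-up sample input, kept short for the example.
      my $html = '<html><body><dl><dt>B</dt><dd>b</dd>'
               . '<dt>A</dt><dd>a</dd></dl></body></html>';

      # Option 1: parse() must be followed by eof() before the tree is used.
      my $tree = HTML::TreeBuilder->new;
      $tree->parse($html);
      $tree->eof;

      # Scalar context: look_down returns only the first matching element.
      my $dl = $tree->look_down( _tag => 'dl' );

      # List context: all matching elements, in document order.
      my @terms = map { $_->as_text } $dl->look_down( _tag => 'dt' );
      print "@terms\n";

      $tree->delete;    # frees the tree; there is no ->destroy

      # Option 2: new_from_content parses and calls eof for you.
      my $tree2 = HTML::TreeBuilder->new_from_content($html);
      my $first = $tree2->look_down( _tag => 'dt' )->as_text;
      print "$first\n";
      $tree2->delete;
      ```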

      Working, tested code:

          #!/usr/bin/perl -W
          use strict;
          use HTML::TreeBuilder;

          my $tree = HTML::TreeBuilder->new_from_content(<<'END') or die;
          <html>
          <head>
          <title>Glossary</title>
          <h1>Glossary</h1>
          <dl>
          <dt><b>E Definition</b></dt>
          <dd>E - data</dd>
          <p></p>
          <dt><b>B Definition</b></dt>
          <dd>B - data</dd>
          <p></p>
          <dt><b>A_definition</b></dt>
          <dd>A data.</dd>
          <p></p>
          <dt><b>C definition</b></dt>
          <dd>C - data</dd>
          <p></p>
          </dl>
          </body>
          </html>
          END

          my $dl = $tree->look_down( _tag => 'dl' );

          # Unlink all of $dl's children from $dl, and return them.
          my @dl_content = $dl->detach_content();

          # Group the tags into an AoA on the DT tag.
          my @dt_tag_clusters;
          foreach (@dl_content) {
              push @dt_tag_clusters, [] if $_->tag() eq 'dt';
              die "Tags occurred before first DT" unless @dt_tag_clusters;
              push @{ $dt_tag_clusters[-1] }, $_;
          }

          # Sort the clusters
          @dt_tag_clusters =
              map  { $_->[1] }
              sort { $a->[0] cmp $b->[0] }
              map  { [ $_->[0]->as_HTML, $_ ] }
              @dt_tag_clusters;

          # Un-cluster the tags.
          @dl_content = map { @$_ } @dt_tag_clusters;

          # Replace the DL's content with the sorted tags.
          $dl->push_content( @dl_content );

          print $tree->as_HTML; # or use HTML::PrettyPrinter

          $tree = $tree->delete();

        Thanks everybody!
        @Util - perfect! Exactly what I was looking for!
        I really liked the way you created the clusters. Then it took me some time to understand the map-sort-map (until I found it in the Cookbook) and the un-clustering (well, I didn't really understand that one, but I can take it as given).
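
        For readers who get stuck at the same two steps, here is a minimal sketch of the map-sort-map and the un-clustering in isolation. Plain strings stand in for the HTML::Element objects (an assumption made to keep it runnable without any modules), and lc is used as the sort key purely for illustration.

        ```perl
        use strict;
        use warnings;

        # Each cluster is an array ref: one "dt" followed by whatever trails it.
        # Strings are stand-ins for HTML::Element objects.
        my @clusters = (
            [ 'dt:E', 'dd:E', 'p' ],
            [ 'dt:B', 'dd:B', 'p' ],
            [ 'dt:A', 'dd:A', 'p' ],
        );

        # map-sort-map (read bottom to top):
        #   1. wrap each cluster as [ sort_key, cluster ]
        #   2. sort the wrappers by their key
        #   3. unwrap, keeping only the cluster
        my @sorted =
            map  { $_->[1] }
            sort { $a->[0] cmp $b->[0] }
            map  { [ lc $_->[0][0], $_ ] }
            @clusters;

        # Un-clustering: @$_ expands each array ref into its elements, and
        # map concatenates those lists into one flat list.
        my @flat = map { @$_ } @sorted;

        print join( ', ', @flat ), "\n";
        # prints: dt:A, dd:A, p, dt:B, dd:B, p, dt:E, dd:E, p
        ```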

        Not only did you solve my problem, but you also greatly enhanced my understanding of Perl and added to my toolbox of solutions to common problems!

        One small note though:
        Mapping like this: map { [ $_->[0]->as_HTML, $_ ] } leads to problems when the dt element contains more tags (some of mine contain links as well), because the markup becomes part of the sort key. It is therefore better to use map { [ $_->[0]->as_text, $_ ] }, or even to apply further processing to the text, such as lc and (at least in Germany) umlaut handling.
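
        A rough sketch of such a key function, under stated assumptions: glossary_key and %fold are hypothetical names, and the folding rule (ä sorts as a, ö as o, ü as u, ß as ss) is only one common German convention; phone-book style ordering would use ae, oe, ue instead.

        ```perl
        use strict;
        use warnings;
        use utf8;                               # this source contains literal umlauts
        binmode STDOUT, ':encoding(UTF-8)';

        # Assumed folding table; both cases are listed so the result does not
        # depend on how lc treats non-ASCII characters.
        my %fold = (
            'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
            'Ä' => 'a', 'Ö' => 'o', 'Ü' => 'u',
        );

        # Build a case-insensitive, umlaut-folded sort key from a term's text.
        sub glossary_key {
            my ($text) = @_;
            ( my $key = $text ) =~ s/([äöüßÄÖÜ])/$fold{$1}/g;
            $key =~ s/^\s+|\s+$//g;             # ignore surrounding whitespace
            return lc $key;
        }

        my @terms  = ( 'Zeile', 'Übersetzung', 'apfel', 'Ärger', 'Beispiel' );
        my @sorted =
            map  { $_->[1] }
            sort { $a->[0] cmp $b->[0] }
            map  { [ glossary_key($_), $_ ] }
            @terms;

        print join( ', ', @sorted ), "\n";
        # prints: apfel, Ärger, Beispiel, Übersetzung, Zeile
        ```

        In the cluster sort, the key step would then read map { [ glossary_key( $_->[0]->as_text ), $_ ] }.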

        More than happy,
        svenXY

Node Type: perlquestion [id://491185]
Approved by marto
Front-paged by planetscape