Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Temporarily strip HTML (solved!)

by epoptai (Curate)
on Jul 20, 2002 at 00:54 UTC ( #183517=note: print w/ replies, xml ) Need Help??


in reply to Temporarily strip HTML

Despite the title of this reply i'm still interested in any more guidance on how to do this properly. Since jeffa asked, this problem involves giving framechat the ability to translate the chatterbox in real time using babelfish.

I couldn't figure out how to implement an HTML::TreeBuilder solution so i went with DamnDirtyApe's code because it works and resembles the way i had tried and failed to solve the problem. His example works as posted but had problems when tested in the wild. So it evolved into the following working example. Two problems: i had to use two global variables, and the links break sometimes when the translation mixes up the order of the html placeholders. Testing with the live chatterbox today revealed that the second problem is relatively rare, but i'm still thinking about how to prevent it.

This test script translates a small sentence of English into a random European language while preserving the HTML:

#!/usr/bin/perl -w use strict; use WWW::Babelfish; use CGI 'header'; use Data::Dumper; my %html = (); my $C = 1; my @langs = qw(German French Spanish Italian Portuguese); shuffle(\@langs); my $str = qq~This <b>contains</b> both text and <a href="http://www.w3c.org"> +html</a>.~; $str = translate($str); print "$langs[0]: ", $str, '<pre>', Dumper(\%html); sub translate { my ($txt) = @_; $txt =~ s|<([^>]+)>|savetags($1)|eg; # weak parsing :-/ my $obj = new WWW::Babelfish( 'agent' => 'DeBabelizer' ); return $_[0] unless defined($obj); my $ttxt = $obj->translate( 'source' => 'English', 'destination' => $langs[0], 'text' => $txt ); return $_[0] unless defined($ttxt); $ttxt = encode($ttxt); # replace placeholders with corresponding html # need ;? cause the fish sometimes screws up that colon $ttxt =~ s|\&lt\;(\d+)\&gt\;?|<$html{$1}>|g; return $ttxt } sub savetags { # replace html tag content with placeholders my $htm = pop; $html{$C} = $htm; # ack, global $_ = '&lt;'.$C.'&gt;'; $C++; # global return $_ } ## and two third-party subs that add to the fun sub shuffle { # Perl Cookbook recipe 4.17 my $array = shift; for(my$i = @$array; --$i;){ my$j = int rand ($i+1); next if $i == $j; @$array[$i,$j] = @$array[$j,$i] } } sub encode { # UTF-8 to latin1 regex from XML::TiePYX (thanks to mirod) my($text) = @_; $text =~ s{([\xc0-\xc3])(.)}{ my $hi = ord($1); my $lo = ord($2); chr((($hi & 0x03) <<6) | ($lo & 0x3F)) }ge; return $text; }
Thanks to everyone who replied to my original query!

--
Check out my Perlmonks Related Scripts like framechat, reputer, and xNN.


Comment on Temporarily strip HTML (solved!)
Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://183517]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (7)
As of 2014-12-25 01:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (159 votes), past polls