Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine

Translating Japanese to English with WWW::Babelfish

by isotope (Chaplain)
on Feb 22, 2001 at 00:54 UTC ( #60033=perlquestion: print w/replies, xml ) Need Help??
isotope has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a programming project where the code is documented in Japanese (S-JIS). Since most of us here can't read Japanese, we've been trying a lot of things to get it translated. My IDE is CodeWright, which conveniently includes a Perl interpreter for macros. My hope is to be able to highlight a Japanese string and translate it on the fly. I'm attempting to write a macro for it eventually, but for now I'm writing a standalone script:
#!perl -w use strict; use Jcode; my $DEBUG = 1; my $text = ''; for(<>) { if($DEBUG) { my ($code) = getcode($_); print "Chunk encoded as: " . $code . "\n"; } my $j = Jcode->new($_); $text .= $j->utf8 . "\n"; } print "\nText to send:\n" . $text . "\n" if $DEBUG; print "\nConnecting to translator... please wait.\n\n"; use WWW::Babelfish; my $obj = new WWW::Babelfish(); die( "Babelfish server unavailable\n" ) unless defined($obj); print "\nTranslating... this may take a loooong time.\n\n"; my $english = $obj->translate( source => 'Japanese', destination => 'English', text => $text, delimiter => '\n', ); print "\nTranslation: \n\n"; print $english; print "\n";
If it were an ASCII-friendly language like French, or German, I wouldn't have any trouble. But since it's Japanese, I figured I'd have to meddle with the encoding and put it in UTF-8 for Babelfish... I used Jcode to do this, but I'm not sure WWW::Babelfish is robust enough to handle the multi-byte encodings... any pointers would be appreciated...


Replies are listed 'Best First'.
From the author of WWW:Babelfish
by Tuna (Friar) on Feb 22, 2001 at 10:15 UTC
    The author is a friend of mine, currently traveling in Asia. We have been keeping in touch via email, since I'm meeting him in Thailand in a couple of weeks to go rock climbing :-). Anyway, I forwarded him your problem, and his response is below:

    Hmm... I've had a couple of bug reports about the Babelfish module recently. Possibly they've changed the format of their page again (gnashing of teeth). Fortunately someone sent me a patch; unfortunately it's virtually impossible for me to try it out, so I'm sending it along to you to give it a whirl if you want to. To try it, just replace with this. I don't think it's going to address your problem, though. When I try going to the Babelfish URL and entering text to be translated from English to Japanese, it spits back characters and not a Roman transliteration. If I then highlight and paste those characters back into the translation field at the URL (, it pastes them as Roman garbage and doesn't successfully translate anything. To be honest, I'm not sure how the whole character set thing works; this wasn't an issue when I first wrote the module since it only handled languages based on Roman text. Maybe it's as simple as putting a character set specifier into the HTML header before sending out the text? I'd love to work on this, but unfortunately I have no access to a Unix development environment at the moment. Of course, if you come up with a patch, send it along... Dan
    package WWW::Babelfish; use strict; use vars qw($VERSION @ISA @EXPORT @EXPORT_OK); require Exporter; require AutoLoader; @ISA = qw(Exporter AutoLoader); # Items to export into callers namespace by default. Note: do not expo +rt # names by default without a very good reason. Use EXPORT_OK instead. # Do not simply export all your public functions/methods/constants. @EXPORT = qw(); $VERSION = '0.09-bharat'; # Preloaded methods go here. use HTTP::Request::Common qw(POST); use LWP::UserAgent; use HTML::TokeParser; use IO::String; my $BABELFISH = ''; my $BABELFISH_URL = ' +yn?'; my $MAXCHUNK = 1000; # Maximum number of characters # Bablefish will translate at one time my $MAXRETRIES = 5; # Maximum number of retries for a chunk of text $| = 1; sub new { my ($this, @args) = @_; my $class = ref($this) || $this; my $self = {}; bless $self, $class; return undef unless( $self->initialize(@args) ); return $self; } sub initialize { my($self, %params) = @_;; # Caller can set user agent; we default to "script:WWW::Babelfish/ +0.01" $self->{agent} = $params{agent} || $0 . ":" . __PACKAGE__ . "/" . +$VERSION; $self->{proxy} = $params{proxy} if defined $params{proxy}; # Get the page my $ua = new LWP::UserAgent; $ua->proxy('http','http://' . $self->{proxy}) if defined $self->{p +roxy}; $ua->agent($self->{agent}); my $req = new HTTP::Request('GET' => $BABELFISH_URL); my $res = $ua->request($req); unless($res->is_success){ warn(__PACKAGE__ . ":" . $res->status_line); return 0; } my $page = $res->content; # Extract the mapping of languages to options to be passed back, # and store it on our object in "Lang2opt" hash # and extract language names and store on "Langs" hash my $p = HTML::TokeParser->new(\$page); my $a2b; if( $p->get_tag("select") ){ while( $_ = $p->get_tag("option") ){ $a2b = $p->get_trimmed_text; $self->{Lang2opt}->{$a2b} = $_->[1]{value}; $a2b =~ /(\S+)\sto\s(\S+)/; $self->{Langs}->{$1} = ""; $self->{Langs}->{$2} = ""; } } return 1; } sub languages { my $self = shift; return sort keys %{$self->{Langs}}; } sub translate { my ($self, %params) = @_; # Paragraph separator is "\n\n" by default $/ = $params{delimiter} || "\n\n"; undef $self->{error}; unless( exists($self->{Langs}->{$params{source}}) ){ $_ = "Language \"" . $params{source} . "\" is not available"; $self->{error} = $_; warn(__PACKAGE__ . ": " . $_); return undef; } unless( exists($self->{Langs}->{$params{destination}}) ){ $_ = "Language \"" . $params{destination} . "\" is not available" +; $self->{error} = $_; warn(__PACKAGE__ . ": " . $_); return undef; } # This "feature" is actually useful as a pass-thru filter. # Babelfish doesn't do same-to-same anyway (though it would be # pretty interesting if it did) return $params{text} if $params{source} eq $params{destination}; my $langopt = $self->{Lang2opt}->{$params{source} . " to " . $params +{destination}}; my $th; # "Text Handle" if( ref $params{text} ){ # We've been passed a filehandle $th = $params{text}; } else{ # We've been passed a string $th = new IO::String($params{text}); } my $Text = ""; my $WANT_STRING_RETURNED = 0; unless( defined $params{ofh} ){ $params{ofh} = new IO::String($Text); $WANT_STRING_RETURNED = 1; } # Variables we use in the next mega-block my $para; # paragraph my $num_paras = 0; # number of paragraphs my $transpara; # translated paragraph my $para_start_ws = ""; # initial whitespace in paragraph my $chunk; # paragraph piece to feed to babelfish my $req; # LWP request object my $ua; # LWP user agent my $res; # LWP result my $text; # translated chunk my $i; # a counter while($para = <$th>){ $num_paras++; $transpara = ""; # Extract any leading whitespace from the start of the paragraph # Babelfish will eat it anyway. if ($para =~ s/(^\s+)(\S)/$2/) { $para_start_ws = $1 || ""; } chomp $para; # Remove the para delimiter CHUNK: foreach $chunk ( $self->_chunk_text($MAXCHUNK, $para) ) { $req = POST ($BABELFISH_URL, [ 'doit' => 'done', 'urltext' => $c +hunk, 'lp' => $langopt, 'Submit' => 'Translate' ]); $ua = new LWP::UserAgent; $ua->proxy('http','http://' . $self->{proxy}) if defined $self-> +{proxy}; RETRY: for($i = 0; $i <= $MAXRETRIES; $i++){ $res = $ua->request($req); if( $res->is_success ){ $text = $self->_extract_text($res->as_string); next RETRY if $text =~ /^\*\*time-out\*\*/; # in-band signal +ling; yuck $text =~ s/\n$//; # Babelfish likes to append newlines $transpara .= $text; next CHUNK; } } $self->{error} = "Request timed out more than $MAXRETRIES times" +; return undef; } print { $params{ofh} } $/ if $num_paras > 1; print { $params{ofh} } $para_start_ws . $transpara; } close $params{ofh}; if( $WANT_STRING_RETURNED ){ return $Text; } else{ return 1; } } sub error { my $self = shift; return $self->{error}; } # Given a maximum chunk size and some text, return # an array of pieces of the text chopped up in a # logical way and less than or equal to the chunk size sub _chunk_text { my($self, $max, $text) = @_; my @result; # The trivial case return($text) if length($text) <= $max; # Hmmm. There are a couple of ways we could do this. # I'm guessing that Babelfish doesn't look at any structure larger + than # a sentence; in fact I'm often tempted to guess that it doesn't l +ook # at anything larger than a word, but we'll give it the benefit of + the doubt. my $total = length($text); my $offset = 0; my $lastoffset = 0; my $test; my $chunk; while( ($total - $lastoffset) > $max) { $test = $lastoffset + $max; # Split by terminal punctuation... @_ = sort {$b <=> $a} ( rindex($text, '.', $test), rindex($text, '!', $test), rindex($text, '?', $test), ); $offset = shift(@_) + 1; # or by clause... if( $offset == -1 or $offset <= $lastoffset ){ @_ = sort {$b <=> $a} ( rindex($text, ',', $test), rindex($text, ';', $test), rindex($text, ':', $test), ); $offset = shift(@_) + 1; # or by word if( $offset == -1 or $offset <= $lastoffset){ $offset = rindex($text, " ", $test); } # or give up return undef if $offset == -1; } $chunk = substr($text, $lastoffset, $offset - $lastoffset); push( @result, $chunk); $lastoffset = $offset; } push( @result, substr($text, $lastoffset) ); return @result; } # Extract the text from the html we get back from babelfish and return # it (keying on the fact that it's the first thing after a <br> tag, # possibly removing a textarea tag after it). sub _extract_text { my($self, $html) = @_; my $p = HTML::TokeParser->new(\$html); my ($tag,$token); my $text=""; if ($tag = $p->get_tag('br')) { while ($token = $p->get_token) { if (shift(@{$token}) ne "T") { if (defined($text) and $text ne "") { last; } else { next; } } $text .= shift(@{$token}); #$text =~ s/[\r\n]//g; # This patch for whitespace handling from Olivier Scherler $text =~ s/[\r\n]/ /g; $text =~ s/^\s*//; $text =~ s/\s+/ /g; $text =~ s/\s+$//; } } return $text; } # Autoload methods go after =cut, and are processed by the autosplit p +rogram. 1; __END__ =head1 NAME WWW::Babelfish - Perl extension for translation via babelfish =head1 SYNOPSIS use WWW::Babelfish; $obj = new WWW::Babelfish( 'agent' => 'Mozilla/8.0', 'proxy' => 'myp +roxy' ); die( "Babelfish server unavailable\n" ) unless defined($obj); $french_text = $obj->translate( 'source' => 'English', 'destination' => 'French', 'text' => 'My hovercraft is full of +eels', 'delimiter' => "\n\t", 'ofh' => \*STDOUT ); die("Could not translate: " . $obj->error) unless defined($french_te +xt); @languages = $obj->languages; =head1 DESCRIPTION Perl interface to the WWW babelfish translation server. =head1 METHODS =over 4 =item new Creates a new WWW::Babelfish object. It can take a named argument for user agent. =item languages Returns a plain array of the languages available for translation. =item translate Translates some text using Babelfish. Parameters: source: Source language destination: Destination language text: If this is a reference, translate interprets it as an open filehandle to read from. Otherwise, it is treated as a string to translate. delimiter: Paragraph delimiter for the text; the default is "\n\n". Note that this is a string, not a regexp. ofh: Output filehandle; if provided, the translation will be written to this filehandle. If no ofh parameter is given, translate will return the text; otherwis +e it will return 1. On failure it returns undef. =item error Returns a (hopefully) meaningful error string. =back =head1 NOTES Babelfish translates 1000 characters at a time. This module tries to break the source text into reasonable logical chunks of less than 1000 characters, feeds them to Babelfish and then reassembles them. Formatting may get lost in the process. =head1 AUTHOR Dan Urist, =head1 SEE ALSO perl(1). =cut ~
    HTH, Steve
      *light bulb illuminates*
      I re-examined the Babelfish form HTML source... I was thinking it was specifying utf8 as the form encoding type, and was preparing to patch to use that encoding, but when I looked more carefully, I realized what I had seen was a hidden field... enc=utf8.
      Patching as follows does the job, as long as the input has been converted to utf8:
      --- WWW/ Wed Feb 21 10:33:42 2001 +++ WWW/ Thu Feb 22 08:56:09 2001 @@ -148,7 +148,7 @@ CHUNK: foreach $chunk ( $self->_chunk_text($MAXCHUNK, $para) ) { - $req = POST ($BABELFISH_URL, [ 'doit' => 'done', 'urltext' => $ +chunk, 'lp' => $langopt, 'Submit' => 'Translate' ]); + $req = POST ($BABELFISH_URL, [ 'doit' => 'done', 'urltext' => $ +chunk, 'lp' => $langopt, 'Submit' => 'Translate', 'enc' => 'utf8' ]); $ua = new LWP::UserAgent; $ua->proxy('http','http://' . $self->{proxy}) if defined $self- +>{proxy};

        Could you please help me to run your scripts ? I have no experience in Perl, but I managed to install Perl, download your scripts, install Jcode and Babelfish. But when I substitute Babelfish with your script it gives lots of errors. My internet is through a proxy connection so it says it has no internet connection, this is first problem I think, then I don't understand how you input the files that need to be processed. Thanks!
Re: Translating Japanese to English with WWW::Babelfish
by Anonymous Monk on Feb 22, 2001 at 01:04 UTC
    I think you should rethink using Babelfish to translate. Although it (babelfish) works well with most western european languages, it really REALLY fouls up sentences when it comes to translating moderate Japanese.
      In this case, Babelfish has worked quite well, since I'm only translating short phrases (comments). I'm just trying to shortcut the process of opening source files in Word then copying and pasting into IE, the combination of which is the only way I've successfully fed the Shift-JIS from the source files to Babelfish as UTF-8... Perl seems like the perfect tool for this job.

        watashi wa nihon ga suki des
Re: Translating Japanese to English with WWW::Babelfish
by Anonymous Monk on Jul 03, 2002 at 11:54 UTC
Re: Translating Japanese to English with WWW::Babelfish
by Anonymous Monk on Mar 15, 2003 at 22:51 UTC
    3年式 ワインレッド 5速 検なし 社外エアロ 足廻り マフラ[ アルミ タワ[バ[ タイベル交換済み 他


      3 year type wine lead 5 fast inspection it is not imitation aero underside muffler aluminum tower bar tie bell intersection exchanging being completed other things

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://60033]
Approved by root
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (7)
As of 2018-06-24 23:21 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (126 votes). Check out past polls.