PerlMonks  


A year or so ago, I started working on a project we inherited from our Japanese parent company. As part of that, we received a substantial amount of source code, all commented in Japanese. We tried several ways to translate that source code, but the translations were mostly nonsensical and usually ended up breaking the code.

Then someone in the company pointed out that we could translate comments individually with Babelfish by opening the source in Microsoft Word, then copying and pasting into Internet Explorer (so the multi-byte character encoding would be handled properly). The light bulb turned on, and I started working on a program to translate entire files.

For any CodeWright users out there, I discovered that CW's built-in perl is too broken to mess with, so I abandoned the macro idea. Instead, I focused on translating the entire file from the command line.

In the last few weeks, I finally figured out how to translate without horribly disfiguring the code. Since Japanese phrases consist of a sequence of characters with no spaces, I focus on chunks of non-whitespace for the translation, and replace the translated text in-place, thus avoiding the whitespace-munging Babelfish unfortunately performs. This also made it really easy to build a translation dictionary to avoid repeat lookups on Babelfish.
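To make the in-place idea concrete, here is a minimal sketch, not the program itself. It assumes a hypothetical `fake_translate` lookup table standing in for the Babelfish call, and a `%cache` hash standing in for the dictionary. Only non-whitespace chunks containing non-ASCII characters are replaced, so the surrounding whitespace is never touched.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the remote translator: a fixed lookup table.
# The key is the three-character Japanese word for "initialization".
my %fake_service = (
    "\x{521D}\x{671F}\x{5316}" => 'initialization',
);
my $lookups = 0;    # counts calls to the stand-in service

sub fake_translate {
    my ($chunk) = @_;
    $lookups++;
    return $fake_service{$chunk};    # undef if no translation is known
}

my %cache;    # chunk => translation (undef means "tried and failed")

sub translate_line {
    my ($line) = @_;
    for my $chunk ($line =~ /(\S+)/g) {
        # Pure-ASCII chunks are code or English, so leave them alone.
        next unless $chunk =~ /[^\x00-\x7F]/;

        # Ask the service only the first time a chunk is seen.
        $cache{$chunk} = fake_translate($chunk)
            unless exists $cache{$chunk};

        my $trans = $cache{$chunk};
        next unless defined $trans;

        # Replace just this chunk; all other text and whitespace is kept.
        $line =~ s/\Q$chunk\E/$trans/;
    }
    return $line;
}

my $input = "    int x = 0;\t// \x{521D}\x{671F}\x{5316}\n";
my $out   = translate_line($input);
print $out;
```

Running `translate_line` on the same input a second time is answered from `%cache`, so `$lookups` stays at 1. The real script additionally strips leading comment markers such as `//` and `/*` from each chunk before looking it up.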

I developed the program with Jcode version 0.68 and a patched version of WWW::Babelfish 0.09. The Patch.

#!perl -w

use strict;

###########################################################################
# jtoeng.pl
#
# A Japanese to English file translator
# by Brett T. Warden
# NEC Eluminant Technologies, Inc.
# Created 21 February 2001
# Lastmod 15 October 2001
###########################################################################

use Jcode;
use WWW::Babelfish;
use Storable qw(nfreeze thaw);
use File::Basename;
#use Data::Dumper;

my $DEBUG = 0;

print "\nConnecting to translator... please wait.\n\n" if $DEBUG;
my $babel = new WWW::Babelfish();
die( "Babelfish server unavailable\n" ) unless defined($babel);

my %dict;

if(open(DICT, '< jtoeng.dict')) {
    binmode(DICT);
    local($/);
    my $frozen = <DICT>;
    if(my $ref = thaw($frozen)) {
        # Yeah this is inefficient.  Ideally I'd use a DB anyway.
        %dict = %{$ref};
    }
    close(DICT);
}

if(@ARGV) {
    ARG: for(@ARGV) {
        print "Trying to read $_\n";
        if(open(IFILE, "<" . $_)) {
            binmode(IFILE);

            my ($name, $path, $suffix) = fileparse($_, '\..*');
            my $outfile = $path . $name . '.english' . $suffix;

            print "Preparing $outfile\n";
            if(open(OFILE, ">" . $outfile)) {
                binmode(OFILE);
                my $fh = select(OFILE);
                $| = 1;
                select($fh);

                my $TRANSLATIONS = 0;
                my $BABELFISHINGS = 0;

                # Translate
                print "Translating $_\n";
                translate(\*IFILE, \*OFILE, \$TRANSLATIONS, \$BABELFISHINGS);
                close(OFILE);
                print "Performed $TRANSLATIONS translations, of which $BABELFISHINGS were directly requested from Babelfish.\n";
                print "Translation complete\n\n";
            }
            else {
                die "Unable to write $outfile: $!\n";
            }
            close(IFILE);
        }
        else {
            warn "Unable to read $_: $!\n";
            next ARG;
        }
    }
}
else {
    my $TRANSLATIONS = 0;
    my $BABELFISHINGS = 0;
    translate(\*STDIN, \*STDOUT, \$TRANSLATIONS, \$BABELFISHINGS);
    print "Performed $TRANSLATIONS translations, of which $BABELFISHINGS were directly requested from Babelfish.\n" if $DEBUG;
}

sub translate {
    my $IFH = shift or return;
    my $OFH = shift or return;
    my $TRANSLATIONS = shift;
    my $BABELFISHINGS = shift;

    LINE: while(my $text = <$IFH>) {
        # If it's ascii, then it doesn't need to be translated?
        my $code = getcode($text) || '';
        print "Line coding: $code\n" if($code and $DEBUG);
        unless($code eq 'ascii') {
            if($code) {
                # Not ascii, run through Jcode.
                my $j = Jcode->new($text);
                $text = $j->utf8;
            }

            my @chunks = $text =~ m!(\S+)!g;

            CHUNK: for(@chunks) {
                my $chunk = $_;
                my $chunk_code = getcode($chunk) || '';
                next CHUNK if($chunk_code and ($chunk_code eq 'ascii'));

                $chunk =~ s!^//!!;
                $chunk =~ s!^#+!!;
                $chunk =~ s!^/\*+!!;
                $chunk =~ s!\*/$!!;

                print "Chunk: $chunk\n" if $DEBUG;

                my $trans;

                if(exists($dict{$chunk})) {
                    if(defined($dict{$chunk})) {
                        $trans = $dict{$chunk};
                        print "Dictionary: $chunk = $trans\n" if $DEBUG;
                        $text =~ s!\Q$chunk!$trans!;
                        $$TRANSLATIONS++ if $TRANSLATIONS;
                    }
                    else {
                        print "Skipping $chunk -- translation failed previously.\n" if $DEBUG;
                    }
                }
                else {
                    print "\n" if $DEBUG;
                    print "Translating: $chunk\n" if $DEBUG;

                    $trans = $babel->translate(
                        source      => 'Japanese',
                        destination => 'English',
                        text        => $chunk,
                        delimiter   => "\n",
                    );

                    if(defined($trans)) {
                        # Replace those annoying &nbsp;s that Babelfish loves.
                        $trans =~ s!&nbsp;! !g;
                        chomp $trans;

                        if($trans =~ m!^\s*$!) {
                            # Babelfish returned nothing.
                            print "No useful translation returned.\n" if $DEBUG;
                            sleep 2 if $DEBUG;
                            # Make an entry in the dict in case somebody
                            # wants to try to translate it later.
                            $dict{$chunk} = undef;
                            $chunk = '';
                            $trans = '';
                        }
                        else {
                            $$TRANSLATIONS++ if $TRANSLATIONS;
                            $$BABELFISHINGS++ if $BABELFISHINGS;

                            if($chunk ne $trans) {
                                # Answer looks useful.  Use it and keep it.
                                $text =~ s!\Q$chunk!$trans!;
                                $dict{$chunk} = $trans;
                                print "Translated\n\t$chunk\nto\n\t$trans\n" if $DEBUG;
                            }
                            else {
                                # Store a placeholder in the dict
                                # so we don't waste time sending it to
                                # Babelfish again
                                $dict{$chunk} = undef;
                                print "Babelfish returned what we sent it.\n" if $DEBUG;
                            }

                            if((my $freeze = nfreeze(\%dict)) and
                               open(DICT, "> jtoeng.dict")) {
                                binmode(DICT);
                                print DICT $freeze;
                                close(DICT);
                            }
                        }
                    }
                    else {
                        warn "Lookup on $chunk failed.\n";
                    }
                }
            }
        }

        print "\n\n<" . '-' x 79 . "\n" if $DEBUG;
        print $text if $DEBUG;
        print '-' x 79 . ">\n\n" if $DEBUG;

        print $OFH $text;
    }
    return;
}

Adding the dictionary is a bit of a hack, as I just used Storable. An enhancement would be to use a database instead, providing concurrency protection and speed improvements. The current approach, however, requires much less user setup.
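One possible shape for that enhancement, sketched under assumptions: the dictionary hash is tied to an on-disk DBM file using `SDBM_File` (which ships with Perl), so each new entry is written as it is added instead of refreezing the whole hash. The file name `jtoeng_dict` and the temporary directory are made up for the example. SDBM is only a stand-in here: it does no locking and limits the size of each key/value pair, so real concurrency protection would still need a proper database or explicit file locking.

```perl
use strict;
use warnings;
use Fcntl;                       # O_RDWR, O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

# Hypothetical location; SDBM creates <base>.pag and <base>.dir files.
my $dir  = tempdir(CLEANUP => 1);
my $base = "$dir/jtoeng_dict";

# Tie the dictionary to disk: every store goes straight to the file.
my %dict;
tie(%dict, 'SDBM_File', $base, O_RDWR | O_CREAT, 0666)
    or die "Cannot open dictionary $base: $!\n";

# Keys are byte strings (here the UTF-8 bytes of a Japanese word),
# because DBM files store bytes, not Perl character strings.
my $chunk = "\xE5\x88\x9D\xE6\x9C\x9F\xE5\x8C\x96";
$dict{$chunk} = 'initialization';
untie %dict;

# Reopen the file, as a later run of the translator would.
my %again;
tie(%again, 'SDBM_File', $base, O_RDWR, 0666)
    or die "Cannot reopen dictionary $base: $!\n";
my $stored = $again{$chunk};
print "Cached translation: $stored\n";
untie %again;
```

A DBM file cannot store `undef`, so the "translation failed previously" placeholder used in the script would need a different sentinel, for example an empty string.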



--isotope
http://www.skylab.org/~isotope/

Edit - Petruchio Sun Oct 21 10:30:26 UTC 2001: Added READMORE tag.


In reply to Translating source code from Japanese to English using WWW::Babelfish by isotope
