Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Converting words to numbers with HOP::Parser

by Ovid (Cardinal)
on Nov 05, 2005 at 21:25 UTC ( #506028=CUFP: print w/ replies, xml ) Need Help??

There are a number of modules which convert numbers to words. I decided to try to reverse this process. The following is buggy and only handles numbers into the millions, but give the word_to_num function the string 'two hundred and ninety six million four hundred and twenty-two thousand five hundred and seventy eight point three four seven' and it will happily return 296422578.347.

This code is based on the parser outlined in "Higher Order Perl" by Dominus. Note that the docs for HOP::Parser are incomplete and you should refer to the book. It assumes the following grammar (there's a bug in the way it handles "zero". I should fix that at some point but this was just a quick hack).

# there's some looseness here to account for people's mangling # of the English language entire_input ::= numbers 'End_Of_Input' numbers ::= [ NEGATIVE ] millions | thousands | hundreds | tens | teens | on +es [ 'POINT' teens | ones { ones } ] millions ::= [ hundreds | teens | tens | ones ] 'MILLION' [ thousands | hundreds | teens | tens | ones ] thousands ::= [ hundreds | teens | tens | ones ] 'THOUSAND' [ hundreds | teens | tens | ones ] hundreds ::= [ teens | tens | ones ] 'HUNDRED' [ teens | tens | +ones ] tens ::= twenty | thirty | forty | fifty | sixty | seventy | eighty | ninety teens ::= ten | eleven | twelve | thirteen | fourteen | fifteen | sixteen | seventeen | eighteen | nineteen ones ::= zero | one | two | three | four | five | six | seven | eight | nine

Cheers,
Ovid

New address of my CGI Course.

And the code:

package Number::FromWord; use warnings; use strict; use HOP::Stream qw/iterator_to_stream/; use HOP::Lexer qw/make_lexer/; use HOP::Parser qw/:all/; use Regexp::Assemble; use Scalar::Util qw/looks_like_number/; use base 'Exporter'; our @EXPORT_OK = 'word_to_num'; my %one_num_for = ( zero => 0, one => 1, two => 2, three => 3, four => 4, five => 5, six => 6, seven => 7, eight => 8, nine => 9, ); my %teen_num_for = ( ten => 10, eleven => 11, twelve => 12, thirteen => 13, fourteen => 14, fifteen => 15, sixteen => 16, seventeen => 17, eighteen => 18, nineteen => 19, ); my %ten_num_for = ( twenty => 20, thirty => 30, forty => 40, fifty => 50, sixty => 60, seventy => 70, eighty => 80, ninety => 90, ); # # set up the lexer # my $ones_re = Regexp::Assemble->new; foreach my $number ( keys %one_num_for ) { $ones_re->add($number); } my $teens_re = Regexp::Assemble->new; foreach my $number ( keys %teen_num_for ) { $teens_re->add($number); } my $tens_re = Regexp::Assemble->new; foreach my $number ( keys %ten_num_for ) { $tens_re->add($number); } my @tokens = ( [ 'AND', qr/\band\b/i, sub { } ], [ 'MILLION', qr/million/i, \&number ], [ 'THOUSAND', qr/thousand/i, \&number ], [ 'HUNDRED', qr/hundred/i, \&number ], [ 'TENS', qr/$tens_re+/i, \&number ], [ 'TEENS', qr/$teens_re+/i, \&number ], [ 'ONES', qr/$ones_re+/i, \&number ], [ 'NEGATIVE', qr/negative/i ], [ 'POINT', qr/point/i ], [ 'SPACE', qr/[ -]/i, sub { } ], ); sub number { my ( $label, $number_as_word ) = @_; return [ $label, lc $number_as_word ]; } sub word_to_num { my @word = shift; my $lexer = make_lexer( sub { shift @word }, @tokens ); return parse($lexer); } # # set up the parser # my ( $numbers, $negative, $point, $ones, $teens, $tens, $hundreds, $thousands, $millions ); my $Point = parser { $point->(@_) }; my $Ones = parser { $ones->(@_) }; my $Teens = parser { $teens->(@_) }; my $Tens = parser { $tens->(@_) }; my $Hundreds = parser { $hundreds->(@_) }; my $Thousands = parser { $thousands->(@_) }; my $Millions = parser { $millions->(@_) }; my $Negative = parser { $negative->(@_) }; my $Numbers = parser { $numbers->(@_) }; # entire_input ::= numbers 'End_Of_Input' my $entire_input = T( concatenate( $Numbers, \&End_of_Input ), sub { s +hift } ); # numbers ::= [ NEGATIVE ] millions | thousands | hundreds | t +ens | teens | ones # [ 'POINT' teens | ones { ones } ] $numbers = T( concatenate( optional( lookfor('NEGATIVE') ), alternate( $Millions, $Thousands, $Hundreds, $Tens, $Teens, $O +nes ), optional( concatenate( absorb( lookfor('POINT') ), alternate( $Teens, concatenate( $Ones, star($Ones) ) ) ) ), ), sub { my ( $neg, $num, $point ) = @_; $point = $point->[0]; if ( looks_like_number($point) ) { # we have teens $num .= ".$point"; } elsif ( defined $point->[0] ) { # we have ones my $decimal = $point->[0]; if ( my @points = @{ $point->[1] } ) { $decimal .= join '', @points; } $num .= ".$decimal"; } if (@$neg) { $num *= -1; } return $num; } ); # millions ::= [ hundreds | teens | tens | ones ] # 'MILLION' # [ thousands | hundreds | teens | tens | ones ] $millions = T( concatenate( optional( alternate( $Hundreds, $Teens, $Tens, $Ones ) ), absorb( lookfor('MILLION') ), optional( alternate( $Thousands, $Hundreds, $Teens, $Tens, $On +es ) ) ), sub { my ( $million, $thousands ) = @_; $million = $million->[0] || 1; $thousands = $thousands->[0] || 0; return $million . sprintf "%06d", $thousands; } ); # thousands ::= [ hundreds | teens | tens | ones ] # 'THOUSAND' # [ hundreds | teens | tens | ones ] $thousands = T( concatenate( optional( alternate( $Hundreds, $Teens, $Tens, $Ones ) ), absorb( lookfor('THOUSAND') ), optional( alternate( $Hundreds, $Teens, $Tens, $Ones ) ) ), sub { my ( $thousand, $hundreds ) = @_; $thousand = $thousand->[0] || 1; $hundreds = $hundreds->[0] || 0; return $thousand . sprintf "%03d", $hundreds; } ); # hundreds ::= [ teens | tens | ones ] 'HUNDRED' [ teens | tens +| ones ] $hundreds = T( concatenate( optional( alternate( $Teens, $Tens, $Ones ) ), absorb( lookfor('HUNDRED') ), optional( alternate( $Teens, $Tens, $Ones ) ) ), sub { my ( $hundred, $tens ) = @_; $hundred = $hundred->[0] || 1; $tens = $tens->[0] || 0; return $hundred . sprintf "%02d", $tens; } ); # tens ::= 'TENS' [ ones ] $tens = T( concatenate( lookfor('TENS'), optional($Ones) ), sub { my ( $ten_word, $one_num ) = @_; my $ten_num = $ten_num_for{$ten_word}; $ten_num += $one_num->[0] || 0; return $ten_num; } ); # teens ::= 'TEENS' $teens = T( lookfor('TEENS'), sub { $teen_num_for{ $_[0] } } ); # ones ::= 'ONES' $ones = T( lookfor('ONES'), sub { $one_num_for{ $_[0] } } ); sub parse { my $stream = iterator_to_stream(shift); my ( $results, $remainder ) = eval { $entire_input->($stream) }; return $results; } 1;

This code also relies on the following functions with which I've extended the HOP::Parser: &absorb, &optional.

You didn't think I'd leave out tests, did you? Of course, they are rather incomplete.

#!perl #use Test::More tests => 1; use Test::More qw/no_plan/; BEGIN { chdir 't' if -d 't'; unshift @INC, '../lib'; use_ok( 'Number::FromWord', 'word_to_num' ); } can_ok __PACKAGE__, 'word_to_num'; my @words = ( 'three' => 3, 'fourteen' => 14, 'forty' => 40, 'forty-two' => 42, 'ninety nine' => 99, 'hundred' => 100, 'hundred and twenty' => 120, 'hundred and two' => 102, 'negative hundred and two' => -102, 'hundred ninety-nine' => 199, 'hundred nineteen' => 119, 'two hundred and three' => 203, 'twelve hundred and seventy three', => 1273, 'thousand' => 1000, 'seven hundred thirty-three thousand five hundred and twenty-nine' +, 733529, 'seventy three point two four five' => 73.245, 'three point seventeen' => 3.17, 'seven million' => 7000000, 'two hundred and ninety six million four hundred and twenty-two th +ousand five hundred and seventy eight', 296_422_578, 'two hundred and ninety six million four hundred and twenty-two th +ousand five hundred and seventy eight point three four seven', 296422578.347, ); while (@words) { my ( $word, $num ) = splice @words, 0, 2; is word_to_num($word), $num, "... $word should be $num"; }

Comment on Converting words to numbers with HOP::Parser
Select or Download Code
Re: Converting words to numbers with HOP::Parser
by Zaxo (Archbishop) on Nov 05, 2005 at 21:46 UTC

    ++Fun! It looks pretty easy to extend to "dozen", "score", and "gross".

    After Compline,
    Zaxo

      I've been thinking about this and wondering about the grammar. It should be easy to extend, but is "one hundred and ten dozen" allowable? If so, it should equal "one hundren and six score", but both seem a bit odd.

      If you want a real challenge, try to add fractions. Should "three and nine-halves" be allowed? What about "one half million and thirteen hundred"? Mixing real numbers should probably be allowed, though. Allowing "3.5 million" seems sensible.

      Cheers,
      Ovid

      New address of my CGI Course.

        If so, it should equal "one hundren and six score", but both seem a bit odd.

        Or just "eleven score."

        I would think that, if dozen was used, it would be the highest denomination (and similarly for score.) So "one hundred and ten dozen" should probably be "eighteen dozen and four."

        Update: Then again, in deciding on what to accept as input, you should probably be liberal as possible and accept even bizarre things like "one thousand one hundred twenty score eight dozen and four" for 1500.

        -sauoq
        "My two cents aren't worth a dime.";
        
Re: Converting words to numbers with HOP::Parser
by eric256 (Parson) on Nov 06, 2005 at 05:08 UTC

    I'm not sure if the post was just showing off HOP::Parser or what but I hacked together the following code just based on intuition. hehe. Tried to figure out exactly how i processed the numbers and make the computer do that.

    use strict; use warnings; use Data::Dumper; use Test::More qw/no_plan/; my $nums_to_value = {}; my $i = 0; for (qw/zero one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eigh +tteen nineteen/) { $nums_to_value->{$_} = $i++; }; $i = 20; for (qw/twenty thirty forty fifty sixty seventy eighty ninety/) { $nums_to_value->{$_} = $i; $i += 10; } my $mag = { hundred => 100, thousand => 1_000, million => 1_000_000, billion => 1_000_000_000, }; #print Dumper($nums_to_value); sub num { my $word = shift; if (index( $word, '-') > 0) { my ($first,$second) = split /-/, $_; return undef unless exists $nums_to_value->{$first} and exists $nums_to_value->{$second}; return $nums_to_value->{$first} + $nums_to_value->{$second}; } else { return undef unless exists $nums_to_value->{$word}; return $nums_to_value->{$word}; } } sub word_to_decimal { my @words = @_; my $out; my $i = 0; for (@words) { my $temp = num($_); $i += length($temp); $out += $temp / 10**$i; } return $out; } sub word_to_num { my $text = shift; my @words = split / /, $text; my @que; my $scratch = 0; my $mod = 1; if ($words[0] eq 'negative') { $mod = -1; shift @words; } while (@words) { $_ = shift @words; my $temp = num($_); if (defined $temp) { $scratch += $temp; } elsif($_ eq 'and') { } elsif($_ eq 'point') { push @que, word_to_decimal(@words); @words = (); } else { my $magnitude = $mag->{$_}; my $sum = $scratch; while (@que && ($sum + $que[-1]) < $magnitude) { $sum += pop @que; } my $temp = ($sum || 1) * $magnitude; push @que, $temp; $scratch = 0; } } push @que, $scratch; my $out = sum(@que); return $mod * $out; } sub sum { my $temp = 0; $temp += $_ for @_; return $temp; } my $words = { 'three' => 3, 'forty' => 40, 'hundred' => 100, 'fourteen' => 14, 'thousand' => 1000, 'forty-two' => 42, 'ninety nine' => 99, 'hundred and two' => 102, 'hundred nineteen' => 119, 'hundred and twenty' => 120, 'hundred ninety-nine' => 199, 'two hundred and three' => 203, 'one thousand two hundred' => 1_200, 'one point one zero zero one' => 1.1001, 'negative hundred and three' => -103, 'twelve hundred and seventy three', => 1273, 'seven hundred thirty-three thousand' => 733_000, 'seven hundred thirty-three thousand five hundred', 733500, 'seven hundred thirty-three thousand five hundred and twenty-nine' +, 733529, 'seventy three point two four five' => 73.245, 'three point seventeen' => 3.17, 'seven million' => 7000000, 'two hundred and ninety six million four hundred and twenty-two th +ousand five hundred and seventy eight', 296_422_578, 'two hundred and ninety six million four hundred and twenty-two th +ousand five hundred and seventy eight point three four seven', 296422578.347, }; for my $word (sort {length $a <=> length $b} keys %$words) { my $num = $words->{$word}; is word_to_num($word), $num, "... $word should be $num"; }

    It passes all your tests, but i'm not sure that means it will pass everything. Anyway just felt like hacking it together. ;)

    Update: Turns out there IS a module for this. Still fun anyway and we have three completly different methods. Lingua::EN::Words2Nums


    ___________
    Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;

      I missed that module when I was searching for it on the CPAN. Bummer.

      I find it interesting what extremely different styles our solutions have. Your's is quite procedural. Using a parser is declarative. What I find interesting about the latter is how one can pretty much rearrange all of the declarative portions of the parser and things still work as intended. This gives one a lot of flexibility.

      It's also (relatively) easy to extend. If I want to turn it into a calculator which responds appropriately to "divide three by seven and add twelve point two", I can do that. Further, it would merely build on what's there and nothing else would need to be touched. I wonder how feasible this is with a procedural model.

      In any event, your speculation was correct. My post was primarily intended to show off the power of the HOP::Parser as opposed to "hey, let's turn words into numbers!"

      Cheers,
      Ovid

      New address of my CGI Course.

        Well with a quick regex you should be able to just grab out portions of text that it will be able to match to numbers. Then you could turn something like "divide three by seven and add twelve point two" into "divide 3 by 7 and 12.2", then you just have to manage the new portions. Frankly the parser version makes no sense at all to me and I have no idea what it is doing at all. A quick (realy quick) explanation might go along way for those of us who haven't read HOP, though I get more and more curious about HOP as these posts come out.


        ___________
        Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;
Re: Converting words to numbers with HOP::Parser
by Anonymous Monk on Mar 29, 2012 at 02:01 UTC

    What to do?

    ok 1 - use Number::FromWord; ok 2 - main->can('word_to_num') ok 3 - ... three should be 3 ok 4 - ... fourteen should be 14 ok 5 - ... forty should be 40 not ok 6 - ... forty-two should be 42 # Failed test '... forty-two should be 42' # at t\01.t line 42. # got: undef # expected: '42' not ok 7 - ... ninety nine should be 99 # Failed test '... ninety nine should be 99' # at t\01.t line 42. # got: undef # expected: '99' ok 8 - ... hundred should be 100 not ok 9 - ... hundred and twenty should be 120 # Failed test '... hundred and twenty should be 120' # at t\01.t line 42. # got: undef # expected: '120' not ok 10 - ... hundred and two should be 102 # Failed test '... hundred and two should be 102' # at t\01.t line 42. # got: undef # expected: '102' not ok 11 - ... negative hundred and two should be -102 # Failed test '... negative hundred and two should be -102' # at t\01.t line 42. # got: undef # expected: '-102' not ok 12 - ... hundred ninety-nine should be 199 # Failed test '... hundred ninety-nine should be 199' # at t\01.t line 42. # got: undef # expected: '199' not ok 13 - ... hundred nineteen should be 119 # Failed test '... hundred nineteen should be 119' # at t\01.t line 42. # got: undef # expected: '119' not ok 14 - ... two hundred and three should be 203 # Failed test '... two hundred and three should be 203' # at t\01.t line 42. # got: undef # expected: '203' not ok 15 - ... twelve hundred and seventy three should be 1273 # Failed test '... twelve hundred and seventy three should be 1273' # at t\01.t line 42. # got: undef # expected: '1273' ok 16 - ... thousand should be 1000 not ok 17 - ... seven hundred thirty-three thousand five hundred and t +wenty-nine should be 733529 # Failed test '... seven hundred thirty-three thousand five hundred +and twenty-nine should be 733529' # at t\01.t line 42. # got: undef # expected: '733529' not ok 18 - ... seventy three point two four five should be 73.245 # Failed test '... seventy three point two four five should be 73.24 +5' # at t\01.t line 42. # got: undef # expected: '73.245' not ok 19 - ... three point seventeen should be 3.17 # Failed test '... three point seventeen should be 3.17' # at t\01.t line 42. # got: undef # expected: '3.17' not ok 20 - ... seven million should be 7000000 # Failed test '... seven million should be 7000000' # at t\01.t line 42. # got: undef # expected: '7000000' not ok 21 - ... two hundred and ninety six million four hundred and tw +enty-two thousand five hundred and seventy eight should be 296422578 # Failed test '... two hundred and ninety six million four hundred a +nd twenty-two thousand five hundred and seventy eight should be 29642 +2578' # at t\01.t line 42. # got: undef # expected: '296422578' not ok 22 - ... two hundred and ninety six million four hundred and tw +enty-two thousand five hundred and seventy eight point three four sev +en should be 296422578.347 # Failed test '... two hundred and ninety six million four hundred a +nd twenty-two thousand five hundred and seventy eight point three fou +r seven should be 296422578.347' # at t\01.t line 42. # got: undef # expected: '296422578.347' 1..22 # Looks like you failed 15 tests of 22.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: CUFP [id://506028]
Approved by shenme
Front-paged by diotalevi
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (17)
As of 2014-12-18 16:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (58 votes), past polls