Converting tesseract box data into 2d grid

by Anonymous Monk
on Jan 28, 2015 at 23:43 UTC ( [id://1114833]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm using Tesseract to OCR images of puzzles so I can read the puzzle data. The problem is that Tesseract collapses whitespace, which loses the positions of the characters it finds. It does have an option to output the bounding-box coordinates of each character, but I'm not sure how to convert those back into rows and columns to reproduce the original puzzle layout. An example input image: https://i.imgur.com/dspLfTI.png and the resulting box data is:

N 68 115 79 127 0
K 120 115 128 127 0
A 145 115 155 127 0
L 246 115 253 127 0
B 46 91 54 103 0
I 69 91 77 103 0
C 95 91 103 103 0
C 119 91 127 103 0
Y 145 91 154 103 0
D 169 91 179 103 0
I 195 91 203 103 0
T 218 91 228 103 0
Z 245 91 255 103 0
I 269 91 277 103 0
I 45 65 49 77 0
I 45 65 54 77 0
O 68 65 79 77 0
L 96 65 103 77 0
I 119 65 127 77 0
E 146 65 153 77 0
D 169 65 179 77 0
P 197 65 204 77 0
B 220 65 228 77 0
H 245 65 255 77 0
O 268 65 279 77 0
S 295 65 304 77 0
I 45 41 53 53 0
I 69 41 77 53 0
L 96 41 103 53 0
V 120 41 129 53 0
E 146 41 153 53 0
V 170 41 179 53 0
N 194 41 205 53 0
U 219 41 229 53 0
Y 245 41 254 53 0
Z 269 41 279 53 0
L 296 41 303 53 0
T 18 15 28 27 0
S 45 15 54 27 0
E 70 15 77 27 0
C 95 15 103 27 0
I 119 15 123 27 0
I 119 15 128 27 0
E 146 15 153 27 0
N 168 15 179 27 0
E 196 15 203 27 0
O 218 15 229 27 0
T 244 15 254 27 0
Y 269 15 278 27 0
U 295 15 305 27 0
E 320 15 327 27 0

(Note that tesseract assigns 0,0 to the lower left corner)

I can parse that using something like:

for my $line (split "\n", $boxdata) {
    my ($chr, $x1, $y1, $x2, $y2, $page) = $line =~ m{
        ^ (\S) \ (\d+) \ (\d+) \ (\d+) \ (\d+) \ (\d+) $
    }x;
}

But I need to figure out how to convert that to this:

my @grid = (
    [split '', '  N KA   L   '],
    [split '', ' BICCYDITZI  '],
    [split '', ' ROLIEDPBHOS '],
    [split '', ' IILVEVNUYZL '],
    [split '', 'TSECRENEOTYUE'],
);
p @grid;

[
    [0] [ [0] " ", [1] " ", [2] "N", [3] " ", [4] "K", [5] "A", [6] " ", [7] " ", [8] " ", [9] "L", [10] " ", [11] " ", [12] " " ],
    [1] [ [0] " ", [1] "B", [2] "I", [3] "C", [4] "C", [5] "Y", [6] "D", [7] "I", [8] "T", [9] "Z", [10] "I", [11] " ", [12] " " ],
    [2] [ [0] " ", [1] "R", [2] "O", [3] "L", [4] "I", [5] "E", [6] "D", [7] "P", [8] "B", [9] "H", [10] "O", [11] "S", [12] " " ],
    [3] [ [0] " ", [1] "I", [2] "I", [3] "L", [4] "V", [5] "E", [6] "V", [7] "N", [8] "U", [9] "Y", [10] "Z", [11] "L", [12] " " ],
    [4] [ [0] "T", [1] "S", [2] "E", [3] "C", [4] "R", [5] "E", [6] "N", [7] "E", [8] "O", [9] "T", [10] "Y", [11] "U", [12] "E" ]
]

Re: Converting tesseract box data into 2d grid
by BrowserUk (Patriarch) on Jan 29, 2015 at 00:17 UTC

    Update: BTW, there is an error in your data compared to the image; the R at the start of the middle line has come out as an I.

    Inverting the table is left as an exercise, as is programmatically deriving the magic numbers :):

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    $Data::Dump::WIDTH = 200;
    use List::Util qw[ min ];

    my %pos;
    while( <DATA> ) {
        my( $c, $x1, $y1, $x2, $y2 ) = split ' ';
        ## Bucket each char by row (y1 / 12) and column (x1 / 10.5)
        $pos{ int( $y1 / 12 ) }{ int( $x1 / 10.5 ) } = $c;
    }

    ## The lowest row is the full one; use its columns as the template
    my $firstY = min( keys %pos );
    my @firstXs = sort{ $a <=> $b } keys %{ $pos{ $firstY } };

    ## Fill any column missing from the other rows with a space
    for my $y ( grep $_ != $firstY, keys %pos ) {
        $pos{ $y }{ $_ } //= ' ' for @firstXs;
    }

    pp \%pos;

    __DATA__
    ... from the OP

    Output:

    C:\test>1114833
    {
      1 => { 1 => "T", 4 => "S", 6 => "E", 9 => "C", 11 => "I", 13 => "E", 16 => "N", 18 => "E", 20 => "O", 23 => "T", 25 => "Y", 28 => "U", 30 => "E" },
      3 => { 1 => " ", 4 => "I", 6 => "I", 9 => "L", 11 => "V", 13 => "E", 16 => "V", 18 => "N", 20 => "U", 23 => "Y", 25 => "Z", 28 => "L", 30 => " " },
      5 => { 1 => " ", 4 => "I", 6 => "O", 9 => "L", 11 => "I", 13 => "E", 16 => "D", 18 => "P", 20 => "B", 23 => "H", 25 => "O", 28 => "S", 30 => " " },
      7 => { 1 => " ", 4 => "B", 6 => "I", 9 => "C", 11 => "C", 13 => "Y", 16 => "D", 18 => "I", 20 => "T", 23 => "Z", 25 => "I", 28 => " ", 30 => " " },
      9 => { 1 => " ", 4 => " ", 6 => "N", 9 => " ", 11 => "K", 13 => "A", 16 => " ", 18 => " ", 20 => " ", 23 => "L", 25 => " ", 28 => " ", 30 => " " },
    }
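
For what it's worth, the "inverting" step could look something like the sketch below. It is only an illustration, not BrowserUk's code: it assumes the `%pos` hash printed above (copied in literally here), sorts the row keys in descending order because Tesseract's origin is the lower left corner, and sorts the column keys in ascending order. The `@grid` name follows the OP's example.

```perl
use strict;
use warnings;

# The hash as printed in the output above (rows keyed by int(y1/12),
# columns keyed by int(x1/10.5)), copied in so the sketch is self-contained.
my %pos = (
    1 => { 1 => "T", 4 => "S", 6 => "E", 9 => "C", 11 => "I", 13 => "E", 16 => "N", 18 => "E", 20 => "O", 23 => "T", 25 => "Y", 28 => "U", 30 => "E" },
    3 => { 1 => " ", 4 => "I", 6 => "I", 9 => "L", 11 => "V", 13 => "E", 16 => "V", 18 => "N", 20 => "U", 23 => "Y", 25 => "Z", 28 => "L", 30 => " " },
    5 => { 1 => " ", 4 => "I", 6 => "O", 9 => "L", 11 => "I", 13 => "E", 16 => "D", 18 => "P", 20 => "B", 23 => "H", 25 => "O", 28 => "S", 30 => " " },
    7 => { 1 => " ", 4 => "B", 6 => "I", 9 => "C", 11 => "C", 13 => "Y", 16 => "D", 18 => "I", 20 => "T", 23 => "Z", 25 => "I", 28 => " ", 30 => " " },
    9 => { 1 => " ", 4 => " ", 6 => "N", 9 => " ", 11 => "K", 13 => "A", 16 => " ", 18 => " ", 20 => " ", 23 => "L", 25 => " ", 28 => " ", 30 => " " },
);

my @grid;
# Largest y first: the box coordinates count upwards from the bottom of the image.
for my $y ( sort { $b <=> $a } keys %pos ) {
    # Columns left to right, numerically (a plain string sort would put 11 before 4).
    my @cols = sort { $a <=> $b } keys %{ $pos{$y} };
    push @grid, [ map { $pos{$y}{$_} } @cols ];
}

print join( '', @$_ ), "\n" for @grid;
```

With the data as posted, the first line printed is `  N KA   L   ` and the last is `TSECIENEOTYUE`; both R's still show up as I because of the OCR error already mentioned.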

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
      Thanks, this looks great! I'm guessing the magic numbers can be found by averaging the dimensions from all the bounding boxes.
        "I'm guessing the magic numbers can be found by averaging the dimensions from all the bounding boxes."

        Actually no. I tried that; but the average x-dimension comes out at 8.5882352941176470588235294117647; which was no good at all.

        I'm afraid I cheated a little. The Y dimension was obvious as all the Ys come out as 12.

        For the X, I inspected the data and guessed at 10, but that put too many extra spaces in:

        C:\test>1114833
        {
          1 => { 1 => "T", 4 => "S", 7 => "E", 9 => "C", 11 => "I", 14 => "E", 16 => "N", 19 => "E", 21 => "O", 24 => "T", 26 => "Y", 29 => "U", 32 => "E" },
          3 => { 1 => " ", 4 => "I", 6 => "I", 7 => " ", 9 => "L", 11 => " ", 12 => "V", 14 => "E", 16 => " ", 17 => "V", 19 => "N", 21 => "U", 24 => "Y", 26 => "Z", 29 => "L", 32 => " " },
          5 => { 1 => " ", 4 => "I", 6 => "O", 7 => " ", 9 => "L", 11 => "I", 14 => "E", 16 => "D", 19 => "P", 21 => " ", 22 => "B", 24 => "H", 26 => "O", 29 => "S", 32 => " " },
          7 => { 1 => " ", 4 => "B", 6 => "I", 7 => " ", 9 => "C", 11 => "C", 14 => "Y", 16 => "D", 19 => "I", 21 => "T", 24 => "Z", 26 => "I", 29 => " ", 32 => " " },
          9 => { 1 => " ", 4 => " ", 6 => "N", 7 => " ", 9 => " ", 11 => " ", 12 => "K", 14 => "A", 16 => " ", 19 => " ", 21 => " ", 24 => "L", 26 => " ", 29 => " ", 32 => " " },
        }

        So then I tried 11 but it still put one extra in the middle row:

        C:\test>1114833
        {
          1 => { 1 => "T", 4 => "S", 6 => "E", 8 => "C", 10 => "I", 13 => "E", 15 => "N", 17 => "E", 19 => "O", 22 => "T", 24 => "Y", 26 => "U", 29 => "E" },
          3 => { 1 => " ", 4 => "I", 6 => "I", 8 => "L", 10 => "V", 13 => "E", 15 => "V", 17 => "N", 19 => "U", 22 => "Y", 24 => "Z", 26 => "L", 29 => " " },
          5 => { 1 => " ", 4 => "I", 6 => "O", 8 => "L", 10 => "I", 13 => "E", 15 => "D", 17 => "P", 19 => " ", 20 => "B", 22 => "H", 24 => "O", 26 => "S", 29 => " " },
          7 => { 1 => " ", 4 => "B", 6 => "I", 8 => "C", 10 => "C", 13 => "Y", 15 => "D", 17 => "I", 19 => "T", 22 => "Z", 24 => "I", 26 => " ", 29 => " " },
          9 => { 1 => " ", 4 => " ", 6 => "N", 8 => " ", 10 => "K", 13 => "A", 15 => " ", 17 => " ", 19 => " ", 22 => "L", 24 => " ", 26 => " ", 29 => " " },
        }

        So then I tried 10.5 and voilà!

        I also tried dividing the overall width of the longest line by the number of chars: (327 - 18) / 13 = 23.769...

        And just now I tried taking the differences of all the x1s in the longest row:

        T 18 15 28 27 0
        S 45 15 54 27 0 -> 27
        E 70 15 77 27 0 -> 25
        C 95 15 103 27 0 -> 25
        I 119 15 128 27 0 -> 24
        E 146 15 153 27 0 -> 27
        N 168 15 179 27 0 -> 22
        E 196 15 203 27 0 -> 28
        O 218 15 229 27 0 -> 22
        T 244 15 254 27 0 -> 26
        Y 269 15 278 27 0 -> 25
        U 295 15 305 27 0 -> 26
        E 320 15 327 27 0 -> 25

        Then averaging those: (27 + 25 + 25 + 24 + 27 + 22 + 28 + 22 + 26 + 25 + 26 + 25) / 12 = 302 / 12 = 25.166666666666666666666666666667

        An' wadda ya know. It works perfectly!
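
As a rough sketch of how that last idea might be automated (an illustration under stated assumptions, not code from this thread): take the row with the most distinct x1 values, average its consecutive x1 differences to get the column pitch, then snap every character to the nearest column. It assumes every character in a row shares the same y1 (true of the OP's data), that the longest row has no gaps, and it only uses the top and bottom rows of the OP's data to keep things short. The names `%rows`, `$pitch`, `$left` and `@grid` are made up for the example.

```perl
use strict;
use warnings;

# Excerpt of the OP's box data: the top row and the (full) bottom row.
my $boxdata = <<'END';
N 68 115 79 127 0
K 120 115 128 127 0
A 145 115 155 127 0
L 246 115 253 127 0
T 18 15 28 27 0
S 45 15 54 27 0
E 70 15 77 27 0
C 95 15 103 27 0
I 119 15 123 27 0
I 119 15 128 27 0
E 146 15 153 27 0
N 168 15 179 27 0
E 196 15 203 27 0
O 218 15 229 27 0
T 244 15 254 27 0
Y 269 15 278 27 0
U 295 15 305 27 0
E 320 15 327 27 0
END

# Group characters by row (y1) and by x1 within the row.
# A repeated x1 (the R that was read as two I's) overwrites the earlier box.
my %rows;
for my $line ( split /\n/, $boxdata ) {
    my ( $chr, $x1, $y1 ) = split ' ', $line;
    $rows{$y1}{$x1} = $chr;
}

# Assumption: the row with the most distinct x1 values has no gaps.
my ($longest) = sort { scalar( keys %{ $rows{$b} } ) <=> scalar( keys %{ $rows{$a} } ) } keys %rows;
my @xs = sort { $a <=> $b } keys %{ $rows{$longest} };

# The mean of the consecutive x1 differences telescopes to (last - first) / (count - 1).
my $left  = $xs[0];
my $pitch = ( $xs[-1] - $xs[0] ) / ( @xs - 1 );

my @grid;
for my $y ( sort { $b <=> $a } keys %rows ) {    # top of the image first
    my @row = (' ') x @xs;                       # start with an all-blank row
    for my $x ( keys %{ $rows{$y} } ) {
        my $col = int( ( $x - $left ) / $pitch + 0.5 );    # nearest column
        $row[$col] = $rows{$y}{$x};
    }
    push @grid, \@row;
}

print join( '', @$_ ), "\n" for @grid;
```

On this excerpt the pitch comes out as 302 / 12 = 25.1666..., and the two lines printed are `  N KA   L   ` and `TSECIENEOTYUE`. Rounding to the nearest column, rather than truncating `$x1 / $pitch`, is a choice made for the sketch so that small left/right jitter in x1 is less likely to cross a column boundary.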
      You're right about the bad data. I think I have to preprocess the image a bit, maybe add a threshold and blur filter before doing the OCR.
