PerlMonks  

Re: PDF alternative to mudrow to get XML structure

by marto (Archbishop)
on Mar 05, 2020 at 18:31 UTC


in reply to PDF alternative to mudrow to get XML structure

A recent suggestion.

Re^2: PDF alternative to mudrow to get XML structure
by Anonymous Monk on Mar 05, 2020 at 18:49 UTC

    Thank you. Which pdftohtml do you mean? The only one I know is pdftohtml (https://www.xpdfreader.com/pdftohtml-man.html) from XpdfReader, but it has no -xml option.

      That's another pdftohtml. Poppler was forked from Xpdf, so there are now popular PDF-related utilities with identical names but different abilities.

      That pdftohtml doesn't report per-character coordinates, though, only the per-span (usually per-line) spatial extent. I may be mistaken, but presumably per-character coordinates are what you were after (the "my stuff" above), or you wouldn't have gone to such lengths just to extract text. BTW, for a few years now "mudraw" has (thankfully) been renamed and is invoked as

          >mutool.exe draw -F stext document.pdf 2>nul
          <?xml version="1.0"?>
          <document name="(null)">
          <page width="841.836" height="595.238">
          <block bbox="50.172 35.47 805.83609 48.81">
          <line bbox="50.172 35.47 129.62201 48.81" wmode="0" dir="1 0">
          <font name="Times-Roman" size="10">
          <char quad="50.172 35.47 55.172 35.47 50.172 48.81 55.172 48.81" x="50.172" y="46" c="2"/>
          <char quad="55.172 35.47 57.952 35.47 55.172 48.81 57.952 48.81" x="55.172" y="46" c="/"/>
          <char quad="57.952 35.47 62.952 35.47 57.952 48.81 62.952 48.81" x="57.952" y="46" c="2"/>
          <char quad="62.952 35.47 67.951999 35.47 62.952 48.81 67.951999 48.81" x="62.952" y="46" c="7"/>
          <char quad="67.951999 35.47 70.731998 35.47 67.951999 48.81 70.731998 48.81" x="67.951999" y="46" c="/"/>
          <char quad="70.731998 35.47 75.731998 35.47 70.731998 48.81 75.731998 48.81" x="70.731998" y="46" c="2"/>
          <char quad="75.731998 35.47 80.731998 35.47 75.731998 48.81 80.731998 48.81" x="75.731998" y="46" c="0"/>
          <char quad="80.731998 35.47 85.731998 35.47 80.731998 48.81 85.731998 48.81" x="80.731998" y="46" c="2"/>
          <char quad="85.731998 35.47 90.731998 35.47 85.731998 48.81 90.731998 48.81" x="85.731998" y="46" c="0"/>
          <char quad="90.731998 35.47 93.232 35.47 90.731998 48.81 93.232 48.81" x="90.731998" y="46" c=" "/>
          <char quad="93.232 35.47 98.232 35.47 93.232 48.81 98.232 48.81" x="93.232" y="46" c="3"/>
          <char quad="98.232 35.47 101.012 35.47 98.232 48.81 101.012 48.81" x="98.232" y="46" c=":"/>
          <char quad="101.012 35.47 106.012 35.47 101.012 48.81 106.012 48.81" x="101.012" y="46" c="0"/>
          <char quad="106.012 35.47 111.012 35.47 106.012 48.81 111.012 48.81" x="106.012" y="46" c="7"/>
          <char quad="111.012 35.47 113.512 35.47 111.012 48.81 113.512 48.81" x="111.012" y="46" c=" "/>
          <char quad="113.512 35.47 120.732 35.47 113.512 48.81 120.732 48.81" x="113.512" y="46" c="A"/>
          <char quad="120.732 35.47 129.62201 35.47 120.732 48.81 129.62201 48.81" x="120.732" y="46" c="M"/>
          </font>
          </line>
          ...

      (note that stderr output is suppressed; otherwise the XML would be interspersed with "doc this, page that" messages)
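If you only need a plain rectangle per character, the quad can be reduced to one. Below is a rough sketch, not part of mutool: the sub name stext_chars is made up for illustration, and it assumes the attribute order seen in the listing above (quad, x, y, c) and that a regex is good enough for this flat output. A real XML parser would be the safer choice for anything serious.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper (not part of mutool): pull every <char> out of
# "mutool draw -F stext" output and reduce its quad to a plain bbox.
# Assumes the attribute order shown in the listing (quad, x, y, c).
# XML entities inside c="..." are left undecoded.
sub stext_chars {
    my ($xml) = @_;
    my @chars;
    while ( $xml =~ m{<char\s+quad="([^"]+)"\s+x="([^"]+)"\s+y="([^"]+)"\s+c="([^"]*)"\s*/>}g ) {
        my ( $quad, $x, $y, $c ) = ( $1, $2, $3, $4 );
        my @q  = split ' ', $quad;      # four corner points, x then y
        my @xs = @q[ 0, 2, 4, 6 ];
        my @ys = @q[ 1, 3, 5, 7 ];
        push @chars, {
            c        => $c,
            baseline => [ $x, $y ],     # glyph origin on the baseline
            bbox     => [ min(@xs), min(@ys), max(@xs), max(@ys) ],
        };
    }
    return @chars;
}

# Two lines copied from the listing above.
my $sample = <<'XML';
<char quad="50.172 35.47 55.172 35.47 50.172 48.81 55.172 48.81" x="50.172" y="46" c="2"/>
<char quad="55.172 35.47 57.952 35.47 55.172 48.81 57.952 48.81" x="55.172" y="46" c="/"/>
XML

for my $ch ( stext_chars($sample) ) {
    printf "%-2s baseline (%s, %s) bbox (%s)\n",
        $ch->{c}, @{ $ch->{baseline} }, join( ', ', @{ $ch->{bbox} } );
}
```

Taking min/max over all four corners rather than trusting a fixed corner order keeps the sketch usable for rotated text, at the cost of returning the enclosing axis-aligned box instead of the quad itself.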

      The alternative is Ghostscript, of course:

          >gswin64c -q -sDEVICE=txtwrite -dTextFormat=1 -o - document.pdf
          <page>
          <block>
          <line>
          <span bbox="50 46 130 46" font="Times-Roman" size="10.0000">
          <char bbox="50 46 55 46" c="2"/>
          <char bbox="55 46 58 46" c="/"/>
          <char bbox="58 46 63 46" c="2"/>
          <char bbox="63 46 68 46" c="7"/>
          <char bbox="68 46 71 46" c="/"/>
          <char bbox="71 46 76 46" c="2"/>
          <char bbox="76 46 81 46" c="0"/>
          <char bbox="81 46 86 46" c="2"/>
          <char bbox="86 46 91 46" c="0"/>
          <char bbox="91 46 93 46" c=" "/>
          <char bbox="93 46 98 46" c="3"/>
          <char bbox="98 46 101 46" c=":"/>
          <char bbox="101 46 106 46" c="0"/>
          <char bbox="106 46 111 46" c="7"/>
          <char bbox="111 46 114 46" c=" "/>
          <char bbox="114 46 121 46" c="A"/>
          <char bbox="121 46 130 46" c="M"/>
          </span>
          </line>
          </block>
          ...

      (note that this bbox is not really a box: both y values are the same baseline position, so take "size" into account to get the height).
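To make that concrete, here is a minimal sketch (my own illustration, not a Ghostscript feature) that gives each <char> a rough rectangle by borrowing the size of the enclosing <span>. The sub name gs_char_boxes and the 0.8 / 0.2 ascent / descent fractions are assumptions; y is taken to grow downward, as in the listing above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the baseline-only bbox of each <char> in
# "-sDEVICE=txtwrite -dTextFormat=1" output into a rough rectangle.
# Assumptions: y grows downward, both bbox y values are the baseline,
# and the default ascent / descent fractions of the font size are guesses.
sub gs_char_boxes {
    my ( $xml, $ascent, $descent ) = @_;
    $ascent  = 0.8 unless defined $ascent;
    $descent = 0.2 unless defined $descent;
    my ( @boxes, $size );
    while ( $xml =~ m{<span\b[^>]*\bsize="([\d.]+)"|<char\s+bbox="([-\d.]+) ([-\d.]+) ([-\d.]+) ([-\d.]+)"\s+c="([^"]*)"}g ) {
        if ( defined $1 ) { $size = $1; next }    # new span: remember its size
        next unless defined $size;                # <char> outside a <span>: skip
        my ( $x0, $y, $x1, $c ) = ( $2, $3, $4, $6 );
        push @boxes, {
            c      => $c,
            x0     => $x0,
            x1     => $x1,
            top    => $y - $size * $ascent,       # above the baseline
            bottom => $y + $size * $descent,      # below the baseline
        };
    }
    return @boxes;
}

# A fragment copied from the listing above.
my $gs_sample = <<'XML';
<span bbox="50 46 130 46" font="Times-Roman" size="10.0000">
<char bbox="50 46 55 46" c="2"/>
<char bbox="55 46 58 46" c="/"/>
</span>
XML

for my $b ( gs_char_boxes($gs_sample) ) {
    printf "%s: x %s..%s, y %.1f..%.1f\n", @{$b}{qw(c x0 x1 top bottom)};
}
```

The fractions are only a guess at where typical glyphs sit relative to the baseline; real ascent and descent values would have to come from the font itself.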

      ###################

      At best a pure Perl solution?

      Oh yes, it's possible: see CAM::PDF. Chris laid a beautiful foundation, a huge amount of work. Some aspects are not really finished, though nothing is impossible with due diligence. Let's take the file from the recent PDF question, then:

          use strict;
          use warnings;
          use CAM::PDF;

          my $d = CAM::PDF-> new( 'document.pdf' );
          my $t = $d-> getPageContentTree( 1 );
          $t-> render( 'CAM::PDF::Renderer::Dump' );

          __END__

          ( 50.17, 549.24) ( 50.17, 549.24) 2/27/2020
          ( 93.23, 549.24) ( 93.23, 549.24) 3:07
          ( 113.51, 549.24) ( 113.51, 549.24) AM
          ( 677.77, 549.24) ( 677.77, 549.24) Quotations
          ( 724.17, 549.24) ( 724.17, 549.24) Due
          ( 743.33, 549.24) ( 743.33, 549.24) By:
          ( 760.28, 549.24) ( 760.28, 549.24) 01/22/2020
          ( 288.40, 533.24) ( 288.40, 533.24) ABSTRA
          ( 344.41, 533.24) ( 344.41, 533.24) CT
          ( 367.36, 533.24) ( 367.36, 533.24) OF
          ( 390.30, 533.24) ( 390.30, 533.24) UNSTRAPPED
          ( 487.91, 533.24) ( 487.91, 533.24) (A
          ....

      Something close to what you wanted? This "content tree" can be an enormous structure and can easily eat 100+ MB for a complex page. It follows the drawing instructions as they flow during content interpretation, and each node has a "graphics state" attached that is updated as it all proceeds. See the source for an approximate idea; of course "The PDF Reference" is the ultimate authority, and you can't avoid it if you are serious about PDF.

      CAM::PDF can take different "plugins" (renderers) to traverse (render) this tree. CAM::PDF::Renderer::Dump is a primitive example. Now, somewhat closer to the "per-character coordinates" goal:

      MyTestRenderer.pm:

          package MyTestRenderer;

          use strict;
          use warnings;
          use base 'CAM::PDF::GS';

          sub new {
              my ( $class, @args ) = @_;
              my $self = $class-> SUPER::new( @args );
              $self-> { mode } = 'c';    # split into characters
              return $self
          }

          sub renderText {
              my ( $self, $string, $width ) = @_;
              my $fontsize = $self-> { Tfs };
              my ( $xu, $yu ) = $self-> textToUser( 0, 0 );
              my ( $xd, $yd ) = $self-> userToDevice( $xu, $yu );
              printf "(x = %5.1f, y = %5.1f) (w = %.3f, h = %3.1f) %s\n",
                  $xd, $yd, $width, $fontsize, $string;
              return;
          }

          1;

          use strict;
          use warnings;
          use CAM::PDF;
          use lib '.';

          my $d = CAM::PDF-> new( 'document.pdf' );
          my $t = $d-> getPageContentTree( 1 );
          $t-> render( 'MyTestRenderer' );

          __END__

          (x =  50.2, y = 549.2) (w = 0.500, h = 10.0) 2
          (x =  55.2, y = 549.2) (w = 0.278, h = 10.0) /
          (x =  58.0, y = 549.2) (w = 0.500, h = 10.0) 2
          (x =  63.0, y = 549.2) (w = 0.500, h = 10.0) 7
          (x =  68.0, y = 549.2) (w = 0.278, h = 10.0) /
          (x =  70.7, y = 549.2) (w = 0.500, h = 10.0) 2
          (x =  75.7, y = 549.2) (w = 0.500, h = 10.0) 0
          (x =  80.7, y = 549.2) (w = 0.500, h = 10.0) 2
          (x =  85.7, y = 549.2) (w = 0.500, h = 10.0) 0
          (x =  93.2, y = 549.2) (w = 0.500, h = 10.0) 3
          (x =  98.2, y = 549.2) (w = 0.278, h = 10.0) :
          (x = 101.0, y = 549.2) (w = 0.500, h = 10.0) 0
          (x = 106.0, y = 549.2) (w = 0.500, h = 10.0) 7
          (x = 113.5, y = 549.2) (w = 0.722, h = 10.0) A
          (x = 120.7, y = 549.2) (w = 0.889, h = 10.0) M
          (x = 677.8, y = 549.2) (w = 0.722, h = 10.0) Q
          (x = 685.0, y = 549.2) (w = 0.500, h = 10.0) u
          (x = 690.0, y = 549.2) (w = 0.500, h = 10.0) o
          (x = 695.0, y = 549.2) (w = 0.278, h = 10.0) t
          ....

      Problem solved? Maybe. It depends on your input PDF files. If they are as primitive and consistent as the sample, and stay that way for years to come, then yes. Otherwise much further work is required, as I said.

      The different Y-coordinates in the listings above are irrelevant; they just depend on which way the Y axis points. GS (and CAM::PDF) report the baseline position, while mutool gives a true per-glyph bbox. I don't think such precision is necessary: just step 1-2 units down from the baseline and add 1-2 units to the text height. That's good enough, and constant per span (line). (Not that we can't do a true glyph bbox in Perl; see Font::TTF, Font::FreeType.) "w" is the width in "unscaled text space", so multiply it by the text size. Both "w" and "h" need further adjustment if the general transformation matrix (cm) or the text matrix (Tm) specifies scaling other than 100%, or if horizontal scaling (Tz) is not 1.
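The arithmetic can be sketched roughly as follows. The helper names (flip_y, glyph_width, baseline_box) and the default padding of 2 units are made up for illustration, and the sketch assumes the simple case where neither cm nor Tm adds any scaling. The numbers are taken from the listings above: page height 595.238 from the mutool output, baseline 549.24 from the CAM::PDF dump.

```perl
use strict;
use warnings;

# Hypothetical helpers illustrating the arithmetic. Assumption: no extra
# scaling from cm / Tm, so text-space width times font size is the final width.

# Convert a bottom-up PDF y (as CAM::PDF reports it) to a top-down y
# (as mutool and Ghostscript report it), given the page height.
sub flip_y {
    my ( $y, $page_height ) = @_;
    return $page_height - $y;
}

# "w" from the renderer is in unscaled text space: multiply by the font size
# and by the horizontal scaling factor (1 meaning 100%).
sub glyph_width {
    my ( $w, $size, $hscale ) = @_;
    $hscale = 1 unless defined $hscale;
    return $w * $size * $hscale;
}

# Approximate box around a glyph from its baseline origin, in PDF coordinates
# (y up): step $pad units below the baseline, add $pad to the text height.
# Returns ( left, bottom, right, top ).
sub baseline_box {
    my ( $x, $y, $w, $size, $pad ) = @_;
    $pad = 2 unless defined $pad;
    my $bottom = $y - $pad;
    my $height = $size + $pad;
    return ( $x, $bottom, $x + glyph_width( $w, $size ), $bottom + $height );
}

# First glyph of the sample: "2" at (50.17, 549.24), w = 0.500, h = 10.
printf "top-down baseline y: %.2f\n", flip_y( 549.24, 595.238 );
printf "glyph width:         %.2f\n", glyph_width( 0.500, 10 );
printf "box (l, b, r, t):    %.2f %.2f %.2f %.2f\n",
    baseline_box( 50.17, 549.24, 0.500, 10 );
```

Flipping 549.24 against the page height lands on the y="46" that mutool and Ghostscript print, and 0.500 times size 10 gives the 5-unit advance between the first two characters in the mutool listing, so the three tools agree once the axis direction and the width scaling are accounted for.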

      Much nastier issues arise when texts are not encoded as "single-byte ASCII, US-centric". See this patch to get string widths of double-byte encoded fonts. This patch may be of interest, too. As for actual text extraction with non-ASCII and/or double-byte encodings, this patch does that, but it was applied in a different place for a different (current at the time) purpose. CAM::PDF::PageText is only interested in text; it's independent of (orthogonal to) the concept of "tree rendering", though it uses such a tree. The patch can be examined and snapped into the appropriate place in our renderer, if you really want it done in "pure Perl".
