PerlMonks  


That's a different pdftohtml. Poppler was forked from Xpdf, so there are now popular PDF utilities with identical names but different abilities.

That pdftohtml doesn't report per-character coordinates, though, only the per-span (usually per-line) spatial extent. I may be mistaken, but presumably per-character coordinates are what you were after (the "my stuff" above), or you wouldn't go to such lengths just to extract text. BTW, "mudraw" was (thankfully) renamed a few years ago and is now invoked as

>mutool.exe draw -F stext document.pdf 2>nul
<?xml version="1.0"?>
<document name="(null)">
<page width="841.836" height="595.238">
<block bbox="50.172 35.47 805.83609 48.81">
<line bbox="50.172 35.47 129.62201 48.81" wmode="0" dir="1 0">
<font name="Times-Roman" size="10">
<char quad="50.172 35.47 55.172 35.47 50.172 48.81 55.172 48.81" x="50.172" y="46" c="2"/>
<char quad="55.172 35.47 57.952 35.47 55.172 48.81 57.952 48.81" x="55.172" y="46" c="/"/>
<char quad="57.952 35.47 62.952 35.47 57.952 48.81 62.952 48.81" x="57.952" y="46" c="2"/>
<char quad="62.952 35.47 67.951999 35.47 62.952 48.81 67.951999 48.81" x="62.952" y="46" c="7"/>
<char quad="67.951999 35.47 70.731998 35.47 67.951999 48.81 70.731998 48.81" x="67.951999" y="46" c="/"/>
<char quad="70.731998 35.47 75.731998 35.47 70.731998 48.81 75.731998 48.81" x="70.731998" y="46" c="2"/>
<char quad="75.731998 35.47 80.731998 35.47 75.731998 48.81 80.731998 48.81" x="75.731998" y="46" c="0"/>
<char quad="80.731998 35.47 85.731998 35.47 80.731998 48.81 85.731998 48.81" x="80.731998" y="46" c="2"/>
<char quad="85.731998 35.47 90.731998 35.47 85.731998 48.81 90.731998 48.81" x="85.731998" y="46" c="0"/>
<char quad="90.731998 35.47 93.232 35.47 90.731998 48.81 93.232 48.81" x="90.731998" y="46" c=" "/>
<char quad="93.232 35.47 98.232 35.47 93.232 48.81 98.232 48.81" x="93.232" y="46" c="3"/>
<char quad="98.232 35.47 101.012 35.47 98.232 48.81 101.012 48.81" x="98.232" y="46" c=":"/>
<char quad="101.012 35.47 106.012 35.47 101.012 48.81 106.012 48.81" x="101.012" y="46" c="0"/>
<char quad="106.012 35.47 111.012 35.47 106.012 48.81 111.012 48.81" x="106.012" y="46" c="7"/>
<char quad="111.012 35.47 113.512 35.47 111.012 48.81 113.512 48.81" x="111.012" y="46" c=" "/>
<char quad="113.512 35.47 120.732 35.47 113.512 48.81 120.732 48.81" x="113.512" y="46" c="A"/>
<char quad="120.732 35.47 129.62201 35.47 120.732 48.81 129.62201 48.81" x="120.732" y="46" c="M"/>
</font>
</line>
...

(Note that stderr output is suppressed here; otherwise the XML would be interspersed with "doc this, page that" messages.)
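If you want to consume that stext output from Perl, something along these lines might do. It is only a sketch: it assumes the attribute order and the four-corner quad layout seen in the listing above (other MuPDF versions may differ), it uses a regex where a real XML parser would be safer, and it does not decode XML entities in the c attribute. The name stext_chars is made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Extract per-character boxes from mutool "stext" XML.
# Assumption: <char> elements look exactly like those in the listing,
# i.e. quad="x y x y x y x y" followed by x, y (baseline) and c attributes.
sub stext_chars {
    my ($xml) = @_;
    my @chars;
    while ( $xml =~ m{<char\s+quad="([^"]+)"\s+x="[^"]+"\s+y="([^"]+)"\s+c="([^"]*)"\s*/>}g ) {
        my ( $quad, $baseline, $c ) = ( $1, $2, $3 );
        my @n  = split ' ', $quad;       # eight numbers: four x/y corner pairs
        my @xs = @n[ 0, 2, 4, 6 ];
        my @ys = @n[ 1, 3, 5, 7 ];
        push @chars, {
            c        => $c,
            baseline => $baseline,
            bbox     => [ min(@xs), min(@ys), max(@xs), max(@ys) ],
        };
    }
    return @chars;
}

# Two elements copied from the listing above.
my $sample = <<'XML';
<char quad="50.172 35.47 55.172 35.47 50.172 48.81 55.172 48.81" x="50.172" y="46" c="2"/>
<char quad="55.172 35.47 57.952 35.47 55.172 48.81 57.952 48.81" x="55.172" y="46" c="/"/>
XML

for my $ch ( stext_chars($sample) ) {
    printf "%s  [%s]  baseline y = %s\n",
        $ch->{c}, join( ', ', @{ $ch->{bbox} } ), $ch->{baseline};
}
```

For real use, feed it the captured stdout of the mutool command instead of $sample.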

The alternative is Ghostscript, of course:

>gswin64c -q -sDEVICE=txtwrite -dTextFormat=1 -o - document.pdf
<page>
<block>
<line>
<span bbox="50 46 130 46" font="Times-Roman" size="10.0000">
<char bbox="50 46 55 46" c="2"/>
<char bbox="55 46 58 46" c="/"/>
<char bbox="58 46 63 46" c="2"/>
<char bbox="63 46 68 46" c="7"/>
<char bbox="68 46 71 46" c="/"/>
<char bbox="71 46 76 46" c="2"/>
<char bbox="76 46 81 46" c="0"/>
<char bbox="81 46 86 46" c="2"/>
<char bbox="86 46 91 46" c="0"/>
<char bbox="91 46 93 46" c=" "/>
<char bbox="93 46 98 46" c="3"/>
<char bbox="98 46 101 46" c=":"/>
<char bbox="101 46 106 46" c="0"/>
<char bbox="106 46 111 46" c="7"/>
<char bbox="111 46 114 46" c=" "/>
<char bbox="114 46 121 46" c="A"/>
<char bbox="121 46 130 46" c="M"/>
</span>
</line>
</block>
...

(Note that this bbox is not really a box: both Y values are the baseline, so take "size" into account to get the height.)

###################

At best a pure Perl solution?

Oh yes, it's possible; see CAM::PDF. Chris laid a beautiful foundation, a huge amount of work. Some aspects are not really finished, though nothing is impossible with due diligence. Let's take the file from a recent PDF question, then:

use strict;
use warnings;
use CAM::PDF;

my $d = CAM::PDF-> new( 'document.pdf' );
my $t = $d-> getPageContentTree( 1 );
$t-> render( 'CAM::PDF::Renderer::Dump' );

__END__
( 50.17, 549.24) ( 50.17, 549.24) 2/27/2020
( 93.23, 549.24) ( 93.23, 549.24) 3:07
( 113.51, 549.24) ( 113.51, 549.24) AM
( 677.77, 549.24) ( 677.77, 549.24) Quotations
( 724.17, 549.24) ( 724.17, 549.24) Due
( 743.33, 549.24) ( 743.33, 549.24) By:
( 760.28, 549.24) ( 760.28, 549.24) 01/22/2020
( 288.40, 533.24) ( 288.40, 533.24) ABSTRA
( 344.41, 533.24) ( 344.41, 533.24) CT
( 367.36, 533.24) ( 367.36, 533.24) OF
( 390.30, 533.24) ( 390.30, 533.24) UNSTRAPPED
( 487.91, 533.24) ( 487.91, 533.24) (A
....

Something close to what you wanted? This "content tree" can be an enormous structure and can easily eat 100+ MB for a complex page. It follows the drawing instructions as they flow during content interpretation; each node has a "graphics state" attached, which is updated as it all proceeds. See the source for an approximate idea. Of course, "The PDF Reference" is the ultimate authority; you can't avoid it if you are serious about PDF.

CAM::PDF can take different "plugins" (renderers) to traverse (render) this tree. CAM::PDF::Renderer::Dump is a primitive example. Now, somewhat closer to the "per-character coordinates" goal:

MyTestRenderer.pm:

package MyTestRenderer;

use strict;
use warnings;
use base 'CAM::PDF::GS';

sub new {
    my ( $class, @args ) = @_;
    my $self = $class-> SUPER::new( @args );
    $self-> { mode } = 'c';     # split into characters
    return $self
}

sub renderText {
    my ( $self, $string, $width ) = @_;
    my $fontsize = $self-> { Tfs };
    my ( $xu, $yu ) = $self-> textToUser( 0, 0 );
    my ( $xd, $yd ) = $self-> userToDevice( $xu, $yu );
    printf "(x = %5.1f, y = %5.1f) (w = %.3f, h = %3.1f) %s\n",
        $xd, $yd, $width, $fontsize, $string;
    return;
}

1;

use strict;
use warnings;
use CAM::PDF;
use lib '.';

my $d = CAM::PDF-> new( 'document.pdf' );
my $t = $d-> getPageContentTree( 1 );
$t-> render( 'MyTestRenderer' );

__END__
(x = 50.2, y = 549.2) (w = 0.500, h = 10.0) 2
(x = 55.2, y = 549.2) (w = 0.278, h = 10.0) /
(x = 58.0, y = 549.2) (w = 0.500, h = 10.0) 2
(x = 63.0, y = 549.2) (w = 0.500, h = 10.0) 7
(x = 68.0, y = 549.2) (w = 0.278, h = 10.0) /
(x = 70.7, y = 549.2) (w = 0.500, h = 10.0) 2
(x = 75.7, y = 549.2) (w = 0.500, h = 10.0) 0
(x = 80.7, y = 549.2) (w = 0.500, h = 10.0) 2
(x = 85.7, y = 549.2) (w = 0.500, h = 10.0) 0
(x = 93.2, y = 549.2) (w = 0.500, h = 10.0) 3
(x = 98.2, y = 549.2) (w = 0.278, h = 10.0) :
(x = 101.0, y = 549.2) (w = 0.500, h = 10.0) 0
(x = 106.0, y = 549.2) (w = 0.500, h = 10.0) 7
(x = 113.5, y = 549.2) (w = 0.722, h = 10.0) A
(x = 120.7, y = 549.2) (w = 0.889, h = 10.0) M
(x = 677.8, y = 549.2) (w = 0.722, h = 10.0) Q
(x = 685.0, y = 549.2) (w = 0.500, h = 10.0) u
(x = 690.0, y = 549.2) (w = 0.500, h = 10.0) o
(x = 695.0, y = 549.2) (w = 0.278, h = 10.0) t
....

Problem solved? Maybe. It depends on your input PDF files. If they are as primitive and consistent as the sample, and stay that way for years to come, then yes. Otherwise, much further work is required, like I said.

The different Y-coordinates in the listings above are irrelevant; they only reflect the Y-axis direction (top-down versus bottom-up). GS and CAM::PDF report the baseline position, while mutool gives a true per-glyph bbox. I don't think such precision is necessary. Just step 1-2 units down from the baseline and add 1-2 units to the text height. Good enough, and constant per span (line). (Not that we can't do a true glyph bbox in Perl: see Font::TTF, Font::FreeType.) "w" is the width in "unscaled text space"; multiply it by the text size. Both "w" and "h" need further adjustment if the general transformation matrix (cm) or the text matrix (Tm) specifies scaling other than 100%, or if horizontal scaling (Tz) is not 1.
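To make the arithmetic concrete, here is a rough sketch of that "good enough" box. The function names approx_box and flip_y are made up for illustration, and the default pad of 2 units is just my pick from the "1-2 units" range. It takes the horizontal scaling as a fraction (1 = 100%) and deliberately ignores any extra scaling from cm or Tm, so treat it as valid only for simple pages like the sample.

```perl
use strict;
use warnings;

# Approximate box for one character from its baseline origin (bottom-up PDF
# coordinates), unscaled width $w and font size $size.
# Assumptions: $hscale is a fraction (1 = 100%); $pad is how far to step
# below the baseline and how much to add to the height; cm/Tm scaling is
# NOT taken into account.
sub approx_box {
    my ( $x, $y, $w, $size, $hscale, $pad ) = @_;
    $hscale //= 1;
    $pad    //= 2;
    my $width  = $w * $size * $hscale;    # unscaled text space -> user units
    my $bottom = $y - $pad;               # step down from the baseline
    my $top    = $bottom + $size + $pad;  # text height plus the same pad
    return ( $x, $bottom, $x + $width, $top );
}

# Convert a bottom-up Y (CAM::PDF) to a top-down Y (mutool, Ghostscript).
sub flip_y {
    my ( $page_height, $y ) = @_;
    return $page_height - $y;
}

# First character of the sample: x = 50.172, y = 549.24, w = 0.500, h = 10.
my @box = approx_box( 50.172, 549.24, 0.500, 10 );
printf "box: %.3f %.3f %.3f %.3f\n", @box;

# Page height 595.238 comes from the mutool listing.
printf "top-down baseline: %.1f\n", flip_y( 595.238, 549.24 );
```

With the sample numbers the right edge lands at about 55.172, which is where the next character starts in all three listings, and the flipped baseline comes out at about 46, matching the y values from mutool and Ghostscript.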

Much nastier issues arise when the text is not "single-byte ASCII, US-centric" encoded. See this patch to get string widths for double-byte encoded fonts. This patch may be of interest, too. As for actual text content extraction with non-ASCII and/or double-byte encodings, this patch does that, but it was applied in a different place for a different (current at the time) purpose. CAM::PDF::PageText is only interested in text; it's independent from (orthogonal to) the concept of "tree rendering", though it uses such a tree. The patch can be examined and snapped into the appropriate place in our renderer, if you really want it done in "pure Perl".


In reply to Re^3: PDF alternative to mudrow to get XML structure by vr
in thread PDF alternative to mudrow to get XML structure by Anonymous Monk
