Re: CAM::PDF extract text and their coordinates from pdf..

I've previously written a rendering class that does just that:

package PDF::ToText;

use 5.006;
use warnings;
use strict;
use CAM::PDF;
use CAM::PDF::GS;
use base qw(CAM::PDF::GS);

=head1 NAME

PDF::ToText - CAM::PDF renderer to extract PDF Text and position information

=head1 VERSION

Version 0.01

=cut

our $VERSION = '0.01';

=head1 SYNOPSIS

    use CAM::PDF;
    use PDF::ToText;
    my $pdf = CAM::PDF->new($filename);
    my $contentTree = $pdf->getPageContentTree(1);
    $contentTree->render("PDF::ToText");

=head1 SUBROUTINES/METHODS

=head2 renderText

=cut

sub _textToDevice {
    my $self = shift;

    my @t2u = $self->textToUser( @_ );
    my @t2d = $self->userToDevice( @t2u);

    return @t2d;
}

sub renderText {
   my $self = shift;
   my $string = shift;
   my $width = shift;

   # collect vertices of this text segment.

   my @bottom_left = $self->_textToDevice(0, 0);
   my @bottom_right = $self->_textToDevice($width, 0);
   my @top_left = $self->_textToDevice(0, $self->{Tfs});
   my @top_right = $self->_textToDevice($width, $self->{Tfs});

   printf "%7.2f %7.2f %7.2f %7.2f %s\n", @bottom_left, @top_right, $string;

   return;
}

1;

It's a drop-in replacement for CAM::PDF::PageText.

In its current state, it dumps text coordinates to STDOUT, but it can easily be amended to collect them in a global variable or whatever (CAM::PDF doesn't currently support the passing of handles).
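
A minimal sketch of that amendment, assuming a package variable is acceptable: it is meant to replace the printing renderText in the module above, and it relies on that module's _textToDevice helper. The name @fragments and the hash keys (x0, y0, x1, y1, text) are made up for illustration; they are not part of CAM::PDF.

```perl
package PDF::ToText;    # same package as the module above

use strict;
use warnings;

# Assumption: @fragments is a hypothetical package variable,
# not something CAM::PDF provides.
our @fragments;

# Replacement for the printing renderText(): store each text
# segment instead of writing it to STDOUT.
sub renderText {
    my ($self, $string, $width) = @_;

    # Same two corners that the original version prints.
    my @bottom_left = $self->_textToDevice(0, 0);
    my @top_right   = $self->_textToDevice($width, $self->{Tfs});

    push @fragments, {
        x0   => $bottom_left[0],
        y0   => $bottom_left[1],
        x1   => $top_right[0],
        y1   => $top_right[1],
        text => $string,
    };

    return;
}

1;
```

After $contentTree->render("PDF::ToText") returns, the segments should be available in @PDF::ToText::fragments. Because the array is global, it needs to be emptied before each page is rendered.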


Re^2: CAM::PDF extract text and their coordinates from pdf.. by umesh_epub (Novice) on Jan 10, 2013 at 05:39 UTC
Hi Snoopy, Thanks for your kind reply. How can I tell where a line starts and ends? What material should we study for working with PDFs? Thanks, Umesh
Re^3: CAM::PDF extract text and their coordinates from pdf.. by snoopy (Curate) on Jan 10, 2013 at 05:58 UTC
Hi Umesh, Yes, that's the same point that I got to. In practice, you end up with a lot of text fragments that need to be reassembled into words and lines. That is a fair bit of work and can involve some heuristics. Rather than continuing to develop the above, I personally went with pstotext from the Ghostscript suite; it has a `-bboxes` option to output text positions and does attempt to assemble words and lines. Despite its name, it will work on PDF files. Another program I looked at was pdfminer. One of these, or something similar, might work. It's just a matter of how good a job they do. - David
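
To give an idea of the kind of heuristic involved, here is a rough sketch of line reassembly. It is not what pstotext or pdfminer do. It assumes each fragment is a hash ref with hypothetical keys x0, y0, x1, y1 and text, that y grows upwards on the page, and that the y_tol and gap thresholds (both invented here, in device units) suit the document.

```perl
use strict;
use warnings;

# Hypothetical helper: group text fragments into lines.
#   y_tol - how far a baseline may differ and still count as the same line
#   gap   - horizontal gap above which a space is inserted between fragments
sub group_into_lines {
    my ($fragments, %opt) = @_;
    my $y_tol = defined $opt{y_tol} ? $opt{y_tol} : 2;
    my $gap   = defined $opt{gap}   ? $opt{gap}   : 1;

    # Walk the fragments from the top of the page down (assumes y grows
    # upwards), starting a new line whenever the baseline moves too far
    # from the first fragment of the current line.
    my @lines;
    for my $frag (sort { $b->{y0} <=> $a->{y0} } @{$fragments}) {
        if (@lines && abs($lines[-1]{y0} - $frag->{y0}) <= $y_tol) {
            push @{ $lines[-1]{frags} }, $frag;
        }
        else {
            push @lines, { y0 => $frag->{y0}, frags => [$frag] };
        }
    }

    # Within each line, order fragments left to right and join them,
    # adding a space wherever the horizontal gap is large enough.
    for my $line (@lines) {
        my @frags = sort { $a->{x0} <=> $b->{x0} } @{ $line->{frags} };
        my $text  = $frags[0]{text};
        for my $i (1 .. $#frags) {
            $text .= ' ' if $frags[$i]{x0} - $frags[$i - 1]{x1} > $gap;
            $text .= $frags[$i]{text};
        }
        $line->{frags} = \@frags;
        $line->{text}  = $text;
        $line->{x0}    = $frags[0]{x0};    # line start
        $line->{x1}    = $frags[-1]{x1};   # line end
    }

    return @lines;
}
```

Each returned line carries its start (x0) and end (x1) positions along with the joined text. Rotated text, multiple columns and superscripts would all need more than this.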
Re^4: CAM::PDF extract text and their coordinates from pdf.. by umesh_epub (Novice) on Jan 10, 2013 at 13:04 UTC
Thanks David, I will look at pdfminer and pstotext. I searched for pstotext in my Ghostscript installation ("GPL Ghostscript 8.70 (2009-07-31)"), but that command is not available. Which version of Ghostscript includes pstotext? Thanks, Umesh
Re^5: CAM::PDF extract text and their coordinates from pdf.. by snoopy (Curate) on Jan 10, 2013 at 23:19 UTC

