PerlMonks  

Mechanize Firefox text Method

by halweitz (Novice)
on May 04, 2013 at 18:16 UTC (#1032067, perlquestion)
halweitz has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. I am using WWW::Mechanize::Firefox to sequentially get documents from a list of URLs. Within each document I search for a specific string. If it is found, I skip to the next URL. If it is not found, I print the document and then get the next URL in the list. I use the

$t = $mech->text;

method to extract the document text to search. Here's the problem: some of the documents are PDFs. As of release 19 Firefox has a built-in PDF viewer. Although Mechanize says ->text only works for HTML, when I use the

print $mech->content_type;

method it returns the value "text/HTML" for the PDF document. I was surprised that $mech->text returned anything but it did. However, the $mech->text method returned only the first two pages of the PDF and my search string could be anywhere in the document.

Is there some other way to get the content of a PDF? Can I pass the $mech object to another PDF module (most want to read the PDF from a file)? Can I tweak the text method (I really do not have the skill to do this anyway)? Should I save the PDF to a file first and use, say, CAM::PDF to read the PDF page by page?

I appreciate any/all help. Thanks for taking the time to read this.

Re: Mechanize Firefox text Method
by afoken (Parson) on May 04, 2013 at 19:17 UTC
    As of release 19 Firefox has a built-in PDF viewer

    Technically, Firefox uses a lot of JavaScript to convert the PDF document to a similar-looking HTML document, which is then rendered by Firefox.

    when I use the print $mech->content_type; method it returns the value "text/HTML" for the PDF document

    That is a consequence of converting the PDF document to an HTML document.

    If you want pre-19 behaviour, disable the PDF converter in Firefox.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Thanks for the reply. Actually, I want the post-release-19 behavior, because it allows the ->text method to return the text of the PDF, but it does not return all the text. Therein lies my problem. I tried to set the viewer to Adobe Reader, but in that case I lose script control of the document.

        PDF does not always contain text. I've seen lots of PDF files that were composed of images (scanned texts, no OCR involved). So getting no text or much less text than expected is not always a problem in your code.

        PDF is a "PostScript print job on steroids". PDF is basically PostScript, with lots of add-ons that aren't really relevant for your problem. PostScript describes how to print a page. Most times, it works roughly in reading order, but neither PostScript nor PDF has a problem with a print job that first emits all "A"s, then all "B"s, then all "C"s, and so on. That inflates the print job and makes it really hard to extract the original text, and there seems to be software written for exactly this purpose, i.e. to make text extraction hard.

        I think a much cleaner way is to determine the URL of the PDF file (using Mechanize), download the PDF file (using LWP or Mechanize), and process the PDF file using tools like pdftotext.
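
        A minimal sketch of that approach, assuming pdftotext is on the PATH and LWP::UserAgent is installed. The helper names normalize_text, text_contains and pdf_text_from_url are made up for illustration and are not part of any module. Whitespace is collapsed before searching because a phrase may be broken across lines in the extracted text:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Collapse runs of whitespace (including newlines) into single spaces.
sub normalize_text {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return $text;
}

# True if the whitespace-normalized text contains the literal search string.
sub text_contains {
    my ($text, $needle) = @_;
    return index(normalize_text($text), normalize_text($needle)) >= 0;
}

# Download $url to a temporary file and return whatever pdftotext extracts.
# (Hypothetical helper; assumes the URL points directly at the PDF.)
sub pdf_text_from_url {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded here so the helpers above work without it

    my ($fh, $file) = tempfile(SUFFIX => '.pdf', UNLINK => 1);
    close $fh;

    # ':content_file' makes LWP write the response body straight to $file.
    my $res = LWP::UserAgent->new->get($url, ':content_file' => $file);
    die "Download of $url failed: ", $res->status_line, "\n"
        unless $res->is_success;

    # "-" as the output name makes pdftotext write to standard output.
    open my $out, '-|', 'pdftotext', $file, '-'
        or die "Cannot run pdftotext: $!\n";
    local $/;                  # slurp mode: read all output at once
    my $text = <$out>;
    close $out or die "pdftotext exited with an error\n";
    return $text;
}

# Run only when invoked as a script: perl script.pl URL "search string"
if (!caller && @ARGV == 2) {
    my ($url, $needle) = @ARGV;
    my $text = pdf_text_from_url($url);
    print text_contains($text, $needle) ? "found\n" : "not found\n";
}
```

        In the loop from the original question, the URL would come from the list being processed, and the found/not-found result would decide whether to print the document. This only helps for PDFs that actually contain text.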

        Note that you still need some OCR software for scanned images; pdftotext only extracts text that is already present in the PDF file.

        Update: There are several commercial OCR programs that can take PDF files (including those composed of scanned images) as input and deliver text or Word documents.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
