<?xml version="1.0" encoding="windows-1252"?>
<node id="586040" title="Re: PDF Modules Seeking Recommendations" created="2006-11-25 14:24:28" updated="2006-11-25 09:24:28">
<type id="11">
note</type>
<author id="81749">
toma</author>
<data>
<field name="doctext">
I have used another non-module approach: [http://pdftohtml.sourceforge.net]. It translates PDF to XML or HTML. The XML it produces isn't valid, but it is not difficult to fix. Like the other tools mentioned, it is based on xpdf.&lt;p&gt;

I like this approach because it gives me a list of text strings, each with its bounding box coordinates, which I then sort by location. This is important for me because the documents that I parse tend to have an irregular 'document order.'&lt;P&gt;
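
Sorting by location can be sketched roughly as follows. This is only an illustration in Python, not the code I actually use: it assumes each text box has already been pulled out of the XML and reduced to left/top/width/height plus its string, and the names TextBox, sort_by_location and row_tolerance are made up for the example. Boxes whose top coordinates differ by no more than row_tolerance are treated as one row, and each row is then read left to right.&lt;p&gt;

```python
from typing import NamedTuple


class TextBox(NamedTuple):
    # Hypothetical record for one text box; the field names are assumptions.
    left: int
    top: int
    width: int
    height: int
    text: str


def sort_by_location(boxes, row_tolerance=3):
    """Return boxes in top-to-bottom, left-to-right reading order."""
    rows = []  # each row is a list of boxes sharing roughly the same top
    for box in sorted(boxes, key=lambda b: b.top):
        # Start a new row when this box sits clearly below the current row.
        if not rows or box.top - rows[-1][0].top > row_tolerance:
            rows.append([box])
        else:
            rows[-1].append(box)
    # Within each row, order boxes by their left edge.
    return [box for row in rows for box in sorted(row, key=lambda b: b.left)]


# Made-up sample data in an irregular document order.
boxes = [
    TextBox(300, 101, 80, 12, "right column"),
    TextBox(50, 140, 120, 12, "second line"),
    TextBox(50, 100, 90, 12, "left column"),
]
ordered = [b.text for b in sort_by_location(boxes)]
print(ordered)
```

The tolerance is there because boxes on the same visual line often differ by a pixel or two in their top coordinate; the right value depends on the document and would need tuning.&lt;p&gt;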

I have also found PDF tips and tricks on the mostly commercial [http://www.pdfzone.com] site.

&lt;!-- Node text goes above. Div tags should contain sig only --&gt;
&lt;div class="pmsig"&gt;&lt;div class="pmsig-81749"&gt;
&lt;div class="pmsig"&gt;&lt;I&gt;It should work perfectly the first time! - toma&lt;/I&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/div&gt;</field>
<field name="root_node">
585782</field>
<field name="parent_node">
585782</field>
</data>
</node>
