I am taking a linguistics course in which we are working on projects to crawl the web looking for linguistics papers, extract from them blocks of text that look like this:
(1) Emine elma-yi ye-di.
Emine apple-ACC eat-PAST.3sg
`Emine ate the apple.'
and analyze them for the linguistic information that they contain. It is important to keep the text in three lines and, as much as possible, preserve the whitespace between words on each line.
Most of the linguistics papers that we find are PDF files, and the method we have been using so far is just to run a PDF-to-text converter that has a -layout option, and then extract the blocks of text that we are interested in from the converted papers. This works OK, except that the converter sometimes gets confused by accented Latin-1 characters, by non-Latin-1 characters, and probably by the internal structure of some PDF files. In those cases the output for some lines can be anything from slightly corrupted text to garbage.
Recently I found a PDF-to-HTML converter that does a much better job of preserving the layout of the PDF files and avoiding text corruption, but of course its output is HTML. The HTML specifies the location of each piece of text on a page with DIV tags that look like this:
<DIV style="position:absolute;top:217;left:216">
I was thinking of using a Perl HTML parser to write a little application to convert the HTML to plain text while preserving the layout of the blocks of text as much as possible, but if something like that already exists, I'd like to just use it instead. I have tried out several HTML-to-text converters, and none of the ones I tried pays attention to information like the style attribute of the DIV tag above. I realize that it is not possible to put a block of plain text exactly 217 units down and 216 units to the right of the edge of a page, but it would probably be sufficient to use that information to figure out which text should be above, below, or on the same line as other text, and, to some extent, by how much. If you know of an HTML-to-text converter that takes this kind of information into account, could you please let me know? Alternatively, if you know of a PDF-to-text converter that does a good job of preserving layout and handles non-ASCII text well, could you please let me know about that too? Thanks.
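To make the idea concrete, here is a rough sketch of the kind of conversion I have in mind. It is written in Python rather than Perl and is not taken from any existing tool. The names PositionedTextParser, layout and html_to_text are made up for illustration, and the char_width and line_height values are guesses at how many page units correspond to one character column and one text line. It also assumes that the positioned DIVs are not nested and that top and left are plain numbers, as in the example above.

```python
import re
from html.parser import HTMLParser

# Matches "top:217" or "left: 216" in a style attribute, but not "margin-top".
STYLE_RE = re.compile(r"(?<![\w-])(top|left)\s*:\s*(-?\d+(?:\.\d+)?)")


class PositionedTextParser(HTMLParser):
    """Collect (top, left, text) for every absolutely positioned DIV."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fragments = []  # list of (top, left, text)
        self._pos = None     # (top, left) of the DIV currently being read
        self._buf = []       # text pieces seen inside that DIV

    def handle_starttag(self, tag, attrs):
        if tag != "div":  # HTMLParser lowercases tag names
            return
        self._flush()
        style = dict(attrs).get("style") or ""
        coords = {key: float(val) for key, val in STYLE_RE.findall(style)}
        if "top" in coords and "left" in coords:
            self._pos = (coords["top"], coords["left"])

    def handle_endtag(self, tag):
        if tag == "div":
            self._flush()

    def handle_data(self, data):
        if self._pos is not None:
            self._buf.append(data)

    def _flush(self):
        if self._pos is not None:
            # Collapse ordinary whitespace as a browser would; keep nbsp as a space.
            text = re.sub(r"[ \t\r\n]+", " ", "".join(self._buf))
            text = text.replace("\xa0", " ").strip()
            if text:
                self.fragments.append((*self._pos, text))
        self._pos = None
        self._buf = []


def layout(fragments, char_width=6.0, line_height=12.0):
    """Place fragments on a character grid (both units are assumptions)."""
    if not fragments:
        return ""
    min_left = min(left for _, left, _ in fragments)
    rows = []  # each row is [top_of_first_fragment, [(left, text), ...]]
    for top, left, text in sorted(fragments):
        # Fragments whose tops differ by less than half a line share a row.
        if rows and top - rows[-1][0] < line_height / 2:
            rows[-1][1].append((left, text))
        else:
            rows.append([top, [(left, text)]])
    lines = []
    prev_top = None
    for top, items in rows:
        # A large vertical gap becomes a single blank line.
        if prev_top is not None and top - prev_top > 1.5 * line_height:
            lines.append("")
        line = ""
        for left, text in sorted(items):
            col = int((left - min_left) / char_width)
            if line and col <= len(line):
                col = len(line) + 1  # never let two fragments run together
            line = line.ljust(col) + text
        lines.append(line)
        prev_top = top
    return "\n".join(lines)


def html_to_text(html, **kwargs):
    parser = PositionedTextParser()
    parser.feed(html)
    parser.close()
    return layout(parser.fragments, **kwargs)
```

Calling html_to_text() on a page of converter output should give back plain text in which fragments with roughly the same top value end up on one line, indented according to their left value. The two scale factors would need to be tuned for each converter's output, and a proportional font in the PDF means that the column alignment can only ever be approximate.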