Re: Going from PDF to GEDCOM

by Anonymous Monk
in reply to Going from PDF to GEDCOM

First idea: run strings, count the number of occurrences of each word, and you'll get the most common words (burial/in/on/he/she/they/died/born/married).
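A rough sketch of the counting step, assuming the text has already been pulled out of the PDF into a string. The function name `common_words` and the sample text are made up for illustration, and the word pattern is only a guess at what counts as a word in the real file.

```python
import re
from collections import Counter

def common_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase first so "He" and "he" are counted together.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

# Hypothetical sample, not taken from the actual PDF.
sample = "He was born in 1850. She was born in 1852. He died in 1900."
print(common_words(sample, 5))
```

Words that show up near the top and carry meaning (born, died, married) are the candidates to split on later.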

To get sentences, slurp a page and split on a period not followed by a comma (or other punctuation).
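Something like the following might do for the splitting, using a negative lookahead so that a period directly followed by more punctuation (as in "Mass.,") is left alone. The name `split_sentences` and the sample page are hypothetical, and the set of punctuation in the lookahead is an assumption to tune against the real text.

```python
import re

def split_sentences(page):
    """Split page text on periods not directly followed by , ; : or another period."""
    parts = re.split(r"\.(?![,;:.])\s*", page)
    # Drop the empty piece left after the final period.
    return [p.strip() for p in parts if p.strip()]

# Hypothetical sample, not taken from the actual PDF.
page = "John was born in 1850 in Salem, Mass., and died in 1900. He married Jane."
for sentence in split_sentences(page):
    print(sentence)
```

This will also split on the period after an entry number ("1. John Smith") and inside decimals, so numbered entries are probably best peeled off before this step.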

Then split into parts based on the common words and do something with them.

But I've no idea how a sentence (or a bunch of them) translates into GEDCOM calls.

How did you generate the sentences in the first place? Reverse that process.

Re^2: Going from PDF to GEDCOM
by jedikaiti (Hermit) on Nov 08, 2010 at 16:42 UTC

    Thanks! The getting-into-PDF process was actually automated by Family Tree Maker software, into which the GEDCOM file will be imported (once I find and reinstall it - STILL unpacking from moving in July!).

    Yes, I think I can start by separating individuals by looking for lines that begin with a number, a period, and a space. Well, most. Spouses may need to be identified using the common words method.
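A minimal sketch of that first separation step, assuming the page text is available as a list of lines and that every individual entry starts with digits, a period, and a single space. The names `ENTRY_START` and `split_individuals` and the sample lines are invented for illustration; spouse lines that lack a number would just get attached to the preceding entry here.

```python
import re

# A line that opens a new individual: number, period, space, then the text.
ENTRY_START = re.compile(r"^(\d+)\. (.*)$")

def split_individuals(lines):
    """Group lines into (number, text) entries, starting a new entry at each 'N. ' line."""
    entries = []
    for line in lines:
        m = ENTRY_START.match(line)
        if m:
            entries.append([int(m.group(1)), m.group(2)])
        elif entries and line.strip():
            # Continuation line: append it to the current entry.
            entries[-1][1] += " " + line.strip()
    return [tuple(e) for e in entries]

# Hypothetical sample lines, not taken from the actual PDF.
lines = [
    "1. John Smith was born in 1850.",
    "He died in 1900.",
    "2. Mary Smith was born in 1875.",
]
for number, text in split_individuals(lines):
    print(number, text)
```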

    Gave a little more thought to it late last night, and most people in this file are listed twice - as the children of their parents (short listing) and the more detailed individual listings. So I also need to match those up, and account for possible same names. I think I can do that by getting the basic info (name, DOB, DOD) from the short listing, then for the individual listings merging any records for whom those 3 details match, and treating anyone else as a new individual.
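The matching idea could look roughly like this, assuming each listing has already been parsed into a dict. The field names `name`, `dob`, and `dod`, the function `merge_records`, and the sample data are all hypothetical. Records are keyed on the three details together, so two people with the same name but different dates stay separate.

```python
def merge_records(short_listings, detailed_listings):
    """Merge detailed records into short ones when name, DOB and DOD all match."""
    people = {}
    for rec in short_listings:
        key = (rec["name"], rec["dob"], rec["dod"])
        people[key] = dict(rec)
    for rec in detailed_listings:
        key = (rec["name"], rec["dob"], rec["dod"])
        # Matching key: add the extra details. No match: treat as a new individual.
        people.setdefault(key, {}).update(rec)
    return list(people.values())

# Hypothetical sample data, not taken from the actual file.
short = [{"name": "John Smith", "dob": "1850", "dod": "1900"}]
detailed = [
    {"name": "John Smith", "dob": "1850", "dod": "1900", "spouse": "Jane Doe"},
    {"name": "John Smith", "dob": "1878", "dod": "1940"},
]
for person in merge_records(short, detailed):
    print(person)
```

Two different people who share all three details would still be collapsed into one, so those cases would need a manual check or an extra field such as the parents' names.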

    Thanks again!
