Friday, June 10, 2022

Family Search and Python

FamilySearch has a huge, interlinked collection of family trees and genealogy records. I thought it would be fun to answer a few questions:

  • Where are my ancestors from?
  • How have family sizes changed over generations?
  • Have any lines grown large, only to end with no descendants?
  • How far back do you need to go before finding the same ancestor in different lines?
  • How many years is a generation?

The first step was to see what was already out there. The FamilySearch Solutions Gallery contains a number of third-party tools. Some look a little sketchy, but others seem fairly well done.

Map My Cousins shows an animated map of ancestors and migration.

There was another tool that would show a pie chart of your origins. The problem is that there are a lot of "gaps". (A big chunk of my genealogy was "unknown".) It does have the option to "guess". However, this just seems to assign the children's place to the parents, which leads to an overrepresentation of the USA.

Time to look at the API. Alas, FamilySearch has an API, but it is not easily accessible. They seem to focus on companies rather than allowing individual use. There is some mention of a development program that might work, though it seems they will only give you access to sample data at first. You then must have a company and pay $200 per app before getting real access.

So if I can't use the API, at least I could download my genealogy and then analyze it, right? Unfortunately, there is no download option. They suggest using a third-party tool, but don't recommend a specific one. I ended up using the free version of Ancestral Quest. It looks like a DOS program migrated to the Mac.

I could download up to 100 generations of data. However, it occasionally crashed, and certain lines seemed more likely to cause crashes. To get around it, I loaded 100 generations back from each grandparent, then narrowed down the problem line. At one point I got errors about a circular ancestry. Maybe that was what caused the crash? It was odd that it worked one time.

Ancestral Quest can export data as GEDCOM. I found a JavaScript parser, but it was broken, so I ended up going with a Python GEDCOM parser. Now for the fun of learning Python!
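To give a feel for what is inside a GEDCOM export, here is a rough sketch of a tiny reader written with only the standard library. It is not the parser library mentioned above, and it only understands a handful of tags (INDI, FAM, NAME, BIRT/DATE, FAMC, FAMS, HUSB, WIFE, CHIL). The function name `parse_gedcom`, the dictionary layout, and the sample data are all made up for illustration.

```python
import re

# One GEDCOM line: level, optional @XREF@ id, tag, optional value.
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?:\s+(.*))?$")


def parse_gedcom(text):
    """Return (individuals, families) dicts keyed by xref id (hypothetical layout)."""
    individuals, families = {}, {}
    record = None      # the level-0 record currently being filled in
    in_birth = False   # True while inside a level-1 BIRT block
    for raw in text.splitlines():
        match = LINE_RE.match(raw)
        if not match:
            continue
        level = int(match.group(1))
        xref, tag = match.group(2), match.group(3)
        value = (match.group(4) or "").strip()
        if level == 0:
            in_birth = False
            if tag == "INDI":
                record = individuals[xref] = {
                    "name": "", "birth_year": None, "famc": [], "fams": []}
            elif tag == "FAM":
                record = families[xref] = {"husb": None, "wife": None, "chil": []}
            else:
                record = None  # HEAD, TRLR and anything else is skipped
        elif record is None:
            continue
        elif level == 1:
            in_birth = tag == "BIRT"
            if tag == "NAME" and "name" in record and not record["name"]:
                record["name"] = value.replace("/", "").strip()
            elif tag in ("FAMC", "FAMS") and "famc" in record:
                record[tag.lower()].append(value)
            elif tag in ("HUSB", "WIFE") and "chil" in record:
                record[tag.lower()] = value
            elif tag == "CHIL" and "chil" in record:
                record["chil"].append(value)
        elif level == 2 and in_birth and tag == "DATE" and "birth_year" in record:
            # Dates are free-form ("12 MAR 1900", "ABT 950"); grab a 3-4 digit year.
            year = re.search(r"\b(\d{3,4})\b", value)
            if year:
                record["birth_year"] = int(year.group(1))
    return individuals, families


# Made-up sample data, not from a real tree.
SAMPLE = """0 HEAD
0 @I1@ INDI
1 NAME Ann /Example/
1 BIRT
2 DATE 12 MAR 1900
1 FAMC @F1@
0 @I2@ INDI
1 NAME Bob /Example/
1 BIRT
2 DATE ABT 1870
1 FAMS @F1@
0 @F1@ FAM
1 HUSB @I2@
1 CHIL @I1@
0 TRLR
"""

individuals, families = parse_gedcom(SAMPLE)
```

A real export has many more tags and edge cases, so a maintained parser library is the safer choice for actual work.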

My first exercise was to find lines that went back further than 100 generations. There was a method to get ancestors. However, it didn't seem to work. I had to get the hang of which methods are called on the parser object versus on individual elements. It took a little getting used to. I eventually did a basic tree iteration to find the "endpoints" that could go back further.
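The tree iteration can be sketched roughly like this. It assumes the parsed data has already been boiled down to a plain dictionary mapping each person to a list of their parents; that `parents` layout and the name `find_endpoints` are assumptions for the example, not the parser library's API. Endpoints are people with no recorded parents, and each one is reported with the deepest generation at which it was reached.

```python
def find_endpoints(parents, root, min_depth=0, max_depth=500):
    """Return {endpoint: deepest generation} for ancestors with no known parents.

    parents: dict mapping a person id to a list of parent ids (hypothetical layout).
    max_depth stops the walk if the data contains a circular ancestry.
    """
    deepest = {root: 0}
    stack = [(root, 0)]
    while stack:
        person, depth = stack.pop()
        if depth >= max_depth:
            continue
        for parent in parents.get(person, []):
            # Only walk on if this is a deeper route to the parent than any seen so far.
            if deepest.get(parent, -1) < depth + 1:
                deepest[parent] = depth + 1
                stack.append((parent, depth + 1))
    return {
        person: depth
        for person, depth in deepest.items()
        if not parents.get(person) and depth >= min_depth
    }


# Made-up example: "x" is reachable through both parents at different depths.
example_parents = {
    "me": ["dad", "mom"],
    "dad": ["x"],
    "mom": ["gm"],
    "gm": ["x"],
}
endpoints = find_endpoints(example_parents, "me")
```

Calling it with `min_depth=100` would list only the endpoints sitting at the download cutoff, which are the ones worth loading further back.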

The final results had some lineages tracing back about 200 generations. Things got sketchy as they went further back. I noticed one line went from Vikings to biblical Jewish ancestry. Some long lines went through to ancient Egypt and Mesopotamia.

For the next step, I wanted to try to go forward from a few generations back. Getting the data loaded was a much more painful process. I tried to go 11 generations back and work forward from there. I first started by getting everybody. That was overkill. Ancestral Quest does dedup, but it explores everything first. I realized I could shortcut it by just doing the husbands. Due in part to maternal mortality, a man would be more likely to have multiple spouses. Other than that, husband and wife would both have the same descendants. It still was painfully long. I did notice that some lines were very complete. These would take hours to download the descendancy from a single ancestor 11 generations back. (And return almost 100k people.) Others seemed to be a single tentacle that reached back. One line crashed while downloading. Ugh.

Eleven generations is a painful amount to get. Because you don't have access to other living people, the more recent generations are sparse. Going further back, records are missing, which makes those generations sparse too.

Now on to coding. First I parsed the file and then traversed the tree starting at the root to find the furthest-back end nodes. Then from each of those I gathered the families and then the children for each generation. I needed to keep a set of the people I had already seen to avoid getting stuck in loops. It seems there are some duplicates of the same person in different places, which makes them hard to distinguish from "real" common ancestors.
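The forward pass might look something like the sketch below. It assumes the families have already been flattened into a plain dictionary mapping each person to a list of their children; that `children` layout and the function name are invented for the example. The `seen` set is what keeps duplicate links and circular data from looping forever.

```python
def descendants_by_generation(children, start):
    """Return a list of generations, each a list of newly found descendants.

    children: dict mapping a person id to a list of child ids (hypothetical layout).
    """
    seen = {start}          # everyone already counted, so nobody is visited twice
    generations = []
    current = [start]
    while current:
        next_generation = []
        for person in current:
            for child in children.get(person, []):
                if child not in seen:
                    seen.add(child)
                    next_generation.append(child)
        if next_generation:
            generations.append(next_generation)
        current = next_generation
    return generations


# Made-up example with a shared grandchild and a bad link back to the start.
example_children = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["a"],
}
generations = descendants_by_generation(example_children, "a")
sizes = [len(generation) for generation in generations]
```

The per-generation sizes are one simple way to see whether a line grew, shrank, or died out.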

The average years per generation seemed to be right around 30 years.
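One way to arrive at a number like that is to average the birth-year gap between each parent and child. The sketch below assumes two plain dictionaries, `parents` and `birth_year`, which are made-up layouts for illustration. The `min_gap` and `max_gap` cutoffs are also an assumption, added to keep obviously bad records from skewing the average; the original calculation may not have filtered at all.

```python
from statistics import mean


def years_per_generation(parents, birth_year, min_gap=12, max_gap=70):
    """Average parent-to-child birth-year gap, or None if no usable pairs.

    parents: dict mapping a person id to a list of parent ids (hypothetical layout).
    birth_year: dict mapping a person id to a birth year, missing or None if unknown.
    """
    gaps = []
    for child, folks in parents.items():
        for parent in folks:
            child_year = birth_year.get(child)
            parent_year = birth_year.get(parent)
            if child_year is None or parent_year is None:
                continue
            gap = child_year - parent_year
            # Skip implausible gaps, which usually point at bad data.
            if min_gap <= gap <= max_gap:
                gaps.append(gap)
    return mean(gaps) if gaps else None


# Made-up example; "z" is a bogus parent born after the child.
example_parents = {"c": ["f", "m"], "f": ["g"], "g": ["z"]}
example_years = {"c": 1960, "f": 1930, "m": 1930, "g": 1900, "z": 1990}
average_gap = years_per_generation(example_parents, example_years)
```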

Same Ancestor in different lines

This seemed fairly straightforward. Just store the path to each ancestor that we find. If we find the same person twice, compare the paths to see how we got there.
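A rough sketch of that idea follows, again assuming a plain `parents` dictionary (a made-up layout, not the parser's objects). Every ancestor of a repeated ancestor is also reached more than once, so the sketch only reports the people where two lines actually meet, meaning they were reached through at least two different children. It walks every path, so it can get slow on trees with a lot of overlap.

```python
def find_common_ancestors(parents, root, max_depth=200):
    """Return {ancestor: [paths]} for ancestors where different lines converge.

    parents: dict mapping a person id to a list of parent ids (hypothetical layout).
    Each path is a tuple of ids from root up to the ancestor.
    """
    paths = {}
    stack = [(root, (root,))]
    while stack:
        person, path = stack.pop()
        for parent in parents.get(person, []):
            # Skip circular ancestry and runaway depth.
            if parent in path or len(path) > max_depth:
                continue
            new_path = path + (parent,)
            paths.setdefault(parent, []).append(new_path)
            stack.append((parent, new_path))
    # Keep only ancestors reached through more than one distinct child.
    return {
        ancestor: found
        for ancestor, found in paths.items()
        if len({path[-2] for path in found}) > 1
    }


# Made-up example: both grandparents share the parent "x".
example_parents = {
    "me": ["dad", "mom"],
    "dad": ["gp1"],
    "mom": ["gp2"],
    "gp1": ["x"],
    "gp2": ["x"],
    "x": ["y"],
}
common = find_common_ancestors(example_parents, "me")
```

Comparing the lengths of the stored paths shows how many generations each line took to reach the shared ancestor.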

Things got funky when I started looking at some lines that went way back. Two lines had a common ancestor. However, one had about 16 generations in between, while the other had more than 50. The birth years brought out the funkiness. The line with more generations went back to the 1400s, then had a few people without birth years, followed by some in more modern times. Looks like bad data.
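This kind of problem can be flagged automatically. Walking a path from descendant to ancestor, birth years should keep getting earlier, so any step where they don't is suspect. The sketch below is one possible check, with a made-up `birth_year` dictionary and function name; people without a birth year are skipped rather than treated as errors.

```python
def suspicious_steps(path, birth_year):
    """Return (descendant, ancestor) pairs where the ancestor is not born earlier.

    path: sequence of person ids ordered from descendant to ancestor.
    birth_year: dict mapping a person id to a birth year, missing or None if unknown.
    """
    problems = []
    last_person, last_year = None, None
    for person in path:
        year = birth_year.get(person)
        if year is None:
            continue  # no date to compare, keep the last known one
        if last_year is not None and year >= last_year:
            problems.append((last_person, person))
        last_person, last_year = person, year
    return problems


# Made-up example: the line jumps from the 1400s to the 1800s past an undated person.
example_path = ["a", "b", "c", "d", "e"]
example_years = {"a": 1500, "b": 1450, "c": None, "d": 1820, "e": 1790}
flagged = suspicious_steps(example_path, example_years)
```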

Some of the other cases were possibly also bad data. It looked like the same name with a few different birth years.

There were a few "real" common ancestors. Going back a few centuries, there were some small-town residents that had a common ancestor only a few generations back. There were also a few cases where different lines converged on common ancestors about 15 generations back.
