Part d summarizes the interpretation of clusters by annotating clusters with admixture and genealogical data.
Part e summarizes the genealogical data—birth location annotations in pedigrees (shaded symbols in d)—for the ‘African American’ cluster.
Informally, maximizing the modularity partitions the network so that a relatively large amount of IBD is contained within each of the partitions (Fig.).
In the top level of the hierarchical clustering, 99.9% of the IBD network (768,758 out of 769,444 vertices) was subdivided into only six clusters; of these, five each contained over 10,000 individuals.
The rest of the network was assigned to many small clusters (at most 101 members).
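As a rough illustration of what "maximizing the modularity" rewards, the sketch below computes the standard Newman modularity of a given partition of a small weighted graph. The toy graph, the function name `modularity`, and the two candidate partitions are hypothetical and are not taken from the study; the sketch only scores partitions and does not reproduce the clustering algorithm used here.

```python
from collections import defaultdict


def modularity(weighted_edges, membership):
    """Newman modularity Q of a partition of an undirected weighted graph.

    weighted_edges: iterable of (u, v, weight) tuples with u != v
    membership: dict mapping each vertex to its cluster label
    """
    total = 0.0                        # total edge weight m
    internal = defaultdict(float)      # edge weight inside each cluster
    degree_sum = defaultdict(float)    # summed weighted degree per cluster
    for u, v, w in weighted_edges:
        total += w
        degree_sum[membership[u]] += w
        degree_sum[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    # Q = sum over clusters of (internal share) - (expected share under random wiring)
    return sum(
        internal[c] / total - (degree_sum[c] / (2 * total)) ** 2
        for c in degree_sum
    )


# Hypothetical toy network: two triangles joined by a single edge.
edges = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
    (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
    (2, 3, 1.0),
]
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
mixed = {0: "a", 1: "a", 3: "a", 2: "b", 4: "b", 5: "b"}

q_by_triangle = modularity(edges, by_triangle)
q_mixed = modularity(edges, mixed)
```

In this toy case the partition that follows the two triangles keeps most of the edge weight inside clusters and scores higher than the mixed partition, which is the property the informal description refers to.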
Points on the map with higher odds ratios indicate geographic locations that are more associated with cluster membership.
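The odds ratio plotted for a map location can be read as a comparison of two odds: the odds that a genealogical birth-location annotation falls at that location for cluster members, against the same odds for non-members. The sketch below assumes a plain 2x2 contingency table; the function name, argument names, and counts are hypothetical, and the exact definition used for the maps is not given in this section.

```python
def odds_ratio(members_at_loc, members_elsewhere,
               nonmembers_at_loc, nonmembers_elsewhere):
    """Odds ratio for a 2x2 table of cluster membership by birth location.

    Values above 1 mean the location is more associated with cluster
    membership; values below 1 mean it is less associated.
    Raises ZeroDivisionError if members_elsewhere or nonmembers_at_loc is 0.
    """
    odds_members = members_at_loc / members_elsewhere
    odds_nonmembers = nonmembers_at_loc / nonmembers_elsewhere
    return odds_members / odds_nonmembers


# Hypothetical counts of annotated ancestors for one map location.
example_or = odds_ratio(
    members_at_loc=30, members_elsewhere=70,
    nonmembers_at_loc=10, nonmembers_elsewhere=90,
)
```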
These data are made available in the public domain (Creative Commons CC0).
We complemented this two-level clustering with a spectral dimensionality reduction technique for network data.
This yielded a low-dimension representation of the IBD network structure, analogous to PCA applied to genetic polymorphism data.
Since these small clusters were difficult to interpret and may correspond to subpopulations that have poor representation in our database, or to unusually over-represented families, we did not investigate them further.
To examine finer-scale population structure, we formed five sub-networks corresponding to the five largest clusters, then partitioned these sub-networks using the same clustering algorithm.
Our first indication that demography could be inferred from genomic sharing among present-day Americans was the relationship we observed between US geography and the projection of state-level IBD summary statistics onto their first two principal components (PCs); PC 1 is correlated with north-south geography, and PC 2 is correlated with east-west (Fig.).
Following this initial observation, we turned to using IBD to discover previously unidentified population structure.
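To make the idea of a spectral representation of a network concrete, the sketch below computes a single spectral coordinate per vertex for a small weighted graph. It is an assumption-laden illustration, not the method used in the study: it takes the second eigenvector of the normalized adjacency matrix, found by power iteration with the trivial top eigenvector projected out, and the toy graph and the name `spectral_coordinate` are hypothetical.

```python
import math


def spectral_coordinate(n, weighted_edges, iterations=500):
    """One spectral coordinate per vertex of a connected weighted graph.

    Power iteration on the lazy normalized adjacency
    M = (I + D^-1/2 A D^-1/2) / 2, whose eigenvalues lie in [0, 1],
    with the trivial top eigenvector (proportional to sqrt(degree))
    projected out, so the iteration converges to the second eigenvector.
    """
    adj = [[0.0] * n for _ in range(n)]
    for u, v, w in weighted_edges:
        adj[u][v] += w
        adj[v][u] += w
    deg = [sum(row) for row in adj]  # every degree must be positive

    def apply_m(x):
        out = []
        for i in range(n):
            s = sum(adj[i][j] * x[j] / math.sqrt(deg[i] * deg[j])
                    for j in range(n))
            out.append(0.5 * (x[i] + s))
        return out

    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]   # unit-length trivial eigenvector
    x = [float(i + 1) for i in range(n)]       # deterministic starting vector
    for _ in range(iterations):
        dot = sum(a * b for a, b in zip(x, top))
        x = [a - dot * b for a, b in zip(x, top)]   # remove trivial component
        x = apply_m(x)
        length = math.sqrt(sum(a * a for a in x))
        x = [a / length for a in x]
    return x


# Hypothetical toy network: two tight triangles joined by one weak edge.
toy_edges = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
    (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
    (2, 3, 0.1),
]
coords = spectral_coordinate(6, toy_edges)
```

In this toy case the two triangles receive coordinates of opposite sign, so weakly connected parts of a network end up far apart in the embedding. The overall sign of an eigenvector is arbitrary, so only relative positions are meaningful.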
We applied a weight function to each edge, setting the edge weight .
On the basis of this choice, 769,444 (99.3%) of the vertices (individuals) formed a completely connected network; the remaining 0.7% of samples could correspond to populations poorly represented in our sample, and they were not included in our subsequent analyses.
We detect densely connected clusters within the network and annotate these clusters using a database of over 20 million genealogical records.
Recent population patterns captured by IBD clustering include immigrants such as Scandinavians and French Canadians; groups with continental admixture such as Puerto Ricans; settlers such as the Amish and Appalachians who experienced geographic or cultural isolation; and broad historical trends, including reduced north-south gene flow.
We took a simple approach to infer population structure from the spectral dimensionality reduction by projecting all samples labelled by the hierarchical clustering onto this low-dimensional embedding, then using this data visualization to extract further clusters.
These clusters, which we refer to as ‘stable subsets,’ are subsets that project away from the origin in the spectral embedding, and represent unusually disconnected parts of the network (Fig.).
While the hierarchical clustering identifies network structure underlying systematic patterns of variation in IBD, including continuous variation (for example, due to isolation-by-distance), visualizing this structure via the spectral analysis allows for isolation of the more discontinuous components of variation in IBD that putatively reflect genetic sharing originating from discrete populations (see Supplementary Fig.).