# Social capital I: measurement and associations with economic mobility

### Sample construction

This section describes the methods used to generate the data analysed in this paper. A server-side analysis script was designed to automatically process the raw data, strip the data of personal identifiers, and generate aggregate results, which we analysed to produce the conclusions in this paper. The script then promptly deleted the raw data generated for this project (see the ‘Privacy and ethics’ section).

We work with privacy-protected data from Facebook. Survey data show that more than 69% of the US adult population used Facebook in 2019, and about three-quarters of those individuals did so every day37. The same survey also found that Facebook usage rates are similar across income groups, education levels and racial groups, as well as among urban, rural and suburban residents; they are lower among older adults and slightly higher among women than men.

Starting from the raw Facebook data as of 28 May 2022, our primary analysis sample was constructed by limiting the data to users aged between 25 and 44 years who reside in the United States, were active on the Facebook platform at least once in the previous 30 days, had at least 100 US-based Facebook friends and had a ZIP code. Our final analysis sample consists of 72.2 million Facebook users who constitute 84% of the US population between ages 25 and 44 years (based on a comparison to the 2014–2018 American Community Survey (ACS)). We focus on the 25–44-year age range because previous work37 has documented that its Facebook usage rate is above 80%, higher than for other age groups. In addition, the ACS publicly releases demographic data for certain age groups, one of which is ages 25–44 years, which enables us to compare our sample with the full population as well as to use ACS aggregates to predict SES (‘Variable definitions’).
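The inclusion criteria above amount to a simple filter. The following is a minimal sketch, not the authors' pipeline: the `User` record and its field names are hypothetical, and "active at least once in the previous 30 days" is assumed to mean `days_since_active <= 30`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    # Hypothetical record layout, for illustration only.
    age: int
    country: str
    days_since_active: int
    us_friend_count: int
    zip_code: Optional[str]


def in_analysis_sample(u: User) -> bool:
    """Apply the five inclusion criteria listed in the text."""
    return (
        25 <= u.age <= 44
        and u.country == "US"
        and u.days_since_active <= 30
        and u.us_friend_count >= 100
        and u.zip_code is not None
    )


users = [
    User(30, "US", 2, 250, "02138"),
    User(50, "US", 1, 400, "10001"),  # outside the age range
    User(28, "US", 5, 40, "94103"),   # fewer than 100 US-based friends
    User(35, "US", 3, 120, None),     # no ZIP code
]
sample = [u for u in users if in_analysis_sample(u)]
```

Only the first toy user satisfies all five criteria.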

We do not link any external individual-level information to the Facebook data. However, we use various publicly available sources of aggregate statistics to supplement our analysis, including data on median incomes by block group from the 2014–2018 ACS, data on economic mobility by Census tract and county from the Opportunity Atlas72, and measures of county-level and ZIP-level characteristics, such as the share of the population by race and ethnicity and the share of single parents, from the ACS and the Census. We describe those data in detail in Supplementary Information A.5.

### Variable definitions

We construct the following sets of variables for each person in our analysis sample. We measured these variables on 28 May 2022.

The data contain information on all friendship links between Facebook users. We focus only on friendships within our analysis sample; that is, we exclude friendships with people aged below 25 years or above 44 years, people who live outside the United States or people who do not satisfy one of our other criteria for inclusion in the analysis sample.

Facebook friendship links need to be confirmed by both parties, and most Facebook friendship links are between individuals who have interacted in person85. The Facebook friendship network can therefore be interpreted as providing data on people’s real-world friends and acquaintances rather than purely online connections. Because individuals tend to have many more friends on Facebook than they interact with regularly, we also verify that our results hold when focusing on an individual’s ten closest friends, where closeness is measured on the basis of the frequency of public interactions such as likes, tags, wall posts and comments.

#### Locations

Following prior work86, we use location data to construct statistics at various geographical levels. Every individual is assigned a residential ZIP code and county based on information and activity on Facebook, including the city reported on Facebook profiles as well as device and connection information. Formally, we use 2010 Census ZIP code tabulation areas (ZCTAs) to perform all geographical analyses of ZIP-code-level data. We refer to these ZCTAs as ZIP codes for simplicity. According to the 2014–2018 ACS, there are 219,214 Census block groups, 32,799 ZIP codes and 3,220 counties, with average populations of 1,488, 9,948 and 101,332 in each respective geographical designation.

#### Socioeconomic status

We construct a model that generates a composite measure of socioeconomic status (SES) for working-age adults (individuals between the ages of 25 and 64 years) that combines various characteristics. We construct our baseline SES measure in three steps, which are described in greater detail in Supplementary Information B.1.

First, for Facebook users who have location history (LH) settings enabled, we use the ACS to collect the median household income in their Census block group. LH is an opt-in setting for Facebook accounts that allows the collection and storage of location signals provided by a device’s operating system while the app is running. We observe Census block groups from individuals in the LH subsample. By contrast, we can only assign ZIP codes to individuals who do not have LH enabled. If an individual subsequently opts out of LH, their previously stored location signals are not retained.

Second, we estimate a gradient-boosted regression tree to predict these median household incomes using variables observed for all individuals in our sample, such as age, sex, language, relationship status, location information (ZIP code), college, donations, phone model price and mobile carrier, usage of Facebook on the Internet (rather than a mobile device), and other variables related to Facebook usage listed in Supplementary Table 4. We use this model to generate SES predictions for all individuals in our sample.

Finally, individuals (including the LH users in the training sample) are assigned percentile ranks in the national SES distribution on the basis of their predicted SES relative to others in the same birth cohort.
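The final ranking step can be illustrated as follows. This is a sketch under stated assumptions: the input tuples are hypothetical, ties are broken arbitrarily by identifier, and the rank convention (position divided by cohort size) is one of several reasonable choices rather than the one documented in the Supplementary Information.

```python
from collections import defaultdict


def cohort_percentile_ranks(people):
    """people: list of (person_id, birth_year, predicted_ses).

    Returns {person_id: percentile rank in (0, 100]}, where each person is
    ranked only against others born in the same year.
    """
    by_cohort = defaultdict(list)
    for pid, year, ses in people:
        by_cohort[year].append((ses, pid))

    ranks = {}
    for members in by_cohort.values():
        members.sort()  # ascending predicted SES; ties broken by person_id
        n = len(members)
        for position, (_, pid) in enumerate(members, start=1):
            ranks[pid] = 100.0 * position / n
    return ranks


people = [
    ("a", 1985, 40_000),
    ("b", 1985, 60_000),
    ("c", 1990, 30_000),
    ("d", 1985, 50_000),
]
ranks = cohort_percentile_ranks(people)
```

Person `c` has the lowest predicted SES overall but the top rank, because `c` is the only member of the 1990 cohort.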

We do not use any information from an individual’s friends to predict their SES. This ensures that errors in the SES predictions are not correlated across friends; such correlation would bias our estimates of homophily by SES. We also do not use direct information on individuals’ incomes or wealth, as we do not observe these variables at the individual level in our data. However, we show below that our measures of SES are highly correlated with external measures of income across subgroups.

The algorithm described above is one of many potential ways of combining a set of underlying proxies for SES into a single measure. To verify that our findings are not sensitive to the specific variables or algorithm used to predict SES, we show that our results are similar when we use a simple unweighted average of z-scores of the underlying proxies or when we directly use ZIP code median household incomes for all users, eschewing the prediction model and other proxies entirely (Supplementary Table 5).
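The first robustness check, an unweighted average of z-scores, can be sketched as below. The proxy names and values are hypothetical, and the sketch assumes every proxy has non-zero variance.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of numbers to mean 0 and (population) s.d. 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def unweighted_ses_index(proxies):
    """proxies: dict mapping proxy name -> list of values, one per person.

    Returns one index value per person: the simple mean of that person's
    z-scores across all proxies.
    """
    standardized = [zscores(vals) for vals in proxies.values()]
    return [mean(person) for person in zip(*standardized)]


# Hypothetical proxies for three people.
proxies = {
    "zip_median_income": [40, 60, 80],
    "phone_price": [200, 400, 600],
}
index = unweighted_ses_index(proxies)
```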

#### Parental SES

We link individuals in our primary analysis sample to their parents (who may not be in the analysis sample themselves) to construct measures of family SES during childhood. To link individuals to their parents, we use self-reported familial ties, a hash of user last names, and public user-generated wall posts and major life events (see Supplementary Information A.2 for details). We then use the SES of parents, constructed using the algorithm described above, to assign parental SES to individuals. Finally, we assign individuals a parental SES rank on the basis of their predicted parental SES, ranking individuals on the basis of parental SES relative to others in the same birth cohort. We are able to assign parental SES ranks for 31% of the individuals in our primary analysis sample.

#### High school friendships

To identify friendships made in high school, we first use self-reports to assign individuals to schools. For people who do not report a high school, we use data on their friendship networks to impute their high school (see Supplementary Information A.3 for details). For the 3.3% of users who report multiple high schools, we select the school in which the user has the largest number of friends. This process produces information on high schools for 74.9% of individuals in our analysis sample. Finally, if an individual and one of their friends attended the same high school within three cohorts of each other, we identify them as high school friends.
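The two rules in this paragraph (choosing among multiple reported schools, and classifying a friendship as a high school friendship) can be sketched as follows. The dictionary layout is hypothetical, and "within three cohorts" is assumed to mean a cohort gap of at most three years.

```python
def pick_school(reported_schools, friends_per_school):
    """For a user who reports several schools, keep the one with most friends."""
    return max(reported_schools, key=lambda s: friends_per_school.get(s, 0))


def is_high_school_friend(a, b, max_cohort_gap=3):
    """a, b: dicts with 'school' (identifier or None) and 'cohort' (a year)."""
    if a["school"] is None or b["school"] is None:
        return False
    same_school = a["school"] == b["school"]
    close_cohort = abs(a["cohort"] - b["cohort"]) <= max_cohort_gap
    return same_school and close_cohort


alice = {"school": "HS-1", "cohort": 2001}
bob = {"school": "HS-1", "cohort": 2004}
carol = {"school": "HS-1", "cohort": 2006}
chosen = pick_school(["HS-1", "HS-2"], {"HS-1": 12, "HS-2": 40})
```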

### Benchmarking

Extended Data Table 4a shows summary statistics for our baseline sample and, for comparison, for those aged between 25 and 44 years in the 2014–2018 ACS. The Facebook sample is similar to the full population in terms of age, sex and language. Consistent with previous work87, women are slightly over-represented in our Facebook sample (53.6%) relative to men. The median individual in our analysis sample has 382 in-sample Facebook friends; in total, there are just under 21 billion friendship pairs between individuals in the sample.

As much of our analysis relies on variation across areas, it is important that our sample has good coverage not just nationally but also across locations. In Supplementary Information A.1, we show that our sample has high coverage rates across the United States, and that coverage rates do not vary systematically across locations with different income levels or demographic characteristics.

Most of our analysis draws on the SES measure constructed as described in the previous subsection. We evaluate the accuracy of this SES measure by correlating the share of households with above-median income within each ZIP code from the ACS with the estimated share of Facebook users with above-median SES in our sample. The population-weighted correlation between our estimates of the share of high-SES individuals and the ACS estimates at the ZIP-code level is 0.88. Furthermore, there are similarly high correlations between our estimates of the share of high-SES households and corresponding statistics drawn from external publicly available administrative datasets at the high school and college levels (see the companion paper9 for details).
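For reference, a population-weighted Pearson correlation of the kind reported above can be computed as in this sketch. The ZIP-level shares and populations are made-up illustrative values, not data from the paper.

```python
from math import sqrt


def weighted_corr(x, y, w):
    """Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(var_x * var_y)


# Hypothetical ZIP-level shares of high-SES users and high-income households.
facebook_share = [0.2, 0.5, 0.8]
acs_share = [0.3, 0.6, 0.9]
population = [100, 300, 50]
r = weighted_corr(facebook_share, acs_share, population)
```

Because the toy ACS shares are an exact linear function of the Facebook shares, `r` is 1 up to floating-point error regardless of the weights.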

For some parts of our analysis—in particular, for computing measures of EC during childhood—we focus on the subsample of individuals whom we can link to parents with an SES prediction and whom we can assign to a high school on the basis of self-reports and network-based imputations. Panel B of Extended Data Table 4 presents summary statistics for this subsample of 19.4 million users, or about 27% of the full analysis sample. The characteristics of this subsample are broadly similar to those of the full sample, although users whom we can link to high schools and parents with SES predictions are about 2 years younger on average than users in the full sample, in large part because our approach does not allow us to assign SES predictions for parents older than 65 years. County-level median household incomes differ by \$876 between the samples, about 6% of a standard deviation.

We further evaluate our SES measure and parental linkages by comparing estimates of intergenerational economic mobility using our SES proxies to publicly available estimates based directly on household incomes from population-level tax data. There is a linear relationship between individuals’ and their parents’ SES ranks across the distribution of parental SES, with a slope of 0.32 (Extended Data Fig. 2). This relationship is similar to the estimated slope of 0.34 in population tax data10, thereby supporting the validity of both our SES imputations and parental linkages.

We conclude that our Facebook analysis samples are representative of the populations we seek to study and that our measures of SES align with external data.

### Measuring connectedness

#### Economic connectedness

Let

$$f_{Q,i}\equiv \frac{[\text{Number of friends in SES quantile } Q]_{i}}{\text{Total number of friends}_{i}}$$

(1)

denote individual i’s share of friends from SES quantile Q. To obtain measures of the degree of homophily that are not sensitive to the size of each quantile bin, we normalize $f_{Q,i}$ by the share of individuals in the sample who belong to quantile Q, $w_{Q}$ (for example, $w_{Q} = 0.1$ for deciles). We then define person i’s individual EC (IEC) to individuals from quantile Q as

$${{\rm{IEC}}}_{Q,i}\equiv \frac{{f}_{Q,i}}{{w}_{Q}}.$$

(2)

We define the level of EC in community (county or ZIP code) c as the mean level of individual EC of low-SES (for example, below-median) members of that community, as follows:

$${{\rm{EC}}}_{c}=\frac{{\sum }_{i\in L\cap c}{{\rm{IEC}}}_{i}}{{N}_{Lc}},$$

(3)

where $N_{Lc}$ is the number of low-SES individuals in community c, L denotes the set of low-SES individuals, and $\mathrm{IEC}_{i}$ is individual i’s EC to high-SES (for example, above-median) individuals. When defining EC in a given community, we continue to rank individuals in the national SES distribution and include friendships to individuals residing outside that community. In the presence of homophily, EC ranges from 0 to 1, with a value of 1 indicating, for example, that half of below-median-SES individuals’ friends have above-median SES.
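Equations (1) to (3) can be traced through a small worked example. This is a sketch assuming a median split (so the high-SES share is 0.5) and a hypothetical input format in which each low-SES person is represented by a list of flags, one per friend, indicating whether that friend is high SES.

```python
def individual_ec(friend_is_high, w_high=0.5):
    """IEC to the high-SES group: equations (1) and (2).

    friend_is_high: list of booleans, one per friend.
    w_high: population share of the high-SES group (0.5 for a median split).
    """
    share_high = sum(friend_is_high) / len(friend_is_high)  # f_{Q,i}
    return share_high / w_high                              # f_{Q,i} / w_Q


def community_ec(low_ses_members):
    """Mean IEC over the low-SES members of one community: equation (3)."""
    iecs = [individual_ec(friends) for friends in low_ses_members]
    return sum(iecs) / len(iecs)


# Two low-SES residents: one with 1 of 4 high-SES friends, one with 2 of 4.
low_ses_members = [
    [True, False, False, False],
    [True, True, False, False],
]
ec = community_ec(low_ses_members)
```

The two individual values are 0.25 / 0.5 = 0.5 and 0.5 / 0.5 = 1.0, so the community EC is 0.75.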

We construct standard errors for EC in each location using a bootstrap resampling method that adjusts for correlations in connectedness across individuals arising from having common pools of friends (Supplementary Information B.3). Because sample sizes are large, almost none of the geographical difference in EC is due to sampling variation. At the county level, the mean standard error of 0.004 is more than an order of magnitude smaller than the signal standard deviation of EC across counties of 0.18. When we randomly split the microdata into two halves and estimate ECs by county in each half, we obtain a split-sample correlation (reliability) of 0.999 across counties, weighting by the number of people in each county with household income below the national median. The ZIP-code-level estimates we release are also precise, with a split sample reliability of 0.99 (pooling all ZIP codes in the United States) when weighted by below-median-income population.

#### Childhood EC

We construct two measures of childhood EC: one based on links between individuals and their parents in our Facebook analysis sample and another based on data from Instagram.

To measure childhood EC in the Facebook sample, we restrict the sample to individuals whom we could link to high schools and their parents (about 27% of the full sample). We assign parental SES ranks (estimated using the machine-learning algorithm described in the ‘Variable definitions’ section) within this subsample, ranking individuals on the basis of parental SES relative to others in the same birth cohort. We then measure fQ,i as the share of friends from parental-SES quantile Q within the subset of friends from high school: friends who attended the same high school and are within three cohorts of the individual (so that they would have most likely overlapped in school). Ideally, we would directly observe all friendships made during childhood. However, because the Facebook platform was not available when the members of the birth cohorts we analyse were growing up, we use current friends who attended the same high school to identify friendships made in childhood. When calculating childhood EC by location, we assign individuals to the counties where their high schools are located, rather than counties where they currently live, to map people to the places where they grew up. We do not produce ZIP-code-level measures of childhood EC because we cannot reliably infer individuals’ childhood ZIP codes from the locations of their high schools (as children from many ZIP codes might attend a given school).

To measure childhood EC for users of Instagram, a widely used social networking platform owned by Meta, we restrict the raw Instagram data to personal users (not business pages) in the United States who had not deactivated their account, had been active on the platform within the past 30 days, and were predicted to be between 13 and 17 years of age as of 28 May 2022 (see Supplementary Information A.4 for further details). Next, we assign the individuals in our sample to ZIP codes on the basis of their IP address and other features. Then, we assign Instagram users an SES estimate on the basis of two variables: (1) the median household income of their residential ZIP code from publicly available data on incomes in the 25–44-year age bin from the 2014–2018 ACS, and (2) the price of their phone. We then construct a weighted z-score of these two inputs, placing two-thirds of the weight on median household income and one-third of the weight on the price of the phone. The higher weight on ZIP-code-based income relative to phone price reflects that ZIP codes played a particularly large role in the machine-learning model used to construct our baseline measures of SES in the Facebook data (although using other weights in the construction of the z-score produced similar results). We rank users nationally on the basis of these weighted z-scores to assign them an SES percentile rank. Users above the 50th percentile are termed high SES, whereas those at the 50th percentile and below are termed low SES. Finally, we construct measures of individual EC as defined in equation (2). Because ties on Instagram, which are termed ‘follows’, are directional (that is, one person can follow another without that person following them), we restrict our attention to reciprocal followers to mimic friendships on Facebook when measuring connectedness.
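Two steps of the Instagram procedure lend themselves to a short sketch: the weighted z-score and the restriction to reciprocal follows. The function names and toy inputs are hypothetical, and the z-scores here use the population standard deviation, which is an assumption.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of numbers to mean 0 and (population) s.d. 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def instagram_ses_score(zip_incomes, phone_prices, income_weight=2 / 3):
    """Weighted z-score: two-thirds on ZIP income, one-third on phone price."""
    z_income, z_phone = zscores(zip_incomes), zscores(phone_prices)
    return [
        income_weight * zi + (1 - income_weight) * zp
        for zi, zp in zip(z_income, z_phone)
    ]


def reciprocal_ties(follows):
    """follows: set of (follower, followed) pairs.

    Returns the set of unordered pairs in which each user follows the other.
    """
    return {
        frozenset((a, b))
        for a, b in follows
        if a != b and (b, a) in follows
    }


scores = instagram_ses_score([40, 60, 80], [600, 400, 200])
ties = reciprocal_ties({("a", "b"), ("b", "a"), ("a", "c")})
```

In the toy data income and phone price point in opposite directions, and the larger weight on income determines the ordering of the scores.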

Each of the two measures of childhood EC has certain advantages and limitations. The Facebook parental SES measure has the advantage of capturing the childhood friendships of individuals in approximately the same set of cohorts for which we measure economic mobility. However, because we are able to construct this measure only for the 27% of individuals whom we can link to parents and to a high school, these estimates are noisier and potentially less representative than our baseline estimates. The Instagram data do not require parental linkage and capture all friends, not just high school friends, thereby producing a larger and more comprehensive sample. The limitation of the Instagram EC measure is that it measures EC among the 2005–2009 birth cohorts, rather than the 1978–1983 cohorts for which we measure economic mobility. However, the stability of both economic mobility72 and EC (Supplementary Fig. 1) within a location over time mitigates the consequences of this misalignment.

### Measuring cohesiveness

We represent a set of friendships by the matrix $\mathbf{A} \in \{0, 1\}^{n\times n}$, where $A_{ij} = 1$ denotes the existence of a friendship (edge) between individuals i and j, and $A_{ij} = 0$ denotes the absence of a friendship. We focus on three measures of the structure of $\mathbf{A}$: clustering and support ratio, which are measures of local correlation in friendships, and spectral homophily, a measure of overall network fragmentation. Other measures of cohesiveness, such as algebraic connectivity88, are also informative, but are difficult to compute or even approximate for networks of the scale we analyse. The three measures of cohesiveness we focus on here have the advantage of being computationally tractable in large samples.

#### Clustering

Previous work33 has argued that if person i is friends with both persons j and k, then having j and k be friends with each other can help them collectively pressure and sanction person i, thereby helping to enforce norms. Motivated by this logic, many studies have measured the extent of such ‘network closure’ by the degree of clustering within a person’s network: the frequency with which two friends of that person are in turn friends with each other. Letting $N_{i}(\mathbf{A})$ denote the set of i’s friends and $d_{i}(\mathbf{A})$ its cardinality (the number of friends i has), the clustering of i’s network is defined as

$${{\rm{Clustering}}}_{i}({\bf{A}})=\sum _{k,j\in {N}_{i}({\bf{A}}),\,k < j}\frac{{A}_{kj}}{{d}_{i}({\bf{A}})({d}_{i}({\bf{A}})-1)/2}.$$

(4)

We measure clustering in a community c as the average of equation (4) across people living in that community as follows:

$${{\rm{Clustering}}}_{c}=\frac{{\sum }_{i\in c}{{\rm{Clustering}}}_{i}({\bf{A}})}{{N}_{c}}.$$

(5)
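Equations (4) and (5) translate directly into code. The sketch below stores the network as a hypothetical dictionary from each person to their set of friends, and assumes that people with fewer than two friends (for whom equation (4) is undefined) contribute a clustering of 0.

```python
from itertools import combinations


def clustering(i, friends):
    """Share of pairs of i's friends who are friends with each other (eq. 4).

    friends: dict mapping each person to the set of their friends (symmetric).
    """
    neighbours = friends[i]
    d = len(neighbours)
    if d < 2:
        return 0.0  # assumption: the undefined case is treated as 0
    closed = sum(1 for j, k in combinations(neighbours, 2) if k in friends[j])
    return closed / (d * (d - 1) / 2)


def community_clustering(members, friends):
    """Average individual clustering over community members (eq. 5)."""
    return sum(clustering(i, friends) for i in members) / len(members)


# Toy network: a triangle a-b-c plus a pendant node d attached to a.
friends = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
avg = community_clustering(["a", "b", "c", "d"], friends)
```

Here `a` has three pairs of friends of which one is closed (1/3), `b` and `c` each have one closed pair (1), and `d` has a single friend (0), so the community average is 7/12.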

#### Support ratio

Letting $A^{c}$ denote the subset of friendships between individuals who are both members of community c, we measure a community c’s support ratio as the overall frequency with which pairs of friends have at least one friend in common, focusing only on the people and friendships within that community:

$$\text{Support ratio}_{c}=\frac{|\{(ij):i,j\in c,\,A_{ij}^{c}=1,\,[(A^{c})^{2}]_{ij} > 0\}|}{|\{(ij):i,j\in c,\,A_{ij}^{c}=1\}|}.$$

(6)
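Equation (6) can be sketched as follows. The network representation is hypothetical, and returning 0 for a community with no internal friendships is an assumption. With no self-friendships, the condition that the (i, j) entry of the squared within-community matrix is positive is equivalent to i and j sharing at least one friend inside the community.

```python
def support_ratio(members, friends):
    """Share of within-community friendships with a common friend (eq. 6).

    members: iterable of people in the community.
    friends: dict mapping each person to the set of their friends (symmetric).
    """
    members = set(members)
    # Restrict every friend list to the community, i.e. build A^c.
    local = {i: friends.get(i, set()) & members for i in members}
    edges = {frozenset((i, j)) for i in members for j in local[i] if i != j}
    if not edges:
        return 0.0  # assumption: no internal friendships -> ratio of 0
    supported = 0
    for edge in edges:
        i, j = tuple(edge)
        if local[i] & local[j]:  # at least one common in-community friend
            supported += 1
    return supported / len(edges)


# Toy network: a triangle a-b-c plus a pendant node d attached to a.
friends = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
full = support_ratio(["a", "b", "c", "d"], friends)
partial = support_ratio(["a", "b", "d"], friends)
```

In the full toy network three of the four friendships are supported. Dropping `c` from the community removes the only common friend, so the ratio falls to 0, which illustrates why only within-community people and friendships are counted.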

#### Spectral homophily

Spectral homophily measures the extent to which a network is fragmented into separate groups, and relates to the speed of information aggregation in a network. A wide variety of algorithms can detect subcommunities89, and spectral homophily provides a simple measure of how strongly a network splits into such subcommunities. Formally, spectral homophily is the second-largest eigenvalue of the degree-normalized (row-stochasticized) adjacency matrix $\mathbf{A}^{c}_{\mathbf{s}}\in [0,1]^{n\times n}$. We measure spectral homophily in each county on the basis of the set of friendships among individuals in our primary sample living in that county. Friendship matrices are too sparse to estimate spectral homophily reliably at the ZIP code level. In the rare instances when there are fully isolated nodes within a county, we calculate spectral homophily on the largest connected component, which usually makes up the majority of users living in a county.
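For small networks, the second-largest eigenvalue of the row-stochastic matrix can be approximated with deflated power iteration. The sketch below is illustrative only and is not the procedure used for county-scale networks. It assumes a connected, symmetric network with no isolated nodes, and reads "second largest" as second largest in algebraic value. It uses the fact that the row-stochastic matrix shares its eigenvalues with the symmetric normalization $D^{-1/2} A D^{-1/2}$, whose leading eigenvector is proportional to the square root of the degrees.

```python
from math import sqrt


def spectral_homophily(friends, iterations=500):
    """Second-largest eigenvalue of D^-1 A for a connected, symmetric graph.

    friends: dict mapping each node to the set of its friends.
    """
    nodes = sorted(friends)
    deg = {v: len(friends[v]) for v in nodes}
    norm = sqrt(sum(deg.values()))
    # Leading eigenvector (eigenvalue 1) of S = D^-1/2 A D^-1/2.
    top = {v: sqrt(deg[v]) / norm for v in nodes}

    def apply_shifted(x):
        # y = (S + I) x / 2; the shift maps eigenvalues from [-1, 1] to [0, 1].
        return {
            v: 0.5 * (x[v] + sum(x[u] / sqrt(deg[u] * deg[v]) for u in friends[v]))
            for v in nodes
        }

    def deflate(x):
        # Remove the component along the leading eigenvector.
        dot = sum(x[v] * top[v] for v in nodes)
        return {v: x[v] - dot * top[v] for v in nodes}

    x = deflate({v: float(k + 1) for k, v in enumerate(nodes)})
    for _ in range(iterations):
        x = deflate(apply_shifted(x))
        length = sqrt(sum(val * val for val in x.values()))
        if length < 1e-12:
            return -1.0  # every remaining eigenvalue equals -1 (e.g. one edge)
        x = {v: val / length for v, val in x.items()}

    y = apply_shifted(x)
    mu = sum(x[v] * y[v] for v in nodes)  # Rayleigh quotient of shifted matrix
    return 2.0 * mu - 1.0                 # undo the shift


complete4 = {v: {u for u in "abcd" if u != v} for v in "abcd"}
cycle4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
two_triangles = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
```

The complete graph on four nodes gives −1/3 and the four-cycle gives 0, whereas two triangles joined by a single edge give a value close to 1, reflecting a network that nearly splits into two groups.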

### Measuring civic engagement

#### Volunteering rate

We start with the set of all Facebook Groups in the United States that are predicted to be about volunteering or activism based on their titles and do not have the privacy setting ‘secret’ enabled. To further improve this classification, we manually review the 50 largest such groups in the United States and the largest such group in each state, and remove the very small number of groups that are clearly misclassified. We then define the volunteering rate as the share of Facebook users in an area who are members of at least one volunteering or activism group.

#### Civic organizations

We start with the set of all Facebook Pages in the United States that are categorized as ‘public good’ pages on the basis of the page title and page category. We then remove pages that do not have a website linked, do not have a description on their Facebook page or do not have an address listed. We then assign the page to a ZIP code and county on the basis of its listed address, and calculate the density of civic organizations as the number of such pages per 1,000 Facebook users in the area.

### Correlations

We weight all correlations and regressions by the number of individuals with below-national-median parental income as calculated using Census data72, unless otherwise noted. We cluster standard errors in all county-level regressions by commuting zone and ZIP-code-level regressions by county to adjust for potential spatial autocorrelation in errors, unless otherwise noted.

The causal effect estimates used in the ‘Causal effects of place versus selection’ section are identified solely from individuals who move across areas and are therefore much less precise than the baseline observational estimates of economic mobility used in the rest of the paper, making it necessary to adjust for attenuation bias in those correlation estimates due to sampling error. We adjust for attenuation bias by dividing the raw correlation between the causal estimates of mobility and EC by the square root of the reliability of the causal estimates of mobility, as estimated by Chetty and Hendren76. The causal effect estimates are also unavailable at the ZIP-code level owing to small sample sizes for ZIP-code-level moves. This is why we focus on the observational estimates of upward income mobility in our baseline analysis.
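The attenuation adjustment described above is a one-line calculation. The sketch below uses hypothetical numbers, not values from the paper. It corrects for noise in only one of the two variables (the causal mobility estimates), which is why the divisor is the square root of a single reliability.

```python
from math import sqrt


def disattenuate(raw_corr, reliability):
    """Divide a raw correlation by the square root of the reliability."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must lie in (0, 1]")
    return raw_corr / sqrt(reliability)


# Hypothetical inputs: raw correlation 0.3, reliability 0.25.
adjusted = disattenuate(0.3, 0.25)
```

With a reliability of 0.25 the raw correlation is divided by 0.5, so the adjusted value is twice the raw one. A reliability of 1 (no sampling noise) leaves the correlation unchanged.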

### Privacy and ethics

This project focuses on drawing high-level insights about communities and groups of people, rather than individuals. We used a server-side analysis script that was designed to automatically process the raw data, strip the data of personal identifiers, and generate aggregated results, which we analysed to produce the conclusions in this paper. The script then promptly deleted the raw data generated for this project. Although we used various publicly available sources of aggregate statistics to supplement our analysis, we do not link any external individual-level information to the Facebook data. All inferences made as part of this research were created and used solely for the purpose of this research and were not used by Meta for any other purpose.

A publicly available dataset, which only includes aggregate statistics on social capital, is available at https://www.socialcapital.org. We use methods from the differential privacy literature to add noise to these aggregate statistics to protect privacy while maintaining a high level of statistical reliability; see https://www.socialcapital.org for further details on these procedures. The project was approved under Harvard University IRB 17-1692.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

https://www.nature.com/articles/s41586-022-04996-4