
Methods


Data: General Social Survey

We use the General Social Survey (GSS), conducted by NORC at the University of Chicago since 1972. The GSS is one of the most widely used data sources in social science, with over 75,000 respondents across 35 survey waves through 2024.

We test 16 variables spanning social values, economic attitudes, and social trust:

Table 1: GSS Variables Analyzed (16 total)

| Variable | Topic | Liberal Response |
| --- | --- | --- |
| HOMOSEX | Same-sex relations | “Not wrong at all” |
| GRASS | Marijuana legalization | “Legal” |
| PREMARSX | Premarital sex | “Not wrong at all” |
| ABANY | Abortion for any reason | “Yes” |
| FEPOL | Women in politics | “Disagree” (women unsuited) |
| CAPPUN | Death penalty | “Oppose” |
| GUNLAW | Gun permits | “Favor” |
| NATRACE | Spending on race issues | “Too little” |
| NATEDUC | Spending on education | “Too little” |
| NATENVIR | Spending on environment | “Too little” |
| NATHEAL | Spending on health | “Too little” |
| EQWLTH | Government reduce inequality | Top 3 of 7-point scale |
| HELPPOOR | Government help poor | Top 2 of 5-point scale |
| TRUST | Social trust | “Can trust” |
| FAIR | Fairness of others | “Try to be fair” |
| POLVIEWS | Self-identified liberal | Liberal side (1-3 of 7) |
| PRAYER | School prayer ban | “Approve” |

Full variable definitions, response codings, and preprocessing details are provided in Appendix B.

We downloaded the cumulative GSS data file (gss7224_r2.dta) containing all respondents from 1972-2024. For each variable, we calculated the percentage giving the “liberal” or “progressive” response among those with valid responses (excluding “don’t know” and refusals). Years with fewer than 50 valid responses were excluded.
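
As an illustration, the per-year liberal-response share and the 50-response exclusion rule can be computed with pandas along these lines (the function name, column names, and response codes are hypothetical; the actual GSS codings are in Appendix B):

```python
import pandas as pd

def liberal_share(df: pd.DataFrame, var: str, liberal_codes: set,
                  min_n: int = 50) -> pd.Series:
    """Percent giving the 'liberal' response per year, among valid responses.

    Assumes `df` has a YEAR column and that invalid responses ("don't know",
    refusals) have already been recoded to NaN when reading the Stata file.
    """
    valid = df.dropna(subset=[var])
    grouped = valid.groupby("YEAR")[var]
    pct = grouped.apply(lambda s: 100 * s.isin(liberal_codes).mean())
    n = grouped.size()
    return pct[n >= min_n]  # exclude years with fewer than min_n valid responses
```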

Mode Effects and Survey Methodology

The GSS has undergone significant methodological changes:

| Period | Mode | Notes |
| --- | --- | --- |
| 1972-2020 | Face-to-face | Standard in-person interviews |
| 2021 | Mixed | COVID-era combination of web/phone/in-person |
| 2022-2024 | Web-push | Primarily web with push-to-web methodology |

Implications: Web surveys may reduce social desirability bias, yielding more candid responses on sensitive topics (Kreuter et al., 2008). Some portion of observed changes between 2021 and 2024 may therefore reflect measurement differences rather than true attitude shifts. We do not attempt statistical adjustment for mode effects, as NORC has not released official crosswalk estimates. We note this as a limitation throughout.

Models

Language Models

We tested three language models with different training cutoffs:

  1. gpt-3.5-turbo-instruct (OpenAI): Training cutoff September 2021. This model cannot have seen GSS 2021 data (released November 2021) or later.

  2. GPT-4o (OpenAI): Training cutoff October 2023. This model cannot have seen GSS 2024 data (collected April-December 2024).

  3. Claude Sonnet (Anthropic): Training cutoff early 2024. Used for initial experiments but potentially contaminated with recent GSS data.

For each forecast, we prompted the model with:

Example prompt (GPT-4o):

System: You are a social scientist in 2021. You predict survey trends based on historical data.

User: Based on historical General Social Survey data, predict the percentage of Americans
who will say "Same-sex relations not wrong" in 2024.

Historical data (% giving this response):
  1973: 11%  1990: 13%  2000: 29%  2010: 42%  2018: 57%  2021: 62%

Predict only a single number between 0 and 100.

Confidence intervals were elicited in a separate prompt. Full prompt templates and parameters in Appendix A.
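
A minimal sketch of how such a prompt can be assembled from a historical series (the function name and message structure are illustrative, not our exact implementation):

```python
def build_forecast_prompt(variable_label: str, history: dict,
                          target_year: int) -> list:
    """Assemble chat messages for a single point forecast.

    `history` maps survey year -> observed percentage; values are shown
    to the model rounded to whole percents, matching the template above.
    """
    series = "  ".join(f"{yr}: {pct:.0f}%"
                       for yr, pct in sorted(history.items()))
    system = ("You are a social scientist in 2021. "
              "You predict survey trends based on historical data.")
    user = (f"Based on historical General Social Survey data, predict the "
            f"percentage of Americans who will say \"{variable_label}\" "
            f"in {target_year}.\n\n"
            f"Historical data (% giving this response):\n  {series}\n\n"
            f"Predict only a single number between 0 and 100.")
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```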

Baseline Models

We compared LLMs against standard time series forecasting methods:

  1. Naive: Predict the last observed value. Uncertainty grows with forecast horizon.

  2. Linear extrapolation: Ordinary least squares regression of values on year. Uncertainty based on residual standard error.

  3. ARIMA(1,1,0): Autoregressive integrated moving average with one AR term and one differencing operation.

  4. ETS (Holt): Exponential smoothing with linear trend (Holt’s method).
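
The first two baselines can be sketched in a few lines (function names are illustrative; the ARIMA and ETS fits would typically use a library such as statsmodels):

```python
import numpy as np

def naive_forecast(years, values, target_year):
    """Baseline 1: repeat the last observed value."""
    return values[-1]

def linear_forecast(years, values, target_year):
    """Baseline 2: OLS of value on year, extrapolated to the target year."""
    slope, intercept = np.polyfit(years, values, deg=1)
    return slope * target_year + intercept
```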

Evaluation

Temporal Holdout Design

To avoid data leakage, we use strict temporal holdout:

  1. Select a cutoff year (e.g., 2000, 2010, 2021)

  2. Provide the model only with data before the cutoff

  3. Generate predictions for years after the cutoff

  4. Compare predictions to actual GSS values
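
The split in steps 1-3 can be sketched as follows (the helper name is illustrative; we treat the cutoff year itself as observed, matching the example prompt that conditions on 2021 data):

```python
def temporal_split(series: dict, cutoff: int):
    """Split a year -> value series into train (year <= cutoff)
    and test (year > cutoff) for a strict temporal holdout."""
    train = {yr: v for yr, v in series.items() if yr <= cutoff}
    test = {yr: v for yr, v in series.items() if yr > cutoff}
    return train, test
```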

For LLMs, we additionally verify that the model’s training cutoff predates the release date of the target GSS data.

Metrics

We evaluate forecasts using:

Mean Absolute Error (MAE):

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.

Coverage: The fraction of actual values falling within the 90% confidence interval. Well-calibrated forecasts should have ~90% coverage.

Bias: Mean signed error, indicating systematic over- or under-prediction:

$$\text{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)$$
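
The three metrics are straightforward to compute; a minimal sketch, assuming point forecasts and 90% interval bounds as arrays:

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error of point forecasts."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(a - p))

def bias(actual, predicted):
    """Mean signed error; positive values mean the model under-predicts."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(a - p)

def coverage(actual, lower, upper):
    """Fraction of actual values inside the stated interval (target ~0.90)."""
    a = np.asarray(actual, float)
    return np.mean((np.asarray(lower, float) <= a) &
                   (a <= np.asarray(upper, float)))
```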

Heterogeneity Analysis

Beyond aggregate forecasts, we analyze how attitudes vary across respondent subgroups.

This enables testing whether LLMs can predict not just aggregate trends but shifts in the distribution of values across subgroups.