I am analyzing the relationship between age, education, and the probability of having a high income (>50K) using data from the UCI Adult dataset. I've fit a logistic regression model with a natural spline on age and an interaction with education, and I've visualized the predicted probabilities.
My Goal: For each education level, I want to find the "elbow" point on the age curve: the age at which the curve seems to change direction (in this example, where the positive association with income probability plateaus or begins to decline).
A real-world analogue comes from medical research: a plot could display patient survival probability on the Y-axis against a continuous variable on the X-axis, with one curve per age stratum. The aim would be to identify, for each stratum, the value of the variable that maximizes the probability of survival.
Based on my search, I identified several methodological approaches for this task, such as Receiver Operating Characteristic (ROC) curves, segmented regression, or the find_curve_elbow function from the {pathviewr} package. Which of these, or another method, constitutes the most statistically valid and appropriate methodology for this objective?
# Load libraries
library(dplyr)
library(ggplot2)
library(sjPlot)
library(RColorBrewer)
library(splines)
# Load dataset
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names <- c(
  "age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
  "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
  "hours_per_week", "native_country", "income"
)
d <- read.table(url, sep = ",", header = FALSE, col.names = column_names, strip.white = TRUE)
# Prepare data
d <- d %>%
  mutate(
    age = as.numeric(age),                                  # continuous predictor (X1)
    education_num = as.numeric(education_num),              # interacting predictor (X2)
    income = factor(income, levels = c("<=50K", ">50K"))    # binary outcome (Y)
  )
############### Model ###############
inter <- glm(income ~ splines::ns(age, df = 4) * education_num,
             data = d,
             family = 'binomial')
summary(inter)
############### plot ###############
plotinter <- plot_model(inter, terms = c("age [all]", "education_num [all]"),
                        type = 'pred',
                        show.legend = FALSE,
                        colors = colorRampPalette(RColorBrewer::brewer.pal(11, "Spectral"))(80)) +
  geom_point(size = 0.1)
# Make the confidence ribbons nearly transparent so the curves stay readable
for (i in seq_along(plotinter$layers)) {
  if (inherits(plotinter$layers[[i]]$geom, "GeomRibbon")) {
    plotinter$layers[[i]]$aes_params$alpha <- 0.02
  }
}
plotinter
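
To make the goal concrete, here is a rough sketch of what I mean, not a settled method. It treats the "elbow" as the age at which the fitted probability is highest for each education level and reads that age off a prediction grid. The helper name `peak_age()` is hypothetical, and equating the elbow with the maximum of the curve is my own assumption; a curve that plateaus without declining would show up as a peak at the edge of the age range, which the `at_boundary` column flags.

```r
library(splines)

# Hypothetical helper (not from any package): for each education level, return the
# age on a grid at which the model's predicted probability is highest.
# Assumes the model was fit with predictors named `age` and `education_num`.
peak_age <- function(model, ages, edu_levels) {
  grid <- expand.grid(age = ages, education_num = edu_levels)
  grid$prob <- as.numeric(predict(model, newdata = grid, type = "response"))
  # Within each education level, keep the grid row with the largest predicted probability
  peaks <- lapply(split(grid, grid$education_num), function(g) g[which.max(g$prob), ])
  out <- do.call(rbind, peaks)
  # A peak on the edge of the grid means the curve never turns inside the age range
  out$at_boundary <- out$age %in% range(ages)
  rownames(out) <- NULL
  out
}

# Usage with the objects defined above (skipped if they are not in the workspace)
if (exists("inter") && exists("d")) {
  peaks <- peak_age(inter,
                    ages = seq(min(d$age), max(d$age), by = 0.5),
                    edu_levels = sort(unique(d$education_num)))
  print(peaks)
}
```

This only gives point estimates on a grid with 0.5-year resolution; it says nothing about the uncertainty of the peak location, which would need something like a bootstrap over the whole fitting procedure.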
