$\begingroup$

I am analyzing the relationship between age, education, and the probability of having a high income (>50K) using data from the UCI Adult dataset. I've fit a logistic regression model with a natural spline on age and an interaction with education, and I've visualized the predicted probabilities.

My goal: for each education level, I want to find the "elbow" point on the age curve, that is, the age at which the curve seems to change direction (in this example, where the positive association with income probability plateaus or begins to decline).

A real-world analogue comes from medical research: a plot could show patient survival probability on the Y-axis against a continuous variable on the X-axis, with one curve per age stratum. The aim would be to identify, for each stratum, the value of the variable that maximizes the probability of survival.

Based on my search, I identified several methodological approaches for this task, such as Receiver Operating Characteristic (ROC) curves, segmented regression, or the find_curve_elbow function from the {pathviewr} package. Which of these, or another method, constitutes the most statistically valid and appropriate methodology for this objective?

# Load libraries
library(dplyr)
library(ggplot2)
library(sjPlot)
library(RColorBrewer)
library(splines)

# Load dataset
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names <- c(
  "age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
  "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
  "hours_per_week", "native_country", "income"
)
d <- read.table(url, sep = ",", header = FALSE, col.names = column_names, strip.white = TRUE)

# Prepare data
d <- d %>%
  mutate(
    age = as.numeric(age),
    education_num = as.numeric(education_num),  # second predictor (X2), interacts with age
    income = factor(income, levels = c("<=50K", ">50K"))  # binary outcome (Y)
  )

############### Model ############### 
inter <- glm(income ~ splines::ns(age, df = 4) * education_num,
             data = d,
             family = 'binomial')

summary(inter)

############### plot ############### 
plotinter <- plot_model(inter, terms = c("age [all]", "education_num [all]"),
                        type = "pred",
                        show.legend = FALSE,
                        colors = colorRampPalette(RColorBrewer::brewer.pal(11, "Spectral"))(80)) +
  geom_point(size = 0.1)

# Make the confidence ribbons nearly transparent so the curves stay visible
for (i in seq_along(plotinter$layers)) {
  if (inherits(plotinter$layers[[i]]$geom, "GeomRibbon")) {
    plotinter$layers[[i]]$aes_params$alpha <- 0.02
  }
}

plotinter

The plot: [image: predicted probability of income >50K against age, one curve per education_num level]
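
If "elbow" is read as the age at which the fitted probability is highest, one way to make the goal concrete is to predict on a fine age grid and take the maximizing age for each education level. The following is only a minimal sketch on simulated data: the data frame `sim`, its coefficients, and the built-in peak near age 50 are assumptions for illustration, not the Adult data, and no uncertainty is reported for the estimated peaks.

```r
library(splines)

set.seed(1)
n <- 5000

# Simulated stand-in for the real data (assumption): log-odds peak at age 50
sim <- data.frame(
  age = runif(n, 20, 80),
  education_num = sample(8:16, n, replace = TRUE)
)
eta <- -1 + 0.3 * (sim$education_num - 8) - ((sim$age - 50) / 12)^2
sim$income <- rbinom(n, 1, plogis(eta))

# Same model structure as in the question: natural spline on age x education
fit <- glm(income ~ ns(age, df = 4) * education_num,
           data = sim, family = binomial)

# Predicted probabilities on a fine age grid for a few education levels
grid <- expand.grid(age = seq(20, 80, by = 0.5),
                    education_num = c(9, 13, 16))
grid$p <- predict(fit, newdata = grid, type = "response")

# Age with the highest fitted probability, per education level
peak_age <- sapply(split(grid, grid$education_num),
                   function(g) g$age[which.max(g$p)])
peak_age
```

The same grid-and-`which.max` step could in principle be applied to `inter`, with the grid spanning the observed ages. A peak found at the edge of the grid would suggest that the curve has no interior maximum for that education level.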

$\endgroup$
  • $\begingroup$ For the upper curves, I'm not sure you even need statistics - the elbow is somewhere around 33. For the lower curves, I'm not sure you even have elbows (or at least the same type of elbows as the upper curves). I'd expect most methods should give you the "obvious" answer, and if they don't, you should question whether they appropriately summarize this particular dataset. And if there is no visually compelling choice of elbow, you should question whether it's helpful to have a statistic that picks one anyway - you can compute an elbow for that bottom curve, but does it have any real meaning? $\endgroup$ Commented Oct 15 at 18:06
  • $\begingroup$ "the age at which the curves seems to change direction" Does that mean $\frac{\partial^2 \text{income}}{\partial \text{age}^2} = 0$? I'm not familiar with the term "elbow points". $\endgroup$ Commented Oct 16 at 5:20
  • $\begingroup$ @Roland That's one objective measure of an elbow point, but it more generally suggests a qualitative/meaningful change in slope - a point of "diminishing returns". You could calculate the point of zero second derivative for any of these curves, but I'd argue it might not be terribly meaningful for the bottom-most curves. $\endgroup$ Commented Oct 16 at 16:18
  • $\begingroup$ @NuclearHoagie I would first analyze with an objective approach and then interpret the result. You can calculate the inflection point and then still judge if it is meaningful. $\endgroup$ Commented Oct 16 at 18:08
  • $\begingroup$ @Roland I might do the opposite - there is danger in generating descriptive statistics that don't actually describe what they're supposed to. I'd argue it's worse than useless to calculate the mean of a bimodal dataset, for example, as it doesn't tell you anything useful and can only provide a misleading summary. $\endgroup$ Commented Oct 16 at 18:58
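
The zero-second-derivative criterion raised in the comments can be sketched numerically. This is a minimal illustration under stated assumptions: `find_inflections` is a hypothetical helper (not from any package), the grid is assumed to be equally spaced, and the toy logistic curve with its inflection at age 41.1 is invented for checking purposes. As the comments note, a zero found this way on a nearly flat curve may not be meaningful.

```r
# Hypothetical helper: locate sign changes of the numerical second derivative
find_inflections <- function(x, y) {
  h  <- diff(x)[1]                       # assumes an equally spaced grid
  d2 <- diff(y, differences = 2) / h^2   # central second differences
  xm <- x[2:(length(x) - 1)]             # grid points that d2 refers to
  idx <- which(diff(sign(d2)) != 0)      # sign change between xm[idx] and xm[idx + 1]
  # Linear interpolation of the zero crossing between neighbouring grid points
  xm[idx] - d2[idx] * (xm[idx + 1] - xm[idx]) / (d2[idx + 1] - d2[idx])
}

# Toy curve (assumption): logistic in age with a known inflection at 41.1
age <- seq(20, 80, by = 0.25)
p   <- plogis((age - 41.1) / 5)

inflection_age <- find_inflections(age, p)
inflection_age
```

For a fitted model, `y` would be the predicted probabilities on an equally spaced age grid for one education level at a time.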
