Understanding and interpreting Cox Regression when using ordered factors

Question

I am trying to understand ordered factors (polynomial terms) and their interpretation in Cox Proportional Hazards regression model. I know when using lm() to fit linear models, you can model a a variable as an unoredered factor, ordered factor, or quantitative variable.

I will use the mtcars dataset and then the mort dataset:

library(eha)
library(survival)
library(cmprsk)
library(car)

mtcarsData <- mtcars
mtcarsData$gearOF <- factor(mtcarsData$gear, c(3, 4, 5), 
                              labels(3, 4, 5), ordered=TRUE)

mtcarsModel1 <- lm(mpg~gear, data=mtcarsData)
mtcarsModel2 <- lm(mpg~gearOF, data=mtcarsData)

mtcarsModel1 output:

Call:
lm(formula = mpg ~ gear, data = mtcarsData)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.240  -2.793  -0.205   2.126  12.583 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)    5.623      4.916   1.144   0.2618   
gear           3.923      1.308   2.999   0.0054 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.374 on 30 degrees of freedom
Multiple R-squared:  0.2307,    Adjusted R-squared:  0.205
F-statistic: 8.995 on 1 and 30 DF,  p-value: 0.005401

I know here that gear has a significant effect on cars mpg. I know that the equation looks something like the following:

 y = (B1 X) + B0
 y = (3.923 * gear) + 5.623

For mtcarsModel2 output:

Call:
 lm(formula = mpg ~ gearOF, data = mtcarsData)

 Residuals:
     Min      1Q  Median      3Q     Max 
 -6.7333 -3.2333 -0.9067  2.8483  9.3667 

 Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
 (Intercept)  20.6733     0.9284  22.267  < 2e-16 ***
 gearOF.L      3.7288     1.7191   2.169  0.03842 *  
 gearOF.Q     -4.7275     1.4888  -3.175  0.00353 ** 
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Residual standard error: 4.708 on 29 degrees of freedom
 Multiple R-squared:  0.4292,   Adjusted R-squared:  0.3898 
 F-statistic:  10.9 on 2 and 29 DF,  p-value: 0.0002948

I know that .L and .Q refer to linear and quadratic polynomial terms. I think something like the following (correct me if I am wrong, I want to get better):

y = B2*X2^2 + B1*X1 + B0
y = (-4.7275 * X2^2) + (3.7288 * X1) + 20.6733

To me this second model is more of an interpretation of the trend of the data of gears as an ordered factor on mpg. That it could be modeled as either linear or quadratic regression. Is this a fair interpretation or am I wrong in thinking this?

Does the same hold true for a Cox Proportional Hazards model when looking at an odered factor?

Below I wil use the mort dataset as an example:

A data frame with 2058 observations on the following 6 variables:

id              Personal identification number.
enter           Start of duration. Measured in years since the fortieth birthday.
exit            End of duration. Measured in years since the fortieth birthday.
event           a logical vector indicating death at end of interval.
birthdate       The birthdate in decimal form.
ses             Socio-economic status, a factor with levels lower, upper

Example:

data <- mort
data$sesOF <- factor(data$ses, c('lower', 'upper'), 
                      label=c('lower', 'upper'), ordered=TRUE)

surv_obj <- Surv(data$enter, data$exit, data$event )
model1 <- coxph(surv_obj ~  ses, data=data)
model2 <- coxph(surv_obj ~  sesOF, data=data)

Model1 output:

Call:
coxph(formula = surv_obj ~ ses, data = data)

  n= 1208, number of events= 276 

            coef exp(coef) se(coef)      z Pr(>|z|)    
sesupper -0.4795    0.6191   0.1207 -3.972 7.13e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
sesupper    0.6191      1.615    0.4886    0.7844

Concordance= 0.56  (se = 0.015 )
Likelihood ratio test= 15.81  on 1 df,   p=7e-05
Wald test            = 15.77  on 1 df,   p=7e-05
Score (logrank) test = 16.08  on 1 df,   p=6e-05

Model2 output:

Call:
coxph(formula = surv_obj ~ sesOF, data = data)

  n= 1208, number of events= 276 

            coef exp(coef) se(coef)      z Pr(>|z|)    
sesOF.L -0.33908   0.71243  0.08537 -3.972 7.13e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        exp(coef) exp(-coef) lower .95 upper .95
sesOF.L    0.7124      1.404    0.6027    0.8422

Concordance= 0.56  (se = 0.015 )
Likelihood ratio test= 15.81  on 1 df,   p=7e-05
Wald test            = 15.77  on 1 df,   p=7e-05
Score (logrank) test = 16.08  on 1 df,   p=6e-05

My interpretation of model1 is that an upper socio-economic status lowers the risk of death (HR of 0.6191).

In model2, when socio-economic status is modeled as an odered factor it returns .L polynomial term. Would similar interpretations be made to model mtcarsModel2 in the sense that this output is more of an interpretation of the trend of socio-economic status as an ordered factor?

My intuition tells me that if I were to be reporting this output, model1 would be the preffered analysis to report, while model2 here tells us more about the trend of the data.

I appreciate any help on this! :)

References:

1 - How do you interpret outputs of Cox regression based on data type of an independent categorical variable?

2- Interpreting the estimate of an ordered factor in regression

3- GLM interpretation of parameters of ordinal predictor variables

4- polynomial contrasts in survival models

5- Results of lm() function with a dependent ordered categorical variable?

6- Polynomial contrasts for regression

Greg Snow · Accepted Answer · 2025-11-10 18:35:30Z

The polynomial contrasts on ordered categorical variables make the most sense when the variables represents numbers or other levels that are considered to have equal sequential differences. So the gear example makes sense because the difference between 5 and 4 is the same as the difference between 4 and 3. But more generally, those contrasts are not the best ones to use.

Also, in your coxph example, the categorical variable only has 2 levels, so which contrasts/comparisons you use does not really matter.

For ordered categorical predictor variables I prefer to use sequential or successive difference coding (see contr.sdif in the MASS package). The interpretation in this case is how each level differs from the previous level. So with the mileage example, the first coefficient measures the difference in mileage between 4 gear and 3 gear cars, then the next coefficient measures the difference between 5 gear and 4 gear cars.

If you had a SES variable with more levels, for example: low, low_middle, middle, high_middle, and high; then the contrasts would measure low_middle vs. low, middle vs. low_middle, high_middle vs. middle, and high vs. high_middle. Now looking to see if all the coefficients are positive (or negative, or non-negative, etc.) may be an interesting result, or if all the coefficients are equal (linear), or if the first few are positive, then the remaining ones are not significantly different from 0, if one coefficient is different from 0, but the rest are not, or other combinations are more interesting to me than quadradic, cubic, etc. on levels that may or may not be spread equally.

Thanks Greg for clarifying. I just want to make sure I understood what you just posted. The ordered factor gearsOF's levels gearsOF.L and gearsOF.Q (despite not being contr.sdif) are coefficient measures of the difference in mileage between gears of the ordered factor (gears: 3 < 4 < 5). In the cox regression model, the same idea can be applied. If a categorical only has 2 levels it doesn't matter what contrasts/comparisons are used, so for a two level ordered categorical it doesn't matter if contr.treatment, contr.poly, contr.sdif is used? — SIO
– SIO, Commented Nov 10 at 20:41
@SIO, Correct on all the contrasts giving the same with only 2 levels. For the gears, gearsOF.L measures the linear relationship of the numerical valeus gearsOF.Q' measures the quadratic (in this case the non-linear piece), how much does gears 4 differ from being halfway between gears 3 and 5. With contr.sdif` the first coefficient measures 4 vs 3 gears, and the second measures 5 vs 4 gears, which I think are easier to interpret, think about. — Greg Snow
– Greg Snow, Commented Nov 10 at 23:40

Peter Flom · Accepted Answer · 2025-11-10 11:35:54Z

3

One thing you can do when you want to check whether your interpretation of the model is correct is to look at the predicted values and see if they match what you think. For your first data set, you could look at a plot (easy since there is only one IV):

plot(mtcarsData$gear,mtcarsModel2$fitted.values)

Yields

I leave it as an exercise to see if those predicted values are what you expect.

If you have more independent variables, you can make a table with various levels of each. If you have too many variables for that, you can pick some examples and see.

Of course, a model of MPG based only on gears is kind of silly, but I figure you were just using this as an example.

You could do similar things with your second data set.

answered Nov 10 at 11:35

Peter Flom

141k37 gold badges200 silver badges484 bronze badges

$\begingroup$ Re "to see if they match what you think:" that would indicate viewing a scatterplot of predicted vs. actual values, which is the standard method. Why leave that as an exercise? $\endgroup$

whuber
– whuber ♦

2025-11-10 13:33:11 +00:00
Commented Nov 10 at 13:33
$\begingroup$ @whuber I meant that I left the arithmetic of seeing if the OP's formula yields the predicted values. $\endgroup$

Peter Flom
– Peter Flom

2025-11-10 15:05:31 +00:00
Commented Nov 10 at 15:05
1

$\begingroup$ If I calculated predicted values for mtcarsModel2 using the formula y = B2*X2^2 + B1*X1 + B0 I get a similar plot as plot(mtcarsDatagear,mtcarsModel2 fitted.values) and the formula looks something like y = B2*x.Q + B1*x.L + B0 for contr.poly(3), where B2 is coef for .Q, B1 is coeff of .L and B0 is the intercept. So the formula is inline with what I would expect for calculating predicted values. Looking over the scatterplot of predicted vs. actual values, residuals, and qq-plot, they are what I would expect for this example. $\endgroup$

SIO
– SIO

2025-11-10 19:53:00 +00:00
Commented Nov 10 at 19:53
$\begingroup$ I appreciate the responses. I think @GregSnow answered the question I was about to ask you guys next. $\endgroup$

SIO
– SIO

2025-11-10 20:12:04 +00:00
Commented Nov 10 at 20:12

Add a comment |

Stack Exchange Network

Understanding and interpreting Cox Regression when using ordered factors

2 Answers 2

Your Answer

Linked

Hot Network Questions

Understanding and interpreting Cox Regression when using ordered factors

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Linked

Related

Hot Network Questions