Statsmodel Multiple Linear Regression Error - Python

Question

I am running what I think is a fairly straightforward multiple linear regression model fit using statsmodels.

My code is as follows:

y = 'EXITS|20:00:00'
all_columns = "+".join(y_2015piv.columns.difference(['EXITS|20:00:00']))
reg_formula = "y~" + all_columns

lm= smf.ols(formula=reg_formula, data=y_2015piv).fit()

Because I have about 30 factor variables I'm creating the formula using Python string manipulation. "y" is as presented above. all_columns is the dataframe y_2015piv columns without "y".

This is all_columns:

DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep

The values in the dataframe are continuous numerical variables and 0/1 dummy variables.

When I try to fit the model I get this error:

PatsyError: numbers besides '0' and '1' are only allowed with **
    y~DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep

I could not find anything online that addresses what this could be. Any help appreciated.

By the way, when I fit this model in Scikit-learn it works fine. So I figure the data is in order.

Thanks in advance.

Learner · Accepted Answer · 2017-06-15 16:48:40Z

The first error that I got was this:

PatsyError: numbers besides '0' and '1' are only allowed with **
Temp ~ MEI+ CO2+ CH4+ N2O+ CFC-11+ CFC-12+ TSI+ Aerosols
                               ^^

According to the patsy documentation (http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q), you can wrap a variable name as Q("var") in the formula to get rid of the error. I was getting the same error, and this solved it.

linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11")+ Q("CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()

This is the corrected line of code. I had first tried

linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()

but this did not work. It seems that inside a formula, numbers and operators such as + and - have a special meaning, so column names that contain them cannot be used directly, and each such name needs its own Q(...). In my case the error was:

PatsyError: Error evaluating factor: NameError: no data named 'CFC-11+ CFC-12' found
Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols
                           ^^^^^^^^^^^^^^^^^^^
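When the formula is assembled from many column names, as in the question, the Q(...) wrapping can be applied to every name programmatically. The following is a minimal sketch of that idea; the helper names quote_term and build_formula are hypothetical and not part of patsy or statsmodels, and the column list is a shortened stand-in for the real data.

```python
def quote_term(name):
    """Wrap a column name in patsy's Q() so it is read as a literal name."""
    # repr() produces a quoted Python string literal, e.g. 'EXITS|20:00:00'
    return f"Q({name!r})"


def build_formula(response, columns):
    """Build 'Q(response) ~ Q(col1) + Q(col2) + ...' from column names."""
    predictors = [c for c in columns if c != response]
    right_side = " + ".join(quote_term(c) for c in predictors)
    return quote_term(response) + " ~ " + right_side


# Shortened, illustrative column list
columns = ["DAY_Fri", "ENTRIES|00:00:00", "EXITS|20:00:00"]
formula = build_formula("EXITS|20:00:00", columns)
print(formula)
# Q('EXITS|20:00:00') ~ Q('DAY_Fri') + Q('ENTRIES|00:00:00')
```

The resulting string could then be passed as formula= to smf.ols together with the DataFrame. Note that the response is also wrapped here, since a name such as EXITS|20:00:00 contains characters that patsy would otherwise parse as operators.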

Josef · Accepted Answer · 2016-05-21 02:31:05Z

patsy handles the formula parsing: it parses the string and interprets it as a formula according to its syntax. Some characters are therefore not allowed in names, because they are part of the formula syntax. To keep such names, patsy provides Q, which takes a name as literal text and should work in this case: http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q

Otherwise, if you already have the full design matrix with all the dummy variables, there is no reason to go through the formula interface. Use the direct interface with pandas DataFrames or numpy arrays:

sm.OLS(y, x)

This ignores the names of the DataFrame columns, except that they appear as strings in the summary table. Variable/column names are also one way of defining restrictions for t_test, but those also go through patsy, and I am not sure that works with special characters in the names.

2 Comments

Windstorm1981 Over a year ago

Using the 'Q' notation still didn't work; I got a different error. However, when I used the direct interface as you suggested, it worked fine with the variable names as is. Thanks!

Johannes Wiesner Over a year ago

Note, however, that this might not always give you the same results. For instance, in statsmodels MANOVA the model fits an intercept only when you use the formula interface. statsmodels.org/dev/generated/…

lingraj S Vannur · Accepted Answer · 2022-11-29 01:47:06Z

Error: Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols

Answer: Temp ~ MEI+ CO2+ CH4+ N2O+ CFC_11+ CFC_12+ TSI+ Aerosols.

You need to remove symbols such as the minus sign or hyphen ('-') and parentheses from the column names. That solves the problem.

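    import pandas as pd

    df = pd.read_csv(filepath)  # filepath: path to your CSV file
    col = []
    for i in df.columns:
        # replace characters that patsy would parse as formula syntax
        i = i.replace('-', '_')
        i = i.replace('(', '_')
        i = i.replace(')', '_')
        col.append(i)
    df.columns = col  # assign the cleaned names back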
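The same renaming idea can be generalized to any character that is not valid in a Python identifier, which would also cover names like EXITS|20:00:00 from the question. This is only a sketch; sanitize_column_name is a hypothetical helper, and distinct original names could collide after cleaning, so the result is worth checking for duplicates.

```python
import re


def sanitize_column_name(name):
    """Replace every non-word character with '_' so the name is formula-safe."""
    cleaned = re.sub(r"\W", "_", name)
    # Identifiers may not start with a digit (or be empty), so prefix an underscore
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


original = ["CFC-11", "EXITS|20:00:00", "Temp", "2015 total"]
renamed = [sanitize_column_name(c) for c in original]
print(renamed)
# ['CFC_11', 'EXITS_20_00_00', 'Temp', '_2015_total']
```

With a DataFrame, the cleaned list could be assigned back via df.columns, as in the loop above.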

stm · Accepted Answer · 2023-06-22 08:14:25Z

This error can pop up when you include numbers (besides 0 and 1) in your formula, like y ~ 1.23 * var1 + 4.56 * var2
