4

I am fitting what I think is a fairly straightforward multiple linear regression model using statsmodels.

My code is as follows:

y = 'EXITS|20:00:00'
all_columns = "+".join(y_2015piv.columns - ['EXITS|20:00:00'])
reg_formula = "y~" + all_columns

lm= smf.ols(formula=reg_formula, data=y_2015piv).fit()

Because I have about 30 factor variables, I'm building the formula with Python string manipulation. "y" is as presented above. all_columns holds the columns of the dataframe y_2015piv without "y".

This is all_columns:

DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep

The values in the dataframe are continuous numerical variables and 0/1 dummy variables.

When I try to fit the model I get this error:

PatsyError: numbers besides '0' and '1' are only allowed with **
    y~DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep

I can't find anything online that addresses what this could be. Any help is appreciated.

By the way, when I fit this model in Scikit-learn it works fine. So I figure the data is in order.

Thanks in advance.

4 Answers

8

The first error that I got was this:

PatsyError: numbers besides '0' and '1' are only allowed with **
Temp ~ MEI+ CO2+ CH4+ N2O+ CFC-11+ CFC-12+ TSI+ Aerosols
                               ^^

According to the patsy documentation (http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q), you can wrap a problematic name as Q("var") in the formula to get rid of the error. I was getting the same error, and this solved it:

linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11")+ Q("CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()

This is the corrected line of code. I had first tried

linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11 + CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()

but this did not work. In a formula, numbers and operator characters have a special meaning, so names that contain them cannot be used unquoted, and each such name needs its own Q(...). In my case the error was:

PatsyError: Error evaluating factor: NameError: no data named 'CFC-11+ CFC-12' found
Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols
                           ^^^^^^^^^^^^^^^^^^^

2

patsy handles the formula parsing: it parses the string and interprets it as a formula according to its own syntax. Some characters are therefore not allowed in names, because they are part of the formula syntax. To keep such names, patsy provides Q, which takes the name as literal text and should work in this case: http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q
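Since the question builds its formula by string manipulation, the same step could wrap every name in Q(). The following is only a sketch: `quote_term` and `build_formula` are hypothetical helper names, and the column list is a shortened stand-in for the one in the question.

```python
def quote_term(name):
    """Wrap a column name in patsy's Q() so it is read as literal text."""
    # repr() yields a valid Python string literal and escapes embedded quotes.
    return f"Q({name!r})"


def build_formula(response, predictors):
    """Return a formula string in which every name is quoted via Q()."""
    rhs = " + ".join(quote_term(p) for p in predictors)
    return f"{quote_term(response)} ~ {rhs}"


# Shortened, illustrative subset of the columns from the question.
columns = ["DAY_Fri", "ENTRIES|00:00:00", "EXITS|16:00:00", "EXITS|20:00:00"]
response = "EXITS|20:00:00"
predictors = [c for c in columns if c != response]

formula = build_formula(response, predictors)
print(formula)
```

The resulting string could then be passed as `formula=` to `smf.ols` together with the DataFrame. Note that each name gets its own Q(...) call; quoting several names inside one Q(...) makes patsy look for a single column with that combined name.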

Otherwise, if you already have the full design matrix with all the dummy variables, there is no reason to go through the formula interface. Use the direct interface with pandas DataFrames or numpy arrays:

sm.OLS(y, x)

This ignores the names of the DataFrame columns, except for using them as strings in the summary table. Variable/column names are also one way of defining restrictions for t_test, but those also go through patsy, and I am not sure that works with special characters in the names.

2 Comments

Using the 'Q' notation still didn't work; I got a different error. However, when I used the direct interface as you suggested, it worked fine with the variable names as is. Thanks!
Note, however, that this might not always give you the same results. For instance, only when you use the formula interface in statsmodels MANOVA will the model fit an intercept. statsmodels.org/dev/generated/…
0

Error: Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols

Answer: Temp ~ MEI+ CO2+ CH4+ N2O+ CFC_11+ CFC_12+ TSI+ Aerosols.

You need to remove symbols such as the hyphen ('-') and parentheses from the column names, for example by replacing them with underscores. That resolves the problem:

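    import pandas as pd

    df = pd.read_csv(filepath)  # filepath: path to your CSV file
    col = []
    for name in df.columns:
        # Replace characters that the formula parser would read as syntax
        name = name.replace('-', '_')
        name = name.replace('(', '_')
        name = name.replace(')', '_')
        col.append(name)
    df.columns = col  # assign the cleaned names back to the DataFrame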
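The loop above covers hyphens and parentheses only, while the names in the question also contain `|` and `:`. A more general variant, sketched here with a hypothetical helper called `sanitize`, replaces every character that is not a letter, digit or underscore:

```python
import re


def sanitize(name):
    """Turn a non-empty column name into a valid Python identifier."""
    # \W matches anything that is not a letter, digit or underscore.
    cleaned = re.sub(r"\W", "_", name)
    # Identifiers cannot start with a digit, so prefix an underscore if needed.
    if cleaned[:1].isdigit():
        cleaned = "_" + cleaned
    return cleaned


# Illustrative names taken from the question and the answers.
names = ["CFC-11", "EXITS|20:00:00", "MONTH_Apr"]
renamed = {n: sanitize(n) for n in names}
print(renamed)
```

A mapping like `renamed` could be applied with `df.rename(columns=renamed)`. This sketch does not check for collisions: two different original names could end up with the same cleaned name, which would have to be resolved by hand.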


0

This error can pop up when you include numbers (besides 0 and 1) in your formula, like y ~ 1.23 * var1 + 4.56 * var2
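In the question the numbers come from the column names rather than from literal coefficients: characters such as `-`, `:` and `|` are not read as part of a plain variable name, so the digits after them end up as separate tokens. As a rough heuristic (an assumption, not patsy's actual rule), names that are not valid Python identifiers are the ones to quote or rename. `needs_quoting` is a hypothetical helper:

```python
import keyword


def needs_quoting(columns):
    """Return the column names that are not plain Python identifiers."""
    return [
        c for c in columns
        if not c.isidentifier() or keyword.iskeyword(c)
    ]


# Illustrative names taken from the question and the answers.
columns = ["DAY_Fri", "ENTRIES|00:00:00", "CFC-11", "MONTH_Apr"]
print(needs_quoting(columns))
```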
