Unexpected standard errors with weighted least squares in Python Pandas
In the main OLS class in Python pandas, I am looking to clarify the conventions used for the standard errors and t-stats reported when a weighted OLS is performed.

Here's my example data set, with some imports to use pandas and to use scikits.statsmodels WLS directly:
    import pandas
    import numpy as np
    from statsmodels.regression.linear_model import WLS

    # Make some random data.
    np.random.seed(42)
    df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])

    # Add an intercept term for direct use in WLS
    df['intercept'] = 1

    # Add a number (I picked 10) to stabilize the weight proportions a little.
    df['weights'] = df.weights + 10

    # Fit the regression models.
    pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
    sm_wls = WLS(df.a, df[['intercept', 'b']], weights=df.weights).fit()
I use %cpaste to execute this in IPython and then print the summaries of both regressions:
    In [226]: %cpaste
    Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
    :import pandas
    :import numpy as np
    :from statsmodels.regression.linear_model import WLS
    :
    :# Make some random data.
    :np.random.seed(42)
    :df = pandas.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
    :
    :# Add an intercept term for direct use in WLS
    :df['intercept'] = 1
    :
    :# Add a number (I picked 10) to stabilize the weight proportions a little.
    :df['weights'] = df.weights + 10
    :
    :# Fit the regression models.
    :pd_wls = pandas.ols(y=df.a, x=df.b, weights=df.weights)
    :sm_wls = WLS(df.a, df[['intercept', 'b']], weights=df.weights).fit()
    :--

    In [227]: pd_wls
    Out[227]:

    -------------------------Summary of Regression Analysis-------------------------

    Formula: Y ~ <x> + <intercept>

    Number of Observations:         10
    Number of Degrees of Freedom:   2

    R-squared:         0.2685
    Adj R-squared:     0.1770

    Rmse:              2.4125

    F-stat (1, 8):     2.9361, p-value:     0.1250

    Degrees of Freedom: model 1, resid 8

    -----------------------Summary of Estimated Coefficients------------------------
          Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
    --------------------------------------------------------------------------------
                 x     0.5768     1.0191       0.57     0.5869    -1.4206     2.5742
         intercept     0.5227     0.9079       0.58     0.5806    -1.2567     2.3021
    ---------------------------------End of Summary---------------------------------

    In [228]: sm_wls.summary()
    Out[228]:
    <class 'statsmodels.iolib.summary.Summary'>
    """
                                WLS Regression Results
    ==============================================================================
    Dep. Variable:                          R-squared:                       0.268
    Model:                            WLS   Adj. R-squared:                  0.177
    Method:                 Least Squares   F-statistic:                     2.936
    Date:                Wed, 17 Jul 2013   Prob (F-statistic):              0.125
    Time:                        15:14:04   Log-Likelihood:                -10.560
    No. Observations:                  10   AIC:                             25.12
    Df Residuals:                       8   BIC:                             25.72
    Df Model:                           1
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    intercept      0.5227      0.295      1.770      0.115        -0.158     1.204
    b              0.5768      0.333      1.730      0.122        -0.192     1.346
    ==============================================================================
    Omnibus:                        0.967   Durbin-Watson:                   1.082
    Prob(Omnibus):                  0.617   Jarque-Bera (JB):                0.622
    Skew:                           0.003   Prob(JB):                        0.733
    Kurtosis:                       1.778   Cond. No.                         1.90
    ==============================================================================
    """
Notice the mismatched standard errors: pandas claims the standard errors are [0.9079, 1.0191], while statsmodels says they are [0.295, 0.333].
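To make the two conventions concrete, here is a numpy-only sketch (no pandas.ols or statsmodels required) that reproduces both sets of numbers from the same seed-42 toy data. The names se_weighted and se_mixed are mine, not from either library:

```python
import numpy as np

# Rebuild the toy data exactly as in the question.
np.random.seed(42)
data = np.random.randn(10, 3)
y = data[:, 0]                            # column 'a'
b = data[:, 1]                            # column 'b'
w = data[:, 2] + 10                       # column 'weights'

X = np.column_stack([np.ones(10), b])     # [intercept, b]

# WLS normal equations: (X'WX) beta = X'Wy, with W = diag(w).
XtWX = X.T @ (w[:, None] * X)
beta = np.linalg.solve(XtWX, X.T @ (w * y))

# Weighted residual variance, df = n - k = 8; this is rmse**2 in both packages.
resid = y - X @ beta
sigma2 = (w * resid ** 2).sum() / (10 - 2)

# statsmodels convention: weights in sigma2 AND in the cross-product.
se_weighted = np.sqrt(np.diag(np.linalg.inv(XtWX) * sigma2))

# pandas convention: weighted sigma2, but the UNWEIGHTED X'X.
se_mixed = np.sqrt(np.diag(np.linalg.inv(X.T @ X) * sigma2))

print(beta)         # approx [0.5227, 0.5768]
print(se_weighted)  # approx [0.295, 0.333] -- the statsmodels numbers
print(se_mixed)     # approx [0.908, 1.019] -- the pandas numbers
```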
Back in the pandas code linked at the top of the post, I tried to track down where the mismatch comes from. First, you can see that the standard errors are reported by this function:
    def _std_err_raw(self):
        """Returns the raw standard err values."""
        return np.sqrt(np.diag(self._var_beta_raw))
So, looking at self._var_beta_raw, we find:
    def _var_beta_raw(self):
        """
        Returns the raw covariance of beta.
        """
        x = self._x.values
        y = self._y.values

        xx = np.dot(x.T, x)

        if self._nw_lags is None:
            return math.inv(xx) * (self._rmse_raw ** 2)
        else:
            resid = y - np.dot(x, self._beta_raw)
            m = (x.T * resid).T

            xeps = math.newey_west(m, self._nw_lags, self._nobs,
                                   self._df_raw, self._nw_overlap)

            xx_inv = math.inv(xx)
            return np.dot(xx_inv, np.dot(xeps, xx_inv))
In my use case, self._nw_lags will always be None, so it's the first branch that's puzzling. Since xx is the standard product of the regressor matrix, x.T.dot(x), I'm wondering how the weights affect this. The term self._rmse_raw comes directly from the statsmodels regression that is fitted in the constructor of ols, so it most definitely incorporates the weights.
This prompts the following questions:
- Why is the standard error reported with the weights applied in the RMSE part, but not in the regressor variables?
- Is that standard practice if you want the "non-transformed" variables (but then wouldn't you also want the non-transformed RMSE)? Is there a way to have pandas give the fully-weighted version of the standard error?
- Why the misdirection? In the constructor, a full statsmodels fitted regression is computed. Why wouldn't absolutely every summary statistic come straight from there? Why are they mixed and matched, so that some come from the statsmodels output and some from pandas' home-cooked calculations?
It looks like I can reconcile the pandas output by doing the following:
    In [238]: xs = df[['intercept', 'b']]

    In [239]: trans_xs = xs.values * np.sqrt(df.weights.values[:, None])

    In [240]: trans_xs
    Out[240]:
    array([[ 3.26307961, -0.45116742],
           [ 3.12503809, -0.73173821],
           [ 3.08715494,  2.36918991],
           [ 3.08776136, -1.43092325],
           [ 2.87664425, -5.50382662],
           [ 3.21158019, -3.25278836],
           [ 3.38609639, -4.78219647],
           [ 2.92835309,  0.19774643],
           [ 2.97472796,  0.32996453],
           [ 3.1158155 , -1.87147934]])

    In [241]: np.sqrt(np.diag(np.linalg.inv(trans_xs.T.dot(trans_xs)) * (pd_wls._rmse_raw ** 2)))
    Out[241]: array([ 0.29525952,  0.33344823])
I'm confused by this relationship. Is this common among statisticians: involving the weights in the RMSE part, but then choosing whether or not to also weight the variables when calculating the standard error of the coefficients? If that's the case, why wouldn't the coefficients themselves also be different between pandas and statsmodels, since those are likewise derived from variables first transformed by statsmodels?
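On that last point, a numpy-only check (assuming the same seed-42 data) of why the coefficients can't differ: WLS on (y, X) is algebraically identical to OLS on the sqrt(weight)-transformed variables, and multiplying each row of X by sqrt(w_i) also makes the transformed cross-product equal X'WX, which is exactly why the reconciliation above works:

```python
import numpy as np

np.random.seed(42)
data = np.random.randn(10, 3)
y, b, w = data[:, 0], data[:, 1], data[:, 2] + 10
X = np.column_stack([np.ones(10), b])
sw = np.sqrt(w)

# The transformed cross-product IS the weighted one: (SX)'(SX) = X'WX,
# where S = diag(sqrt(w)).
trans_X = X * sw[:, None]
assert np.allclose(trans_X.T @ trans_X, X.T @ (w[:, None] * X))

# WLS via the normal equations vs plain OLS on the transformed variables:
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
beta_ols = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)[0]
assert np.allclose(beta_wls, beta_ols)   # same coefficients either way
```

So the coefficients agree no matter which convention is used; only the choice of cross-product in the covariance differs.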
For reference, here is the full data set used in my toy example (in case np.random.seed isn't sufficient to make it reproducible):
    In [242]: df
    Out[242]:
              a         b    weights  intercept
    0  0.496714 -0.138264  10.647689          1
    1  1.523030 -0.234153   9.765863          1
    2  1.579213  0.767435   9.530526          1
    3  0.542560 -0.463418   9.534270          1
    4  0.241962 -1.913280   8.275082          1
    5 -0.562288 -1.012831  10.314247          1
    6 -0.908024 -1.412304  11.465649          1
    7 -0.225776  0.067528   8.575252          1
    8 -0.544383  0.110923   8.849006          1
    9  0.375698 -0.600639   9.708306          1
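As a quick sanity check (my addition, not from the original post): NumPy's legacy np.random.seed stream is stable across versions, so the seed alone should reproduce this table; the first raw row can be verified with numpy by itself:

```python
import numpy as np

np.random.seed(42)
data = np.random.randn(10, 3)

# Row 0 of the table: a = 0.496714, b = -0.138264, weights = 10.647689 - 10
assert np.allclose(data[0], [0.496714, -0.138264, 0.647689], atol=1e-6)
```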
Not directly answering the question here, but, in general, you should prefer the statsmodels code over the pandas code for modeling. There were some recently discovered problems with WLS in statsmodels that are now fixed. AFAIK, they were also fixed in pandas, but for the most part the pandas modeling code is not maintained, and the medium-term goal is to make sure everything available in pandas is deprecated and has been moved to statsmodels (the next release, 0.6.0, of statsmodels should have it).

To be a little clearer: pandas is now a dependency of statsmodels. You can pass DataFrames to statsmodels or use formulas in statsmodels. That is the intended relationship going forward.
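For example, a sketch assuming a current statsmodels with the formula API installed (the formula adds the intercept for you, so no df['intercept'] column is needed):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same toy data as in the question.
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
df['weights'] = df.weights + 10

# Pass the DataFrame straight to statsmodels via a formula.
res = smf.wls('a ~ b', data=df, weights=df['weights']).fit()
print(res.params)   # Intercept approx 0.5227, b approx 0.5768
print(res.bse)      # the fully-weighted standard errors, approx [0.295, 0.333]
```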