Unexpected standard errors with weighted least squares in Python Pandas


In pandas' main OLS class (pandas.stats.ols), I'm looking to clarify the conventions used for the standard errors and t-stats reported when a weighted OLS is performed.

Here's an example data set, with the imports, that uses pandas and calls statsmodels' WLS directly:

import pandas as pd
import numpy as np
from statsmodels.regression.linear_model import WLS

# Make some random data.
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])

# Add an intercept term for direct use in WLS.
df['intercept'] = 1

# Add a number (I picked 10) to stabilize the weight proportions a little.
df['weights'] = df.weights + 10

# Fit the regression models.
pd_wls = pd.ols(y=df.a, x=df.b, weights=df.weights)
sm_wls = WLS(df.a, df[['intercept', 'b']], weights=df.weights).fit()

I use %cpaste to execute this in IPython and then print the summaries of both regressions:

In [226]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:import pandas as pd
:import numpy as np
:from statsmodels.regression.linear_model import WLS
:
:# Make some random data.
:np.random.seed(42)
:df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
:
:# Add an intercept term for direct use in WLS.
:df['intercept'] = 1
:
:# Add a number (I picked 10) to stabilize the weight proportions a little.
:df['weights'] = df.weights + 10
:
:# Fit the regression models.
:pd_wls = pd.ols(y=df.a, x=df.b, weights=df.weights)
:sm_wls = WLS(df.a, df[['intercept', 'b']], weights=df.weights).fit()
:--

In [227]: pd_wls
Out[227]:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         10
Number of Degrees of Freedom:   2

R-squared:         0.2685
Adj R-squared:     0.1770

Rmse:              2.4125

F-stat (1, 8):     2.9361, p-value:     0.1250

Degrees of Freedom: model 1, resid 8

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.5768     1.0191       0.57     0.5869    -1.4206     2.5742
     intercept     0.5227     0.9079       0.58     0.5806    -1.2567     2.3021
---------------------------------End of Summary---------------------------------

In [228]: sm_wls.summary()
Out[228]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            WLS Regression Results
==============================================================================
Dep. Variable:                      a   R-squared:                       0.268
Model:                            WLS   Adj. R-squared:                  0.177
Method:                 Least Squares   F-statistic:                     2.936
Date:                Wed, 17 Jul 2013   Prob (F-statistic):              0.125
Time:                        15:14:04   Log-Likelihood:                -10.560
No. Observations:                  10   AIC:                             25.12
Df Residuals:                       8   BIC:                             25.72
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
intercept      0.5227      0.295      1.770      0.115        -0.158     1.204
b              0.5768      0.333      1.730      0.122        -0.192     1.346
==============================================================================
Omnibus:                        0.967   Durbin-Watson:                   1.082
Prob(Omnibus):                  0.617   Jarque-Bera (JB):                0.622
Skew:                           0.003   Prob(JB):                        0.733
Kurtosis:                       1.778   Cond. No.                         1.90
==============================================================================
"""

Notice the mismatched standard errors: pandas claims the standard errors are [0.9079, 1.0191], while statsmodels says [0.295, 0.333].

So I went back into the code in pandas.stats.ols and tried to track down where the mismatch comes from.

First, you can see that the standard errors are reported by this function:

def _std_err_raw(self):
    """Returns the raw standard err values."""
    return np.sqrt(np.diag(self._var_beta_raw))

So, looking at self._var_beta_raw, I find:

def _var_beta_raw(self):
    """
    Returns the raw covariance of beta.
    """
    x = self._x.values
    y = self._y.values

    xx = np.dot(x.T, x)

    if self._nw_lags is None:
        return math.inv(xx) * (self._rmse_raw ** 2)
    else:
        resid = y - np.dot(x, self._beta_raw)
        m = (x.T * resid).T

        xeps = math.newey_west(m, self._nw_lags, self._nobs, self._df_raw,
                               self._nw_overlap)

        xx_inv = math.inv(xx)
        return np.dot(xx_inv, np.dot(xeps, xx_inv))

In my use case, self._nw_lags will always be None, so it's the first branch that's puzzling. Since xx is the standard product of the regressor matrix, x.T.dot(x), I'm left wondering how the weights enter into this. The self._rmse_raw term comes directly from the statsmodels regression fitted in the constructor of ols, so it does incorporate the weights.
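If I'm reading the source correctly, the following minimal numpy sketch (all variable names here are mine, not from either codebase) reproduces both sets of standard errors from the same weighted fit:

import numpy as np
import pandas as pd

# Rebuild the toy data from the same seed as above.
np.random.seed(42)
df = pd.DataFrame(np.random.randn(10, 3), columns=['a', 'b', 'weights'])
df['intercept'] = 1
df['weights'] = df.weights + 10

X = df[['intercept', 'b']].values
y = df['a'].values
w = df['weights'].values

# WLS rescales everything by sqrt(w) and then runs ordinary least squares.
Xw = X * np.sqrt(w)[:, None]
yw = y * np.sqrt(w)
beta = np.linalg.solve(np.dot(Xw.T, Xw), np.dot(Xw.T, yw))

# The residual variance is computed on the *weighted* residuals, which is
# what pandas' _rmse_raw holds, since it comes from the statsmodels fit.
resid = yw - np.dot(Xw, beta)
scale = np.dot(resid, resid) / (len(y) - X.shape[1])

# pandas: inv(X'X) with the UNWEIGHTED X, times the weighted scale.
se_pandas = np.sqrt(np.diag(np.linalg.inv(np.dot(X.T, X))) * scale)

# statsmodels: inv(Xw'Xw) with the weighted X, times the same scale.
se_statsmodels = np.sqrt(np.diag(np.linalg.inv(np.dot(Xw.T, Xw))) * scale)

print(se_pandas)       # ~[0.9079, 1.0191], pandas' (intercept, x)
print(se_statsmodels)  # ~[0.295, 0.333], statsmodels' (intercept, b)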

This prompts the following questions:

  1. Why is the standard error computed with the weights applied in the RMSE part but not in the regressor variables? (The textbook formulas are sketched just after this list.)
  2. Is this standard practice if you want the "non-transformed" variables? (You wouldn't want a non-transformed RMSE, would you?) Is there a way to have pandas give the fully-weighted version of the standard error?
  3. Why the misdirection? In the constructor, a fully fitted statsmodels regression is computed. Why wouldn't absolutely every summary statistic come straight from there? Why are they mixed and matched, so that some come from the statsmodels output and some from pandas' home-cooked calculations?
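For reference, the textbook WLS formulas (which, as far as I can tell, are what statsmodels implements) are

    beta_hat      = inv(X.T W X) X.T W y
    Var(beta_hat) = sigma^2 * inv(X.T W X)

while the first branch of _var_beta_raw above computes sigma^2 * inv(X.T X): a weighted sigma^2 paired with an unweighted cross-product, which is exactly the mix question 1 is asking about.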

It looks like I can reconcile pandas' output to statsmodels' by doing the following:

In [238]: xs = df[['intercept', 'b']]

In [239]: trans_xs = xs.values * np.sqrt(df.weights.values[:, None])

In [240]: trans_xs
Out[240]:
array([[ 3.26307961, -0.45116742],
       [ 3.12503809, -0.73173821],
       [ 3.08715494,  2.36918991],
       [ 3.08776136, -1.43092325],
       [ 2.87664425, -5.50382662],
       [ 3.21158019, -3.25278836],
       [ 3.38609639, -4.78219647],
       [ 2.92835309,  0.19774643],
       [ 2.97472796,  0.32996453],
       [ 3.1158155 , -1.87147934]])

In [241]: np.sqrt(np.diag(np.linalg.inv(trans_xs.T.dot(trans_xs)) * (pd_wls._rmse_raw ** 2)))
Out[241]: array([ 0.29525952,  0.33344823])

I'm confused by this relationship. Is it common among statisticians to involve the weights in the RMSE part, but then to choose whether or not to weight the variables when calculating the standard error of the coefficients? And if that's the case, why wouldn't the coefficients also differ between pandas and statsmodels, since those are likewise derived from variables that statsmodels transforms first?
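(As a quick sanity check on that last point: the coefficients agree because the point estimates solve the same weighted normal equations in both libraries; only the covariance step diverges. Reusing Xw and yw from the numpy sketch above:)

# The weighted solve reproduces the coefficients that *both* summaries
# report, ~0.5227 for the intercept and ~0.5768 for b.
beta_check = np.linalg.solve(np.dot(Xw.T, Xw), np.dot(Xw.T, yw))
print(beta_check)            # ~[0.5227, 0.5768]
print(sm_wls.params.values)  # the statsmodels fit gives the same pair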

For reference, here is the full data set used in the toy example (in case np.random.seed isn't sufficient to make it reproducible):

In [242]: df
Out[242]:
          a         b    weights  intercept
0  0.496714 -0.138264  10.647689          1
1  1.523030 -0.234153   9.765863          1
2  1.579213  0.767435   9.530526          1
3  0.542560 -0.463418   9.534270          1
4  0.241962 -1.913280   8.275082          1
5 -0.562288 -1.012831  10.314247          1
6 -0.908024 -1.412304  11.465649          1
7 -0.225776  0.067528   8.575252          1
8 -0.544383  0.110923   8.849006          1
9  0.375698 -0.600639   9.708306          1

Not directly answering the question here but, in general, you should prefer the statsmodels code to the pandas modeling code. There were some recently discovered problems with WLS in statsmodels that have been fixed. AFAIK they were fixed in pandas as well, but for the most part the pandas modeling code is not maintained, and the medium-term goal is to make sure everything available in pandas is deprecated and has been moved over to statsmodels (the next release, 0.6.0, of statsmodels should have it).

To be a little clearer: pandas is now a dependency of statsmodels. You can pass DataFrames to statsmodels or use formulas in statsmodels. That is the intended relationship going forward.
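For example, here is a minimal sketch using the formula interface (statsmodels.formula.api; the weights keyword is passed through to WLS), with the df from the question:

import statsmodels.formula.api as smf

# Same toy model via a formula: the intercept is added automatically,
# and the pandas DataFrame is consumed directly.
result = smf.wls('a ~ b', data=df, weights=df.weights).fit()
print(result.summary())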

