Linear regression with extremely large coefficients

Posted 2024-09-27 23:25:34


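I built a linear regression model on this dataset: https://www.kaggle.com/shree1992/housedata

After cleaning the data and fitting the model, I got some absurdly large coefficients that I did not expect.

I googled the problem and, based on what I found, tried ridge regression. It did fix the extreme coefficients, but the score and MAE are almost the same (plain linear regression is slightly better on both MAE and score), which suggests the cause is not overfitting. So why am I getting these large coefficients, and how should I interpret them? Thanks in advance. My coefficients and code are below.

Coefficients: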

sqft_living  :: -20531660933516.066
floors  :: -46157.99116169465
bedrooms  :: -35148.64994889144
yr_built  :: -110.275390625
sqft_lot  :: -0.01336842838432517
yr_renovated  :: 13901.669921875
bathrooms  :: 22068.444163259817
condition  :: 28854.36132510344
view  :: 54609.32181396632
waterfront  :: 619987.8770517551
statezip_WA 98070  :: 51720518.26940918
statezip_WA 98023  :: 51733793.98413086
statezip_WA 98198  :: 51745527.19320679
statezip_WA 98092  :: 51753612.19506836
statezip_WA 98003  :: 51768969.80859375
statezip_WA 98057  :: 51774754.2020874
statezip_WA 98032  :: 51777293.54980469
statezip_WA 98188  :: 51780926.42871094
statezip_WA 98022  :: 51785464.6875
statezip_WA 98042  :: 51788032.485961914
statezip_WA 98001  :: 51798657.185058594
statezip_WA 98030  :: 51800982.91894531
statezip_WA 98002  :: 51807063.37084961
statezip_WA 98038  :: 51818086.75805664
statezip_WA 98058  :: 51818726.060058594
statezip_WA 98031  :: 51820966.17700195
statezip_WA 98055  :: 51836975.10852051
statezip_WA 98178  :: 51839662.78881836
statezip_WA 98059  :: 51845304.94116211
statezip_WA 98019  :: 51849298.035583496
statezip_WA 98065  :: 51858962.752441406
statezip_WA 98014  :: 51862571.193847656
statezip_WA 98148  :: 51872288.3659668
statezip_WA 98166  :: 51878712.109375
statezip_WA 98056  :: 51890492.997558594
statezip_WA 98045  :: 51890671.47558594
statezip_WA 98168  :: 51909556.58944702
statezip_WA 98146  :: 51923932.966308594
statezip_WA 98011  :: 51925708.75717163
statezip_WA 98028  :: 51930531.6730957
statezip_WA 98155  :: 51933038.31750488
statezip_WA 98024  :: 51933207.13555908
statezip_WA 98108  :: 51935337.22363281
statezip_WA 98077  :: 51937928.41999817
statezip_WA 98072  :: 51939094.63574219
statezip_WA 98106  :: 51946079.88293457
statezip_WA 98027  :: 51954189.55102539
statezip_WA 98133  :: 51968441.83276367
statezip_WA 98118  :: 51972078.98779297
statezip_WA 98074  :: 51972640.670410156
statezip_WA 98125  :: 51985392.0078125
statezip_WA 98034  :: 51989931.86279297
statezip_WA 98053  :: 51994949.201171875
statezip_WA 98075  :: 51996895.56713867
statezip_WA 98126  :: 52003476.768066406
statezip_WA 98008  :: 52019588.31152344
statezip_WA 98029  :: 52033227.60961914
statezip_WA 98177  :: 52044918.458618164
statezip_WA 98136  :: 52054739.052734375
statezip_WA 98052  :: 52055053.704589844
statezip_WA 98006  :: 52077050.865234375
statezip_WA 98007  :: 52084987.728515625
statezip_WA 98144  :: 52104137.84765625
statezip_WA 98116  :: 52123261.3046875
statezip_WA 98033  :: 52128846.232666016
statezip_WA 98115  :: 52137801.478027344
statezip_WA 98117  :: 52140383.259521484
statezip_WA 98005  :: 52147522.69140625
statezip_WA 98122  :: 52159159.841552734
statezip_WA 98103  :: 52160013.99584961
statezip_WA 98107  :: 52176913.24609375
statezip_WA 98199  :: 52218928.334228516
statezip_WA 98102  :: 52277970.43017578
statezip_WA 98040  :: 52319189.98120117
statezip_WA 98119  :: 52323874.4597168
statezip_WA 98105  :: 52360431.115722656
statezip_WA 98109  :: 52381532.43066406
statezip_WA 98112  :: 52410056.1015625
statezip_WA 98004  :: 52665837.48083496
statezip_WA 98039  :: 52891510.521728516
sqft_basement  :: 20531660933682.504
sqft_above  :: 20531660933785.93

Code:

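import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# `houses` is the Kaggle housing data, already loaded as a DataFrame
houses_preprocessed = houses[
    (houses.price < 1.2 * 10**7) &
    (houses.bedrooms > 0) &
    (houses.bedrooms <= 6) &
    (houses.bathrooms > 0) &
    (houses.price > 8000)].drop(columns=['country', 'date', 'street', 'city'])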

houses_preprocessed.loc[houses_preprocessed['yr_renovated'] < 1, 'yr_renovated'] = 0
houses_preprocessed.loc[houses_preprocessed['yr_renovated'] > 1, 'yr_renovated'] = 1

# Keep only zip codes that occur more than 10 times
zip_counts = houses_preprocessed['statezip'].value_counts()
houses_preprocessed = houses_preprocessed[houses_preprocessed['statezip'].isin(zip_counts.index[zip_counts > 10])]

X = houses_preprocessed.drop(columns=['price'])
y = houses_preprocessed['price']

X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

reg = LinearRegression()
reg.fit(X_train, y_train)

1 answer
User
#1 · Posted 2024-09-27 23:25:34

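What you are running into is multicollinearity. When two or more predictors are highly correlated, the model cannot separate their individual effects, so the fitted coefficients become unstable and can take meaningless values. Your output shows this: sqft_living is about -2.05e13 while sqft_basement and sqft_above are each about +2.05e13, so the huge values cancel each other, which is what you would expect if sqft_living is (almost) exactly sqft_above + sqft_basement. It would also explain why ridge gave nearly the same score: collinearity distorts the individual coefficients, not the predictions. If you look at the data: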

X = houses_preprocessed.drop(columns=['price'])
y = houses_preprocessed['price']

import seaborn as sns

sns.clustermap(X.select_dtypes("number").corr(method="spearman"),figsize=(6, 6))

[Figure: clustered heatmap of the Spearman correlations between the numeric predictors]
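To make the cancellation in the question's coefficients concrete, here is a minimal sketch on synthetic data. It assumes that sqft_living is exactly sqft_above + sqft_basement, which the coefficients suggest but which you should verify on the real data. The column values are made up; only the column names come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the housing columns (not the real data).
sqft_above = rng.uniform(500, 3000, n)
sqft_basement = rng.uniform(0, 1000, n)
sqft_living = sqft_above + sqft_basement  # assumed exact linear dependency

X = np.column_stack([sqft_living, sqft_above, sqft_basement])

# Three columns, but fewer independent directions.
rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "rank:", rank)

# The direction (-1, +1, +1) is invisible to the model: X @ direction is ~0,
# so any multiple of it can be added to the coefficients without changing
# a single prediction. That is the pattern of -2e13, +2e13, +2e13 above.
null_direction = np.array([-1.0, 1.0, 1.0])
null_residual = np.abs(X @ null_direction).max()
print("max |X @ (-1, 1, 1)|:", null_residual)
```

Because that direction contributes nothing to the predictions, the solver is free to put arbitrarily large, mutually cancelling numbers there.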

These three variables are highly correlated:

sns.pairplot(X[['bathrooms','sqft_above','sqft_living']])

[Figure: pair plot of bathrooms, sqft_above, and sqft_living]
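A plot shows pairwise correlation, but a numeric check is also possible. One common diagnostic is the variance inflation factor (VIF), 1 / (1 - R²), where R² comes from regressing each predictor on all the others. The sketch below is an illustration on synthetic data, not the answer's own method; the `vif` helper is hypothetical and the numbers are made up.

```python
import numpy as np

def vif(X):
    """Variance inflation factor 1 / (1 - R^2) for each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        centered = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
n = 500
# Synthetic columns: sqft_above tracks sqft_living closely, sqft_lot does not.
sqft_living = rng.uniform(500, 4000, n)
sqft_above = 0.8 * sqft_living + rng.normal(0, 50, n)
sqft_lot = rng.uniform(1000, 20000, n)

vifs = vif(np.column_stack([sqft_living, sqft_above, sqft_lot]))
print(dict(zip(["sqft_living", "sqft_above", "sqft_lot"], vifs.round(1))))
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. A predictor that is an exact linear combination of the others has R² = 1, so its VIF is undefined (division by zero).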

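So we keep only one of them. Finally, because you one-hot encoded statezip, you should not fit an intercept as well; otherwise the one-hot statezip columns add up to the intercept column, which is another exact linear dependency: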

X = pd.get_dummies(X.drop(columns=['bathrooms','sqft_above']))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
reg = LinearRegression(fit_intercept=False)
reg.fit(X_train, y_train)
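The intercept issue can be checked on a toy example. The sketch below uses a made-up six-row frame with three zip codes. It shows that an intercept column plus a full set of dummies is rank deficient, and that `pd.get_dummies(..., drop_first=True)` is an alternative to `fit_intercept=False`: it keeps the intercept and drops one reference category instead.

```python
import numpy as np
import pandas as pd

# Toy data: three zip codes, two rows each (made up for illustration).
toy = pd.DataFrame({"statezip": ["WA 98004", "WA 98039", "WA 98023",
                                 "WA 98004", "WA 98039", "WA 98023"]})

# Full one-hot encoding plus an explicit intercept column of ones.
full = pd.get_dummies(toy, dtype=float)
with_intercept = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(with_intercept)
print("full dummies + intercept:", with_intercept.shape[1], "columns, rank", rank_full)

# Dropping one category removes the dependency while keeping the intercept.
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)
reduced_with_intercept = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(reduced_with_intercept)
print("drop_first + intercept:", reduced_with_intercept.shape[1], "columns, rank", rank_reduced)
```

With `drop_first=True`, each remaining zip code coefficient is read as the difference from the dropped reference zip code. With `fit_intercept=False` and all dummies kept, as in the answer, each zip code coefficient acts as that zip code's own baseline.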

Check the R² on the test set:

reg.score(X_test,y_test)
0.7621069304476887

Given the range of the y values, the coefficients now look reasonable:

res = pd.DataFrame({'coef':reg.coef_},index=X.columns)
res.reindex(res.coef.abs().sort_values().index)


                            coef
sqft_lot               -0.023554
yr_built               54.699771
sqft_basement        -100.401752
sqft_living           278.836773
statezip_WA 98006     565.521930
...                          ...
statezip_WA 98023 -342256.082284
statezip_WA 98070 -353819.063160
statezip_WA 98004  589945.748620
waterfront         621313.209967
statezip_WA 98039  816056.566554
