Using the Variance Inflation Factor (VIF) to find multicollinearity among attributes


I am trying to understand possible multicollinearity among the different attributes/variables of a dataset by using VIF, which is commonly used for regression tasks. For this I am using the Boston Housing dataset. After finding the correlated features, I try to remove them and see the effect on the performance of a linear regression model.
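
For reference (my own paraphrase of the standard definition, not taken from the original post): the VIF of feature j is VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing feature j on all the remaining features. A VIF near 1 means feature j has little linear relationship with the others, while a large VIF means it is almost a linear combination of them.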

The code I have written is as follows-

import pandas as pd, numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt, seaborn as sns


'''
Regression Problem-
Boston Housing dataset. Target attribute- 'medv'

For details about dataset, refer to-
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

'''


# Read CSV data file-
boston_data = pd.read_csv("Boston_Housing.csv")

# Get dimension of dataset-
print("\nDimension/Shape of dataset- {0}\n".format(boston_data.shape))
# Dimension/Shape of dataset- (333, 15)


# Get attribute names of dataset-
print("\nAttribute names of dataset are:\n{0}\n\n".format(boston_data.columns.tolist()))
'''
Attribute names of dataset are:
['ID', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio',
'black', 'lstat', 'medv']
'''

# Dropping 'ID' attribute from dataset-
boston_data.drop('ID', axis = 1, inplace=True)

print("\nDimension/Shape of dataset after dropping 'ID' =  {0}\n".forma(boston_data.shape))
# Dimension/Shape of dataset after dropping 'ID' =  (333, 14)


# Check for missing values-
print("\nDoes dataset have missing values? {0}\n".format(boston_data.isnull().values.any()))
# Does dataset have missing values? False


# Visualize dataset distribution using boxplots-
sns.boxplot(data=boston_data)

plt.xticks(rotation = 20)
plt.title("Boston Housing dataset distribution using boxplots")
plt.show()


# Create correlogram-

# Compute correlation matrix of dataset-
boston_data_corr = boston_data.corr()                                  

sns.heatmap(boston_data_corr)

plt.xticks(rotation = 20)
plt.title("Boston Housing dataset Correlogram")
plt.show()


# Separate dataset into features (X) and label (y)-
X = boston_data.drop('medv', axis = 1)
y = boston_data['medv']


# Create training and testing sets-
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

print("\nDimensions of training and testing sets are:")
print("X_train = {0}, y_train = {1}, X_test = {2} and y_test = {3}\n\n".format(X_train.shape, y_train.shape, X_test.shape, y_test.shape))
# Dimensions of training and testing sets are:
# X_train = (233, 13), y_train = (233,), X_test = (100, 13) and y_test = (100,)
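
'''
Note (not from the original post): 'train_test_split' above is not seeded,
so every run produces a different train/test partition and therefore
different metrics - which likely explains the two different metric blocks
reported near the end. A minimal tweak, with 42 as an arbitrary seed of my
own choosing, shown commented out so the split above is left untouched:
'''
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)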




# Detect Multicollinearity using VIF-
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

# Create a pandas Series to get labeled VIF for each attribute-
vif_info = pd.Series(vif, index=X.columns)

print("\nVIF calculated for each attribute in dataset are:\n{0}\n".format(vif_info.sort_values(ascending = True)))
'''
VIF calculated for each attribute in dataset are:
chas        1.131439
zn          2.554112
crim        2.649245
lstat      11.341896
indus      13.801806
rad        14.649941
dis        15.707447
age        21.929478
black      22.853979
tax        56.334001
nox        75.640420
rm         83.056962
ptratio    86.123039
dtype: float64
'''
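
'''
Aside (a property of the statsmodels implementation, not something from the
original post): variance_inflation_factor() expects the design matrix to
already contain an intercept column. Computed on raw X as above, it returns
"uncentered" VIFs, which are typically much larger than the usual centered
ones. A sketch of the centered variant using sm.add_constant:
'''
X_const = sm.add_constant(X)

vif_centered = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index = X_const.columns)

# Exclude the intercept's own VIF before inspecting the attributes-
print(vif_centered.drop('const').sort_values())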


# Removing collinear attributes/variables-
'''
We create a function to remove the collinear variables.
We choose a threshold of 5 which means if VIF is more than 5 for a particular
variable/attribute then that variable will be removed.
'''
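
'''
The function mentioned above is missing from the listing; the following is
a minimal sketch of what such an iterative filter could look like. The name
'remove_collinear_features' and its exact behaviour are my own choices, not
from the original post.
'''
def remove_collinear_features(df, threshold = 5.0):
    """Iteratively drop the attribute with the highest VIF until all
    remaining attributes have VIF <= threshold."""
    features = df.copy()
    while True:
        vif = pd.Series(
            [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
            index = features.columns)
        if vif.max() <= threshold:
            return features
        # VIFs change whenever a column is removed, so drop only the single
        # worst offender per iteration and recompute-
        features = features.drop(vif.idxmax(), axis = 1)

# Example usage: X_filtered = remove_collinear_features(X, threshold = 5.0)
# Because VIFs are recomputed after every drop, the surviving set can differ
# from simply thresholding the initial VIF table once.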




# Linear Regression-

# First train a Linear Regression using all attributes to get a baseline score
lr_model = LinearRegression()

lr_model.fit(X_train, y_train)

y_pred = lr_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2s = r2_score(y_test, y_pred)

print("\nLinear Regression model metrics are:")
print("MSE = {0:.4f}, MAE = {1:.4f} & R2-Score = {2:.4f}\n\n".format(mse, mae, r2s))
# Linear Regression model metrics are:
# MSE = 33.7151, MAE = 3.8272 & R2-Score = 0.6831



# Training a Linear Regression using attributes whose VIF <= 5

# Attributes to use for model training using VIF-
filtered_cols = ['chas', 'zn', 'crim']

lr_model_filtered = LinearRegression()

lr_model_filtered.fit(X_train.loc[:, filtered_cols], y_train)

y_pred_filtered = lr_model_filtered.predict(X_test.loc[:, filtered_cols])

mse_f = mean_squared_error(y_test, y_pred_filtered)
mae_f = mean_absolute_error(y_test, y_pred_filtered)
r2s_f = r2_score(y_test, y_pred_filtered)

print("\nLinear Regression model metrics using selected attributes are:")
print("MSE = {0:.4f}, MAE = {1:.4f} & R2-Score = {2:.4f}\n\n".format(mse_f, mae_f, r2s_f))
# Linear Regression model metrics using selected attributes are:
# MSE = 83.6148, MAE = 6.2886 & R2-Score = 0.2141

# A second run produced (the train/test split is not seeded, so the
# metrics vary between runs):
# MSE = 46.3255, MAE = 4.9335 & R2-Score = 0.5646

Now, based on the computed VIF scores, only 3 attributes have a VIF score <= 10, namely 'chas', 'zn' and 'crim' (with this dataset these are also the only attributes satisfying the VIF <= 5 threshold used in the code above, since the next smallest VIF, for 'lstat', is already above 11), because in various statistics literature I have read that attributes with VIF > 10 may be removed.

If I train a linear regression model using only these 3 attributes, the Mean Absolute Error (MAE), Mean Squared Error (MSE) and R2-score are all poor, and they only start improving as I keep adding more attributes back.

So why does the VIF approach not work when using just the 3 attributes with VIF <= 10?

Thanks!

