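I'm currently working in Python, trying to learn how to do linear regression with the Fortune 500 dataset. So far I have cleaned my dataset by removing the N.A. values. However, now that I have reached Problem D, I don't know how to build the model. Based on the instructions, I assume I should use revenue (in millions) for x, but I don't know what else should go in x. How do I go on to build this model?
Part B: Clean the dataset by removing the records (rows) whose profit is N.A., and look at the relationship between revenue and profit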
dfCleanX = df[ df['Profit (in millions)']!='N.A.']
dfCleanX.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
Year 25131 non-null int64
Rank 25131 non-null int64
Revenue (in millions) 25131 non-null float64
Profit (in millions) 25131 non-null object
Company 25131 non-null object
dtypes: float64(1), int64(2), object(2)
memory usage: 1.2+ MB
dfClean = dfCleanX.astype({'Profit (in millions)': 'float64'})
print(dfClean.values.shape )
dfClean.info()
(25131, 5)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
Year 25131 non-null int64
Rank 25131 non-null int64
Revenue (in millions) 25131 non-null float64
Profit (in millions) 25131 non-null float64
Company 25131 non-null object
dtypes: float64(2), int64(2), object(1)
memory usage: 1.2+ MB
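The two-step cleaning above (filter out the 'N.A.' strings, then cast to float) can also be done in a single pass. This is only a minimal sketch, assuming the non-numeric entries in the profit column are placeholder strings such as 'N.A.'; the small DataFrame is made-up illustration data, not the Fortune 500 dataset.

```python
import pandas as pd

# Made-up sample rows that mimic the dataset's columns (not real data).
df = pd.DataFrame({
    'Revenue (in millions)': [100.0, 250.0, 400.0, 80.0],
    'Profit (in millions)': ['10.5', 'N.A.', '-3.2', '7'],
})

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
profit = pd.to_numeric(df['Profit (in millions)'], errors='coerce')

# Replace the string column with the numeric one, then drop the NaN rows.
dfClean = df.copy()
dfClean['Profit (in millions)'] = profit
dfClean = dfClean.dropna(subset=['Profit (in millions)'])

print(dfClean.dtypes)
print(dfClean.shape)
```

Unlike the `!= 'N.A.'` comparison, this also drops any other unparseable placeholder, which may or may not be what you want for your data.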
dfClean.plot.scatter(x='Revenue (in millions)', y='Profit (in millions)')
<matplotlib.axes._subplots.AxesSubplot at 0x23e0222a3c8>
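Part C: In this part we focus only on the "positive profit" cases. We want to study the relationship between revenue (x) and profit (y) and build a linear model y=a*x+b
Visualize y against x, where y and x are profit (> 0) and revenue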
positiveProfitMask = dfClean['Profit (in millions)'] > 0
dfClean[ positiveProfitMask ].plot.scatter(
x='Revenue (in millions)',
y='Profit (in millions)'
)
<matplotlib.axes._subplots.AxesSubplot at 0x23e023b8358>
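As background for the model y=a*x+b named in Part C: ordinary least squares, which is what scikit-learn's LinearRegression fits, chooses a and b to minimize the sum of squared residuals over the n positive-profit points. The standard closed form is given here for reference only; the exercise does not ask you to compute it by hand.

```latex
\min_{a,b} \sum_{i=1}^{n} \left( y_i - a x_i - b \right)^2
\quad\Longrightarrow\quad
a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - a\,\bar{x}
```

Here \bar{x} and \bar{y} are the mean revenue and mean profit of the rows that pass the positive-profit filter.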
Problem D: Focus only on the "positive profit" cases. Fill in the missing code in the cell below
from sklearn.linear_model import LinearRegression
x = dfClean[(Revenues (in millions) )][??? ]
y = dfClean[( Profits (in millions) )][??? ]
model = LinearRegression(fit_intercept=True)
model.fit(positiveProfitMask , y)
print( "model.coef_ =", model.coef_ )
print( "model.intercept_ =", model.intercept_ )
print( "Linear model about y(profit) and x(revenue): y=",
model.coef_, '* x +', model.intercept_)
yfit = model.predict(??? )
plt.scatter(x, y)
plt.plot(x, yfit, 'r');
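If you only need to fill in the line
yfit = model.predict(??? )
then you just pass in the X values (scikit-learn expects them as a 2-D array with one column) to see what your model predicts for them. Since you only want the positive-profit cases, you first need to filter X down to those rows. In pandas, this is done by indexing with a boolean mask such as the positiveProfitMask from Part C. Note that model.fit should also receive this filtered x, not the mask itself.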
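One way the blanks in Problem D could be filled in is sketched below. It assumes the column names printed by dfClean.info() ('Revenue (in millions)' and 'Profit (in millions)', both singular), and it uses a small made-up DataFrame in place of the real data so that it runs on its own. Revenue is the only feature in x; nothing else needs to go in, but it has to be reshaped to (n_samples, 1).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-in for the cleaned dataset (not real Fortune 500 figures).
dfClean = pd.DataFrame({
    'Revenue (in millions)': [100.0, 200.0, 300.0, 400.0, 500.0],
    'Profit (in millions)': [12.0, -5.0, 32.0, 42.0, 52.0],
})

# Boolean mask selecting only the rows with positive profit, as in Part C.
positiveProfitMask = dfClean['Profit (in millions)'] > 0

# scikit-learn wants x as a 2-D array of shape (n_samples, n_features),
# so the single revenue column is reshaped to (n_samples, 1). y stays 1-D.
x = dfClean['Revenue (in millions)'][positiveProfitMask].values.reshape(-1, 1)
y = dfClean['Profit (in millions)'][positiveProfitMask].values

model = LinearRegression(fit_intercept=True)
model.fit(x, y)  # fit on the filtered x, not on the mask

print("model.coef_ =", model.coef_)
print("model.intercept_ =", model.intercept_)

# Predict on the same x to get the fitted line for plotting.
yfit = model.predict(x)
```

With x, y and yfit defined this way, the plt.scatter(x, y) and plt.plot(x, yfit, 'r') lines from the exercise should work unchanged once matplotlib.pyplot has been imported as plt.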