After I have trained a model with scikit-learn and pandas, how do I predict future data (rainfall, in my case)?

            StationIndex Station  Year  Month  Day  Rainfall  dayofyear
1970-01-01             1   Dhaka  1970      1    1         0          1
1970-01-02             1   Dhaka  1970      1    2         0          2
1970-01-03             1   Dhaka  1970      1    3         0          3
1970-01-04             1   Dhaka  1970      1    4         0          4
1970-01-05             1   Dhaka  1970      1    5         0          5

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

# data is in local folder
df = pd.read_csv("data.csv")
df.head(5)

# drop calendar dates that do not exist (e.g. 30 February, 31 April)
df.drop(df[(df['Day']>28) & (df['Month']==2) & (df['Year']%4!=0)].index,inplace=True)
df.drop(df[(df['Day']>29) & (df['Month']==2) & (df['Year']%4==0)].index,inplace=True)
df.drop(df[(df['Day']>30) & ((df['Month']==4)|(df['Month']==6)|(df['Month']==9)|(df['Month']==11))].index,inplace=True)

# build a datetime index from the Year/Month/Day columns and derive the day of year
date = [str(y)+'-'+str(m)+'-'+str(d) for y, m, d in zip(df.Year, df.Month, df.Day)]
df.index = pd.to_datetime(date)
df['date'] = df.index
df['dayofyear']=df['date'].dt.dayofyear
df.drop('date',axis=1,inplace=True)

df.head()
df.size  # size is an attribute, not a method; df.size() raises a TypeError
df.info()
df.plot(x='Year',y='Rainfall',style='.', figsize=(15,5))

# train on every year up to 2015, test on 2016, Dhaka station only
train = df.loc[df['Year'] <= 2015]
test = df.loc[df['Year'] == 2016]
train=train[train['Station']=='Dhaka']
test=test[test['Station']=='Dhaka']

X_train=train.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_train=train['Rainfall']
X_test=test.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_test=test['Rainfall']

from sklearn import svm
from sklearn.svm import SVC
model = svm.SVC(gamma='auto',kernel='linear')
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# compare actual and predicted rainfall, showing only the non-zero predictions
df1 = pd.DataFrame({'Actual Rainfall': Y_test, 'Predicted Rainfall': Y_pred})
df1[df1['Predicted Rainfall']!=0].head(10)

2 Answers

User

Answer 1 · edited 2024-06-26 02:35:13

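The data is very simple. If you are entering a Kaggle competition, interpretability is not much of a concern and only accuracy matters, so you can use any complex model and get good results. If I wanted interpretability, however, I would use a decision tree with a depth of at most 4. Reducing the depth gives you a more general tree, and it will give you a good feel for the data.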

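Some suggestions:

Drop the Day and Month columns; that information is already stored in the dayofyear column (leap years are not really that much of a problem).
That leaves only three columns: year, station, and day of year.
Check whether the year column matters (a decision tree makes its most important splits in the first 2-3 levels); if it does not, you can drop it. In the real world, change is harder to predict, and the more general the model, the better. Station and date are important factors and should not be dropped.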
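The suggestions above can be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic (a made-up wet season plus noise, not the real rainfall table), `StationIndex` stands in for the station because a tree needs a numeric column, and `DecisionTreeRegressor` is used on the assumption that rainfall is treated as a numeric amount.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the rainfall table (hypothetical values, not the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"date": pd.date_range("1970-01-01", "1979-12-31", freq="D")})
demo["Year"] = demo["date"].dt.year
demo["StationIndex"] = 1  # a single station, as in the Dhaka-only subset
demo["dayofyear"] = demo["date"].dt.dayofyear

# Assumed pattern: a wet season in the middle of the year, plus Gaussian noise.
wet = demo["dayofyear"].between(151, 279)
demo["Rainfall"] = np.where(wet, 10.0, 0.0) + rng.normal(0, 1, len(demo))

# Only the three suggested columns are used as features.
features = ["Year", "StationIndex", "dayofyear"]
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(demo[features], demo["Rainfall"])

# Which columns does the shallow tree actually rely on?
importances = pd.Series(tree.feature_importances_, index=features)
print(importances.sort_values(ascending=False))

# The first levels of the tree show the most important splits.
print(export_text(tree, feature_names=features, max_depth=2))
```

If `Year` gets a negligible importance on the real data as well, that would support dropping it.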

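Then try more complex models and check whether they improve your accuracy. They might.

If they really do, use them; otherwise stick with the simpler model, since it is easier to interpret and faster to compute.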
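One way to run that comparison is to score a simple and a complex model on the same held-out year. The sketch below is only illustrative: the data is synthetic, the random forest is just one possible "complex model", and mean absolute error is an assumed metric.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the rainfall table (hypothetical values, not the real data).
rng = np.random.default_rng(1)
demo = pd.DataFrame({"date": pd.date_range("2009-01-01", "2016-12-31", freq="D")})
demo["Year"] = demo["date"].dt.year
demo["dayofyear"] = demo["date"].dt.dayofyear
wet = demo["dayofyear"].between(151, 279)
demo["Rainfall"] = np.where(wet, 10.0, 0.0) + rng.normal(0, 1, len(demo))

# Same split as in the question: train up to 2015, test on 2016.
features = ["Year", "dayofyear"]
train = demo[demo["Year"] <= 2015]
test = demo[demo["Year"] == 2016]

candidates = {
    "shallow tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate on the training years and score it on the held-out year.
scores = {}
for name, model in candidates.items():
    model.fit(train[features], train["Rainfall"])
    predictions = model.predict(test[features])
    scores[name] = mean_absolute_error(test["Rainfall"], predictions)
    print(f"{name}: MAE = {scores[name]:.2f}")
```

If the complex model does not clearly beat the shallow tree on the held-out year, the simpler model is the easier one to justify.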

User

Answer 2 · edited 2024-06-26 02:35:13

I noticed a very simple mistake:

X_train=train.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_train=train['Rainfall']
X_test=test.drop(['Station','StationIndex','dayofyear'],axis=1)
Y_test=test['Rainfall']

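You have not removed the Rainfall column from the training data.

Let me take a guess: you are getting 100% accuracy on both the training and the test set, right? This is why. The model sees that whatever appears in the Rainfall column of the training data is the answer, so it does exactly the same thing at test time and gets a perfect result, while in fact it cannot predict anything at all.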

Try running it like this:

X_train=train.drop(['Station','StationIndex','dayofyear','Rainfall'],axis=1)
Y_train=train['Rainfall']
X_test=test.drop(['Station','StationIndex','dayofyear','Rainfall'],axis=1)
Y_test=test['Rainfall']

from sklearn import svm
model = svm.SVC(gamma='auto',kernel='linear')
model.fit(X_train, Y_train)
print('Accuracy on training set: {:.2f}%'.format(100*model.score(X_train, Y_train)))
print('Accuracy on testing set: {:.2f}%'.format(100*model.score(X_test, Y_test)))
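As for predicting dates beyond the data: once Rainfall is out of the features, the model only needs the Year, Month and Day of the days you want a forecast for. A minimal sketch, assuming a synthetic history in place of the Dhaka rows, a hypothetical helper `make_features`, and a small regression tree in place of the SVC:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def make_features(dates):
    """Hypothetical helper: build the Year/Month/Day columns used for training."""
    dates = pd.DatetimeIndex(dates)
    return pd.DataFrame(
        {"Year": dates.year, "Month": dates.month, "Day": dates.day},
        index=dates,
    )

# Synthetic history standing in for the Dhaka rows (hypothetical values).
rng = np.random.default_rng(0)
history = pd.date_range("2010-01-01", "2016-12-31", freq="D")
X_train = make_features(history)
wet = (history.dayofyear > 150) & (history.dayofyear < 280)
Y_train = pd.Series(np.where(wet, 10.0, 0.0) + rng.normal(0, 1, len(history)),
                    index=history)

model = DecisionTreeRegressor(max_depth=4, random_state=0)
model.fit(X_train, Y_train)

# Future dates: build the same columns, in the same order as the training data.
future = pd.date_range("2017-01-01", "2017-12-31", freq="D")
X_future = make_features(future)[X_train.columns]
forecast = pd.Series(model.predict(X_future), index=future,
                     name="Predicted Rainfall")
print(forecast.head())
```

Such a forecast rests on the assumption that the seasonal pattern seen in the training years repeats; a tree-based model in particular cannot extrapolate a trend in Year beyond the range it was trained on.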
