What classification model should I use? New to machine learning. Need recommendations

Posted 2024-10-16 17:27:39


The goal:

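Hey guys, I'm trying to build a classification model in Python that predicts when the hourly relative inflow or outflow at a bike-share station will be excessive.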

What we're working with:

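The first 5 rows of my dataframe (over 200,000 rows in total) look like this. I've assigned the values 0, 1, 2 in the "flux" column: 0 if there is no significant movement, 1 if there is too much inflow, 2 if there is too much outflow.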

[Image: first 5 rows of the dataframe]

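I'm thinking of using the station name (over 300 stations), the hour of the day, and the day of the week as predictor variables to classify "flux".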

Model choice:

What should I go with? Naive Bayes? KNN? Random forest? Is there anything else that would be appropriate? GDMs? SVM?

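FYI: the baseline of always predicting 0 is rather high, at 92.8%. Unfortunately, the accuracy of logistic regression and of a decision tree is about the same and doesn't improve on it much. And KNN just takes forever...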

Any advice from those more experienced with machine learning on how to handle a classification problem like this?


2 Answers

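With imbalanced data like this, simply use metrics other than average accuracy to evaluate the model: precision, recall, F1, and the confusion matrix: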

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Try different models and pick the best one according to your chosen metric on a test set.
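As a rough illustration of that advice, the sketch below scores an "always predict 0" baseline with per-class precision, recall and F1 instead of accuracy alone. The labels are synthetic and only mimic the roughly 93% / 7% imbalance described in the question; the variable names and class counts are assumptions made up for the example.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Synthetic labels (an assumption, not the asker's data):
# 0 = no significant movement, 1 = too much inflow, 2 = too much outflow.
y_true = np.array([0] * 930 + [1] * 40 + [2] * 30)

# Baseline that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)

# Per-class scores; zero_division=0 sets the score to 0 (without a warning)
# for classes that are never predicted.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0
)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(f"accuracy: {accuracy:.3f}")
for label, p, r, f, s in zip([0, 1, 2], precision, recall, f1, support):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
print(cm)
```

The accuracy looks good, yet recall for classes 1 and 2 is zero, which is exactly what the per-class metrics are meant to expose.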

The Azure Machine Learning team has an article on how to choose algorithms that can help even if you don't use AzureML. From that article:

How large is your training data? If your training set is small, and you're going to train a supervised classifier, then machine learning theory says you should stick to a classifier with high bias/low variance, such as Naive Bayes. These have an advantage over low bias/high variance classifiers such as kNN since the latter tends to overfit. But low bias/high variance classifiers are more appropriate if you have a larger training set because they have a smaller asymptotic error - in these cases a high bias classifier isn't powerful enough to provide an accurate model. There are theoretical and empirical results that indicate that Naive Bayes does well in such circumstances. But note that having better data and good features usually can give you a greater advantage than having a better algorithm. Also, if you have a very large dataset classification performance may not be affected as much by the algorithm you use, so in that case it's better to choose your algorithm based on such things as its scalability, speed, or ease of use.

Do you need to train incrementally or in a batched mode? If you have a lot of data, or your data is updated frequently, you probably want to use Bayesian algorithms that update well. Both neural nets and SVMs need to work on the training data in batch mode.

Is your data exclusively categorical or exclusively numeric or a mixture of both kinds? Bayesian works best with categorical/binomial data. Decision trees can't predict numerical values.

Do you or your audience need to understand how the classifier works? Bayesian or decision trees are more easily explained. It's much harder to see or explain how neural networks and SVMs classify data.

How fast does your classification need to be generated? Decision trees can be slow when the tree is complex. SVMs, on the other hand, classify more quickly since they only need to determine which side of the "line" your data is on.

How much complexity does the problem present or require? Neural nets and SVMs can handle complex non-linear classification.
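On the article's question about incremental versus batch training: in scikit-learn, some estimators expose a `partial_fit` method for learning from data in chunks, including the naive Bayes classifiers. The following is a minimal sketch with synthetic count-like features; `MultinomialNB`, the batch sizes and the feature layout are illustrative assumptions, not something the article prescribes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# All possible labels must be declared on the first partial_fit call.
classes = np.array([0, 1, 2])

model = MultinomialNB()

# Feed the data in chunks, as if it arrived over time.
for _ in range(5):
    # Synthetic batch (assumption): 200 rows, 10 non-negative count features.
    X_batch = rng.integers(0, 5, size=(200, 10))
    y_batch = rng.integers(0, 3, size=200)
    model.partial_fit(X_batch, y_batch, classes=classes)

# Predict on a few new rows with the incrementally trained model.
X_new = rng.integers(0, 5, size=(3, 10))
predictions = model.predict(X_new)
print(predictions)
```

Each call updates the model's running counts, so earlier batches do not need to be kept in memory.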

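Now, regarding your comment that "FYI: the baseline of always predicting 0 is rather high, at 92.8%": there are anomaly detection algorithms for exactly this situation, where the classes are highly imbalanced and one class is an "anomaly" that occurs rarely, as in credit card fraud detection (real fraud makes up only a small fraction of the whole dataset). In Azure Machine Learning we use One-Class Support Vector Machine (SVM) and PCA-Based Anomaly Detection for this. Hope this helps!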
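The modules named above belong to Azure Machine Learning. As a loosely comparable sketch in scikit-learn (a swapped-in library, not the Azure implementation), `OneClassSVM` can be fitted on "normal" rows only and then used to flag rows that look unlike them. The data here is synthetic, and `nu=0.05` is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Synthetic 2-D features (assumption): "normal" rows cluster around the origin.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
# A few clearly unusual rows far away from that cluster.
X_unusual = rng.normal(loc=8.0, scale=0.5, size=(10, 2))

# nu is an upper bound on the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(X_normal)

# predict returns +1 for inliers and -1 for outliers.
flags_unusual = detector.predict(X_unusual)
flags_normal = detector.predict(X_normal)

print("flagged among unusual rows:", int((flags_unusual == -1).sum()), "of", len(X_unusual))
print("flagged among normal rows: ", int((flags_normal == -1).sum()), "of", len(X_normal))
```

Note that this treats the problem as "normal versus anything else"; it does not by itself distinguish excess inflow from excess outflow.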
