How do I handle a categorical variable with more than 33,000 cities?

3 answers

User
Answer 1 · edited 2024-09-30 03:25:03

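You could use an embedding that better reflects these cities (and shrinks the total number of features compared with direct one-hot encoding, OHE): for example, some features describing the region each city belongs to, other features describing the country, and so on.
Note that since you gave no specific details about the task, my example uses only geographic data, but you could use other variables related to each city, such as average temperature, population, or area, depending on the task you are trying to solve.
Another approach is to replace the city name with its coordinates (latitude and longitude). Again, whether this helps depends on the model's task.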
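As a minimal sketch of the two ideas above (coarser geographic levels plus coordinates), the snippet below maps each city name to a small set of features. The `CITY_INFO` table and the `city_to_features` helper are hypothetical names introduced for illustration, and the coordinates are rounded approximations; a real project would load such a table from a gazetteer.

```python
# Hypothetical lookup table: city -> (region, country code, latitude, longitude).
# Coordinates are rounded approximations, used here only for illustration.
CITY_INFO = {
    "Paris": ("Ile-de-France", "FR", 48.86, 2.35),
    "Lyon": ("Auvergne-Rhone-Alpes", "FR", 45.76, 4.84),
    "Berlin": ("Berlin", "DE", 52.52, 13.40),
}


def city_to_features(city):
    """Replace a city name with coarser categorical levels and coordinates."""
    region, country, lat, lon = CITY_INFO[city]
    return {"region": region, "country": country, "lat": lat, "lon": lon}


# One feature dict per input row; the raw city name is no longer needed.
rows = [city_to_features(c) for c in ["Paris", "Berlin", "Lyon"]]
```

The region and country columns have far fewer levels than the city column, so they are much cheaper to one-hot encode, while latitude and longitude are plain numeric features.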
Hope this helps.

User
Answer 2 · edited 2024-09-30 03:25:03

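Apart from the choice of model, you can reduce the number of features by grouping the features (cities) by geographic region. Another option is to group them by population size.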
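Grouping by population size could look roughly like this. The thresholds and bucket labels are assumptions chosen for illustration, and `population_bucket` is a hypothetical helper; the cut points should be picked to suit the data.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; adjust them to the actual data.
THRESHOLDS = [10_000, 100_000, 1_000_000]
LABELS = ["village", "town", "city", "metropolis"]


def population_bucket(population):
    """Map a population count to one of len(THRESHOLDS) + 1 size buckets."""
    # bisect_right counts how many thresholds are <= population.
    return LABELS[bisect_right(THRESHOLDS, population)]


buckets = [population_bucket(p) for p in (5_000, 100_000, 2_500_000)]
```

This turns tens of thousands of city levels into four, at the cost of treating all cities of similar size as interchangeable.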
Another option is to group them by frequency using quantile bins. Target encoding might be another option for you.
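Both ideas can be sketched in plain Python. The toy data, the function names, and the `smoothing` value are assumptions made for illustration; the smoothed mean shown here is one common variant of target encoding, not the only one. In practice the target encoding should be fitted on training folds only, otherwise the target leaks into the feature.

```python
from collections import Counter, defaultdict

# Toy data (hypothetical): one city label and one binary target per row.
cities = ["A", "A", "A", "A", "B", "B", "C", "D"]
target = [1, 1, 0, 1, 0, 0, 1, 0]


def frequency_bins(values, n_bins=2):
    """Rank categories by how often they occur and split the ranking into bins."""
    counts = Counter(values)
    # Sort by (count, name) so that ties are broken deterministically.
    ranked = sorted(counts, key=lambda c: (counts[c], c))
    return {c: i * n_bins // len(ranked) for i, c in enumerate(ranked)}


def target_encode(values, y, smoothing=2.0):
    """Smoothed mean of the target per category, shrunk toward the global mean."""
    prior = sum(y) / len(y)
    counts = Counter(values)
    sums = defaultdict(float)
    for v, t in zip(values, y):
        sums[v] += t
    return {
        v: (sums[v] + smoothing * prior) / (counts[v] + smoothing)
        for v in counts
    }


bins = frequency_bins(cities)
encoding = target_encode(cities, target)
```

Rare cities end up in the low-frequency bin together, and their target encoding stays close to the global mean because few observations support them.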
In many cases feature engineering involves a lot of manual work; unfortunately, you cannot always sort everything out automatically.

User
Answer 3 · edited 2024-09-30 03:25:03

XGBoost has also had experimental support for categorical features since version 1.3.0.

Copying my answer from another question (dated Nov 23, 2020).

From the documentation:

1.8.7 Categorical Data
Other than users performing encoding, XGBoost has experimental support for categorical data using gpu_hist and gpu_predictor. No special operation needs to be done on input test data since the information about categories is encoded into the model during training.

https://buildmedia.readthedocs.org/media/pdf/xgboost/latest/xgboost.pdf

In the DMatrix section, the documentation also says:

enable_categorical (boolean, optional) – New in version 1.3.0.
Experimental support of specializing for categorical features. Do not set to True unless you are interested in development. Currently it’s only available for gpu_hist tree method with 1 vs rest (one hot) categorical split. Also, JSON serialization format, gpu_predictor and pandas input are required.
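Since the quoted text says pandas input is required, the preparation step might look like the sketch below: the city column is converted to the pandas `category` dtype. The toy frame is hypothetical, and the snippet deliberately stops before training; under the assumptions of the quoted docs, such a frame would then be passed to `xgboost.DMatrix` with `enable_categorical=True`.

```python
import pandas as pd

# Toy frame (hypothetical): a city column and a binary label.
df = pd.DataFrame(
    {"city": ["Paris", "Berlin", "Paris", "Lyon"], "y": [1, 0, 1, 0]}
)

# Mark the column as categorical so that category information travels with the data.
df["city"] = df["city"].astype("category")

# Each distinct city gets an integer code; categories are stored once, not per row.
codes = df["city"].cat.codes
n_levels = len(df["city"].cat.categories)

# Assumption: this frame is what one would hand to
# xgboost.DMatrix(df[["city"]], label=df["y"], enable_categorical=True).
```

Note that the category-to-code mapping is derived from the training data, so the same categories must be used when preparing data for prediction.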

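Other model options:

If you are not required to use XGBoost, you can use a model such as LightGBM or CatBoost, which support categorical features out of the box, without one-hot encoding.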
