napkinXC is an extremely simple and fast library for extreme multi-class and multi-label classification.
It allows training classifiers for very large datasets in just a few lines of code with minimal resources.

Right now, napkinXC implements the following features in both Python and C++:
- Probabilistic Label Trees (PLTs) and Online Probabilistic Label Trees (OPLTs)
- Hierarchical softmax (HSM)
- Binary Relevance (BR)
- One-Versus-Rest (OVR)
- Fast online prediction of top-k labels, or of labels with probability above a given threshold
- Hierarchical k-means clustering for tree building, along with other tree-building methods
- Support for predefined hierarchies
- LIBLINEAR, SGD, and AdaGrad solvers for base classifiers
- Efficient ensembles of tree-based models
- Helpers to download and load data from the XML Repository
- Helpers to measure performance
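The fast top-k prediction mentioned above comes from best-first traversal of a probabilistic label tree: a path's probability is the product of the conditional probabilities along it, so a max-heap ordered by path probability pops labels in decreasing probability order and can stop after k labels. A minimal stdlib sketch of this idea (the toy tree and its probabilities are invented for illustration; this is not napkinXC's internal implementation):

```python
import heapq

# Toy probabilistic label tree: each node stores a conditional
# probability P(node | parent path) and either children or a label id.
TREE = {
    "root": {"prob": 1.0, "children": ["n0", "n1"]},
    "n0":   {"prob": 0.7, "children": ["l0", "l1"]},
    "n1":   {"prob": 0.4, "children": ["l2", "l3"]},
    "l0":   {"prob": 0.9, "label": 0},
    "l1":   {"prob": 0.2, "label": 1},
    "l2":   {"prob": 0.8, "label": 2},
    "l3":   {"prob": 0.5, "label": 3},
}

def predict_top_k(tree, k):
    """Best-first traversal: the most probable path is expanded first,
    so the first k leaves popped are exactly the top-k labels."""
    results = []
    heap = [(-tree["root"]["prob"], "root")]  # max-heap via negation
    while heap and len(results) < k:
        neg_p, node_id = heapq.heappop(heap)
        node = tree[node_id]
        if "label" in node:
            results.append((node["label"], -neg_p))
        else:
            for child_id in node["children"]:
                # Path probability = product of conditional probabilities.
                child_p = -neg_p * tree[child_id]["prob"]
                heapq.heappush(heap, (-child_p, child_id))
    return results

print(predict_top_k(TREE, 2))  # label ids with their path probabilities
```

Because low-probability subtrees are never expanded, prediction cost is roughly logarithmic in the number of labels, which is what makes this approach viable in the extreme classification setting.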
Please note that this library is still under development and also serves as a base for experiments, so some experimental features may not be documented.

napkinXC is distributed under the MIT license. All contributions to the project are welcome!
Roadmap

Coming soon:
- Support for any type of binary classifier from Python
- Efficient prediction with different thresholds
- Improved dataset loading in Python
- More datasets from the XML Repository
Python Quick Start and Documentation

The documentation for napkinXC is available at https://napkinxc.readthedocs.io and can be built from this repository.

The Python (3.5+) version of napkinXC can be easily installed from PyPI on Linux and macOS; it requires a modern C++17 compiler, CMake, and Git to be installed:
```sh
pip install napkinxc
```
Or install the latest master version directly from the GitHub repository:
```sh
pip install git+https://github.com/mwydmuch/napkinXC.git
```

Minimal usage example:
```python
from napkinxc.datasets import load_dataset
from napkinxc.models import PLT
from napkinxc.measures import precision_at_k

X_train, Y_train = load_dataset("eurlex-4k", "train")
X_test, Y_test = load_dataset("eurlex-4k", "test")
plt = PLT("eurlex-model")
plt.fit(X_train, Y_train)
Y_pred = plt.predict(X_test, top_k=1)
print(precision_at_k(Y_test, Y_pred, k=1))
```

More examples can be found in the python/examples directory.
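The precision@k measure used in the quickstart is the mean, over test examples, of the fraction of the top-k predicted labels that are true labels. A simplified plain-Python sketch of the computation (an illustrative reimplementation with made-up data, not napkinXC's code; use `napkinxc.measures.precision_at_k` in practice):

```python
def precision_at_k(Y_true, Y_pred, k=1):
    """Mean over examples of |top-k predicted labels ∩ true labels| / k.

    Y_true: per-example collections of true label ids.
    Y_pred: per-example label ids ranked by decreasing score.
    """
    total = 0.0
    for true, pred in zip(Y_true, Y_pred):
        true = set(true)
        hits = sum(1 for label in pred[:k] if label in true)
        total += hits / k
    return total / len(Y_true)

# Toy data: three examples, predictions ranked best-first.
Y_true = [{1, 2}, {3}, {2, 4}]
Y_pred = [[1, 5], [3, 2], [5, 4]]
print(precision_at_k(Y_true, Y_pred, k=1))  # prints 0.6666666666666666
```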
Executable

napkinXC can also be built as an executable for training and evaluating models and making predictions on data in LIBSVM format.

To build the executable, run:

```sh
cmake .
make
```

Command-line options:
```
Usage: nxc <command> <args>

Commands:
    train                   Train model on given input data
    test                    Test model on given input data
    predict                 Predict for given data
    ofo                     Use online F-measure optimization
    version                 Print napkinXC version
    help                    Print help

Args:
    General:
    -i, --input             Input dataset, required
    -o, --output            Output (model) dir, required
    -m, --model             Model type (default = plt)
                            Models: ovr, br, hsm, plt, oplt, svbopFull, svbopHf, brMips, svbopMips
    --ensemble              Number of models in ensemble (default = 1)
    -t, --threads           Number of threads to use (default = 0)
                            Note: -1 to use #cpus - 1, 0 to use #cpus
    --hash                  Size of features space (default = 0)
                            Note: 0 to disable hashing
    --featuresThreshold     Prune features below given threshold (default = 0.0)
    --seed                  Seed (default = system time)
    --verbose               Verbose level (default = 2)

    Base classifiers:
    --optimizer             Optimizer used for training binary classifiers (default = liblinear)
                            Optimizers: liblinear, sgd, adagrad, fobos
    --bias                  Value of the bias feature (default = 1)
    --inbalanceLabelsWeighting  Increase the weight of minority labels in base classifiers (default = 1)
    --weightsThreshold      Threshold value for pruning model weights (default = 0.1)

    LIBLINEAR: (more about LIBLINEAR: https://github.com/cjlin1/liblinear)
    -s, --liblinearSolver   LIBLINEAR solver (default for log loss = L2R_LR_DUAL, for L2 loss = L2R_L2LOSS_SVC_DUAL)
                            Supported solvers: L2R_LR_DUAL, L2R_LR, L1R_LR,
                            L2R_L2LOSS_SVC_DUAL, L2R_L2LOSS_SVC, L2R_L1LOSS_SVC_DUAL, L1R_L2LOSS_SVC
    -c, --liblinearC        LIBLINEAR cost coefficient, inverse of regularization strength, must be a positive float,
                            smaller values specify stronger regularization (default = 10.0)
    --eps, --liblinearEps   LIBLINEAR tolerance of termination criterion (default = 0.1)

    SGD/AdaGrad:
    -l, --lr, --eta         Step size (learning rate) for online optimizers (default = 1.0)
    --epochs                Number of training epochs for online optimizers (default = 1)
    --adagradEps            Defines starting step size for AdaGrad (default = 0.001)

    Tree:
    -a, --arity             Arity of tree nodes (default = 2)
    --maxLeaves             Maximum degree of pre-leaf nodes (default = 100)
    --tree                  File with tree structure
    --treeType              Type of tree to build if a file with the structure is not provided
                            Tree types: hierarchicalKmeans, huffman, completeKaryInOrder, completeKaryRandom,
                            balancedInOrder, balancedRandom, onlineComplete

    K-means tree:
    --kmeansEps             Tolerance of termination criterion of the k-means clustering
                            used in the hierarchical k-means tree building procedure (default = 0.001)
    --kmeansBalanced        Use balanced k-means clustering (default = 1)

    Prediction:
    --topK                  Predict top-k labels (default = 5)
    --threshold             Predict labels with probability above the threshold (default = 0)
    --thresholds            Path to a file with a threshold for each label
    --setUtility            Type of set-utility function for prediction using svbopFull, svbopHf, svbopMips models
                            Set-utility functions: uP, uF1, uAlfa, uAlfaBeta, uDeltaGamma
                            See: https://arxiv.org/abs/1906.08129

    Set-utility:
    --alpha
    --beta
    --delta
    --gamma

    Test:
    --measures              Evaluate test using a set of measures (default = "p@1,r@1,c@1,p@3,r@3,c@3,p@5,r@5,c@5")
                            Measures: acc (accuracy), p (precision), r (recall), c (coverage), hl (Hamming loss),
                            p@k (precision at k), r@k (recall at k), c@k (coverage at k), s (prediction size)
```
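The `--threshold` and `--thresholds` options above select labels whose predicted probability exceeds a cutoff: a single global one, or one per label read from a file. A minimal sketch of that selection step (illustrative only, with made-up probabilities; not napkinXC's code):

```python
def predict_above_thresholds(label_probs, thresholds, default=0.5):
    """Return labels whose probability exceeds their threshold.

    label_probs: {label_id: predicted probability} for one example.
    thresholds:  {label_id: per-label threshold}; labels without an
                 entry fall back to the single global `default`.
    """
    return sorted(
        label
        for label, p in label_probs.items()
        if p > thresholds.get(label, default)
    )

probs = {0: 0.9, 1: 0.4, 2: 0.55, 3: 0.1}
per_label = {1: 0.3, 2: 0.6}  # labels 0 and 3 use the global default
print(predict_above_thresholds(probs, per_label))  # prints [0, 1]
```

Per-label thresholds matter in extreme classification because label frequencies are highly skewed; a single global cutoff tends to suppress rare labels.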
For more details, please refer to the documentation.
References and acknowledgments

This library implements methods described in the following papers:
- Probabilistic Label Trees for Extreme Multi-label Classification
- Online Probabilistic Label Trees
- Efficient Algorithms for Set-Valued Prediction in Multi-Class Classification
Another implementation of the PLT model is available in the extremeText library, which implements the method described in the NeurIPS paper.