A Python package for splitting machine learning datasets using graph partitioning
mlNODS
Split machine learning datasets using graph partitioning
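Proper evaluation requires a proper split of the training and evaluation datasets, which in turn requires clustering. For many problems, single-linkage clustering is sufficient. Faced with a problem that this standard procedure could not solve, we developed a simple graph-based tool for creating unique datasets.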
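For reference, the standard single-linkage procedure mentioned above can be sketched as connected components over links that reach a similarity cutoff. This is a minimal illustration and not part of mlnods: the function name `single_linkage_clusters`, the `(id_a, id_b, score)` tuple layout, and the choice of `>=` for the cutoff comparison are all assumptions.

```python
from collections import defaultdict


def single_linkage_clusters(ids, scored_pairs, cutoff):
    """Group IDs so that any pair scoring >= cutoff lands in the same cluster.

    Hypothetical helper for illustration; not part of the mlnods API.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        # Assumption: a link counts when its score is at or above the cutoff.
        if score >= cutoff and a in parent and b in parent:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    # Largest clusters first, ties broken by member names for stable output.
    return sorted(clusters.values(), key=lambda c: (-len(c), sorted(c)))


pairs = [("a", "b", 90), ("b", "c", 85), ("c", "d", 20), ("e", "f", 95)]
clusters = single_linkage_clusters("abcdef", pairs, cutoff=80)
```

Whole clusters can then be assigned to training or evaluation sets. This breaks down when one cluster absorbs most of the data, which is the situation the graph-based approach is meant to handle.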
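mlNODS is a graph-based method that splits an original dataset into non-overlapping sets in cases where the data cannot be separated without removing some of it. mlNODS optimizes under the following constraints: (1) retain as many data points as possible, and (2) eliminate any overlap between two split sets. The nodes of the graph are the original data points, and the edges are a measure of similarity between nodes (e.g. sequence similarity for a set of proteins). The method first builds the complete graph and then removes nodes to satisfy the constraints on the similarity table. mlNODS is applicable to any problem and has the added benefit of allowing overlap within a set (i.e. training on homologs) while disallowing overlap between sets (i.e. no overlap between training and testing).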
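The idea of removing nodes until no link crosses two sets can be sketched with a simple greedy loop. This is not the actual mlNODS optimization, which is not described here; it is a simplified stand-in that assumes a seeded random starting assignment and repeatedly drops the node with the most cross-set links. The function name `split_without_overlap` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def split_without_overlap(nodes, scored_pairs, n_splits, cutoff, seed=0):
    """Greedy illustration: drop nodes until no link joins two different splits.

    Hypothetical helper; a simplification, not the mlnods implementation.
    """
    # Keep only links at or above the cutoff (comparison direction is an assumption).
    neighbours = defaultdict(set)
    for a, b, score in scored_pairs:
        if score >= cutoff and a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Start from a seeded random assignment of nodes to splits.
    order = sorted(nodes)
    random.Random(seed).shuffle(order)
    split_of = {node: i % n_splits for i, node in enumerate(order)}

    def conflicts(node):
        # Number of linked nodes that currently sit in a different split.
        return sum(
            1
            for other in neighbours[node]
            if other in split_of and split_of[other] != split_of[node]
        )

    removed = []
    while True:
        # Pick the node with the most cross-split links (ties broken by ID order).
        worst = max(sorted(split_of), key=conflicts, default=None)
        if worst is None or conflicts(worst) == 0:
            break
        del split_of[worst]
        removed.append(worst)

    splits = [set() for _ in range(n_splits)]
    for node, index in split_of.items():
        splits[index].add(node)
    return splits, removed


pairs = [("a", "b", 90), ("b", "c", 85), ("c", "d", 20), ("e", "f", 95)]
splits, removed = split_without_overlap("abcdef", pairs, n_splits=2, cutoff=80, seed=1)
```

Links inside a single split are left alone, which mirrors the property that similar items may share a set as long as no similar pair is divided between two sets. A random start tends to discard more nodes than necessary, so a real optimizer would choose the assignment more carefully.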
    usage: mlnods [-h] -s SPLITS -c CUTOFF [-l LIMIT] -e EDGES_FILE [-f EDGES_FORMAT]
                  -n NODES_FILE [-a][-r RANDOM][-o OUTFOLDER][-v][-q][--version]

    This is a script that will create independent sets of data
    Version: 1.0 [03/14/20]

    optional arguments:
      -h, --help            show this help message and exit
      -s SPLITS, --splits SPLITS
                            number of splits required
      -c CUTOFF, --cutoff CUTOFF
                            similarity cutoff in the units of link scores
      -l LIMIT, --limit LIMIT
                            limit on the number of links for each node
                            (default=0, infinity)
      -e EDGES_FILE, --edges EDGES_FILE
                            file containing a table of instances with link scores
                            for each pair
      -f EDGES_FORMAT, --format EDGES_FORMAT
                            format of the table file
                            blast : takes a list of -m 9 formated blast files and
                                    builds a table based on seqID
                            hssp : takes a list of -m 9 formated blast files, runs
                                    HSSP scoring script and builds an HSSP distance
                                    table
                            self<int> : space/tab separated table file, similarity
                                    score in column <int>
                                    eg "ID1 ID2 similarity_score" will be addressed
                                    as self3
                            (default=self5)
      -n NODES_FILE, --nodes NODES_FILE
                            instance file containing IDs of all instances being
                            considered
                            IDs are case-independent (eg ABC= abc)
                            IDs are always preceeded by ">" and followed by a
                            white space. No white spaces are allowed in an ID.
                            If score is provided for an ID, it should be
                            surrounded by spaces and directly follow the ID
                            (eg. >abl1_human 10 gene associated with ....)
                            Everything between two IDs is printed in the junction
                            files, but not considered in evaluation
      -a, --abundance       the option to score
                            false : score retrieved from instance file, range
                                    [0-100], default=50 when missing
                            true : score approximated by actual number of times an
                                    ID appears in the instance file
      -r RANDOM, --random RANDOM
                            set a fixed random seed to generate consistent
                            partitions
      -o OUTFOLDER, --outfolder OUTFOLDER
                            path to output folder (default=<current directory>
      -v, --verbose         set verbosity level
      -q, --quiet           no logging to stdout
      --version             show program's version number and exit

If an ID is present in the instance file, but not in the table file the ID is considered to not be linked to anything else

If an ID is present in the table file but not in the instance file, it is ignored

mlnods was developed by Yana Bromberg and refactored by Maximilian Miller. Feel free to contact us for support at services@bromberglab.org.

This software is licensed under [NPOSL-3.0](http://opensource.org/licenses/NPOSL-3.0)
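To make the input formats from the help text concrete, the following sketch reads an instance file and a `self<int>` table. It is an independent illustration and not mlnods code: the helper names `parse_nodes` and `parse_edges` are hypothetical, and it assumes that each ID starts a line, that the two IDs of a table row sit in the first two columns, and that `<int>` is a 1-based column number.

```python
import re

# ">" + ID without white space, optionally followed by a numeric score
# that is itself followed by white space or the end of the line.
ID_LINE = re.compile(r"^>(\S+)(?:\s+(\d+(?:\.\d+)?)(?=\s|$))?")


def parse_nodes(text, default_score=50.0):
    """Map lower-cased IDs to scores; a missing score falls back to 50."""
    scores = {}
    for line in text.splitlines():
        match = ID_LINE.match(line)
        if match:
            node_id = match.group(1).lower()  # IDs are case-independent
            score = match.group(2)
            scores[node_id] = float(score) if score is not None else default_score
    return scores


def parse_edges(text, edges_format="self5"):
    """Read (id_a, id_b, score) rows from a space/tab separated table."""
    if not edges_format.startswith("self"):
        raise ValueError("this sketch only handles the self<int> formats")
    column = int(edges_format[len("self"):]) - 1  # assumed 1-based column number
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) <= max(column, 1):
            continue  # blank or short line
        try:
            score = float(fields[column])
        except ValueError:
            continue  # header or non-numeric score
        edges.append((fields[0].lower(), fields[1].lower(), score))
    return edges


nodes = parse_nodes(">abl1_human 10 gene associated with ...\n>P12345 kinase\n")
table = "ABL1_HUMAN p12345 87.5\nabl1_human q99999 40\n"
# IDs that appear in the table but not in the instance file are ignored.
edges = [e for e in parse_edges(table, "self3") if e[0] in nodes and e[1] in nodes]
```

IDs that appear only in the instance file simply end up with no links, matching the behaviour described above.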