A Python package for splitting machine learning datasets using graph partitioning

Detailed description of the mlnods Python project


mlNODS

Split machine learning datasets using graph partitioning

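Proper evaluation requires an appropriate split into training and evaluation datasets, which in turn requires clustering. For many problems, single-linkage clustering is sufficient. Having encountered a problem that this standard procedure could not solve, we developed a simple graph-based tool for creating unique datasets.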

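mlNODS is a graph-based method for splitting an original dataset into non-overlapping sets in cases where the data cannot be grouped that way without removing some of it. mlNODS optimizes for two constraints: (1) retain as many data points as possible, and (2) eliminate any overlap between two split sets. Nodes in the graph are the original data points, and links are measures of similarity between nodes (e.g. sequence similarity for a set of proteins). The method first builds the complete graph and then removes nodes to satisfy these constraints on the similarity table. mlNODS is applicable to any problem and has the added benefit of allowing overlap within a set (i.e. training on homologs) while disallowing overlap between sets (i.e. training and testing sets do not overlap).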
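The idea described above can be illustrated with a small greedy sketch. This is not the mlNODS implementation: the function name `split_without_overlap`, the round-robin initial assignment, the "remove the node with the most cross-split links" heuristic, and the choice to treat a link as an overlap when its score is greater than or equal to the cutoff are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_without_overlap(nodes, edges, n_splits, cutoff, seed=0):
    """Hypothetical greedy sketch, not the actual mlNODS algorithm.

    nodes: iterable of IDs; edges: iterable of (id_a, id_b, score).
    Returns (splits, removed): a list of ID lists and the dropped IDs.
    """
    rng = random.Random(seed)
    order = sorted(nodes)
    rng.shuffle(order)  # reproducible for a fixed seed
    # Assumed starting point: deal nodes out round-robin.
    split_of = {node: i % n_splits for i, node in enumerate(order)}

    # Keep only links that count as "similar" (assumption: score >= cutoff).
    neighbours = defaultdict(set)
    for a, b, score in edges:
        if a != b and a in split_of and b in split_of and score >= cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    def cross_links(node):
        # Number of similar neighbours currently placed in a different split.
        return sum(
            1 for other in neighbours[node]
            if other in split_of and split_of[other] != split_of[node]
        )

    removed = []
    while split_of:
        counts = {node: cross_links(node) for node in split_of}
        worst = max(counts, key=lambda n: (counts[n], n))  # deterministic ties
        if counts[worst] == 0:
            break  # constraint (2) holds: no overlap between splits
        del split_of[worst]  # sacrifice one data point, see constraint (1)
        removed.append(worst)

    splits = [
        sorted(n for n, s in split_of.items() if s == i)
        for i in range(n_splits)
    ]
    return splits, removed


if __name__ == "__main__":
    demo_edges = [("a", "b", 0.9), ("b", "c", 0.8), ("d", "e", 0.1)]
    print(split_without_overlap("abcde", demo_edges, n_splits=2, cutoff=0.5))
```

Links inside a split are never counted, so similar points may stay together in the same set, which mirrors the "overlap within a set is allowed" property described above. A greedy removal like this gives no guarantee of retaining the maximum possible number of data points.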

usage: mlnods [-h] -s SPLITS -c CUTOFF [-l LIMIT] -e EDGES_FILE
                [-f EDGES_FORMAT] -n NODES_FILE [-a] [-r RANDOM] [-o OUTFOLDER] [-v] [-q] [--version]

This is a script that will create independent sets of data

Version: 1.0 [03/14/20]

optional arguments:
  -h, --help            show this help message and exit
  -s SPLITS, --splits SPLITS
                        number of splits required
  -c CUTOFF, --cutoff CUTOFF
                        similarity cutoff in the units of link scores
  -l LIMIT, --limit LIMIT
                        limit on the number of links for each node (default=0, i.e. no limit)
  -e EDGES_FILE, --edges EDGES_FILE
                        file containing a table of instances with link scores for each pair
  -f EDGES_FORMAT, --format EDGES_FORMAT
                        format of the table file

                        blast     : takes a list of -m 9 formatted blast files and builds a table based on seqID
                        hssp      : takes a list of -m 9 formatted blast files, runs HSSP scoring script and builds an HSSP distance table
                        self<int> : space/tab separated table file, similarity score in column <int>
                                    e.g. "ID1 ID2 similarity_score" will be addressed as self3 (default=self5)
  -n NODES_FILE, --nodes NODES_FILE
                        instance file containing IDs of all instances being considered

                        IDs are case-insensitive (e.g. ABC = abc)
                        IDs are always preceded by ">" and followed by a white space.
                        No white spaces are allowed in an ID.
                        If a score is provided for an ID, it should be surrounded by spaces and directly follow the ID
                        (e.g. >abl1_human 10 gene associated with ....)
                        Everything between two IDs is printed in the junction files, but not considered in evaluation
  -a, --abundance       how the score of each instance is determined

                        false : score retrieved from instance file, range [0-100], default=50 when missing
                        true  : score approximated by actual number of times an ID appears in the instance file
  -r RANDOM, --random RANDOM
                        set a fixed random seed to generate consistent partitions
  -o OUTFOLDER, --outfolder OUTFOLDER
                        path to output folder (default=<current directory>)
  -v, --verbose         set verbosity level
  -q, --quiet           no logging to stdout
  --version             show program's version number and exit

If an ID is present in the instance file but not in the table file, it is considered not to be linked to anything else.
If an ID is present in the table file but not in the instance file, it is ignored.
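The input conventions described in the help text can be sketched in Python as follows. This is an illustrative reading of the documented rules, not code from mlnods: the function names `parse_nodes` and `parse_edges` are hypothetical, and the sketch assumes that each ID starts a line, that scores are plain numbers, and that the two IDs of a `self<int>` table sit in columns 1 and 2.

```python
import re

# ">" + ID without whitespace, optionally followed by a numeric score that is
# itself delimited by whitespace (assumption: the ID opens the line).
ID_LINE = re.compile(r"^>(\S+)(?:\s+(\d+(?:\.\d+)?)(?=\s|$))?")


def parse_nodes(text, abundance=False, default_score=50.0):
    """Hypothetical reader for the instance (nodes) file; returns {id: score}."""
    scores = {}
    for line in text.splitlines():
        match = ID_LINE.match(line)
        if not match:
            continue  # free text between two IDs is not evaluated
        node_id = match.group(1).lower()  # IDs are case-insensitive
        if abundance:
            # -a: approximate the score by how often the ID appears
            scores[node_id] = scores.get(node_id, 0.0) + 1.0
        elif match.group(2) is not None:
            scores[node_id] = float(match.group(2))
        else:
            scores[node_id] = default_score  # documented default when missing
    return scores


def parse_edges(text, score_column, known_ids):
    """Hypothetical reader for a self<int> table; score_column is 1-based."""
    edges = []
    for line in text.splitlines():
        fields = line.split()  # space- or tab-separated
        if len(fields) < max(score_column, 2):
            continue
        a, b = fields[0].lower(), fields[1].lower()
        if a not in known_ids or b not in known_ids:
            continue  # IDs missing from the instance file are ignored
        try:
            edges.append((a, b, float(fields[score_column - 1])))
        except ValueError:
            continue  # skip rows whose score column is not numeric
    return edges


if __name__ == "__main__":
    nodes_text = ">ABL1_human 10 gene associated with ...\n>src_human\n>lonely 75\n"
    edges_text = "abl1_human SRC_HUMAN 0.62\nabl1_human unknown_id 0.90\n"
    scores = parse_nodes(nodes_text)
    print(scores)
    print(parse_edges(edges_text, 3, scores))  # "self3" layout
```

In this sketch an ID such as `lonely`, which appears in the instance file but in no table row, simply ends up with no edges, which corresponds to the "not linked to anything else" rule above.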

mlnods was developed by Yana Bromberg and refactored by Maximilian Miller.

Feel free to contact us for support at services@bromberglab.org.
This software is licensed under [NPOSL-3.0](http://opensource.org/licenses/NPOSL-3.0)
