High-performance phylogenetic diversity calculations
UniFrac
Canonically pronounced yew-nih-frak
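The de facto repository for high-performance phylogenetic diversity calculations. The methods in this repository are based on an implementation of the Strided State UniFrac algorithm, which is faster and uses less memory than Fast UniFrac. Strided State UniFrac supports Unweighted UniFrac, Weighted UniFrac, Generalized UniFrac, Variance Adjusted UniFrac and meta UniFrac. This repository also includes Stacked Faith (manuscript in preparation), a method for calculating Faith's PD that is faster and uses less memory than the Fast UniFrac-based reference implementation.
This repository produces a C API exposed via a shared library, which can be linked against from any programming language.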
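As a rough illustration of the quantities named above (not of the Strided State UniFrac or Stacked Faith algorithms themselves), the following sketch computes Faith's PD and Unweighted UniFrac naively on a toy tree. The four-tip tree, its branch lengths, and the function names are all assumptions made up for this example.

```python
# Hypothetical toy tree: each node maps to (parent, length of the branch above it).
# The root has no entry because it has no branch above it.
TREE = {
    "t1": ("A", 1.0),
    "t2": ("A", 1.0),
    "t3": ("B", 1.0),
    "t4": ("B", 1.0),
    "A": ("root", 1.0),
    "B": ("root", 1.0),
}


def observed_branches(tips):
    """Return every branch on a path from one of the given tips to the root."""
    seen = set()
    for node in tips:
        # Walk upward; stop at the root or at a branch already collected.
        while node in TREE and node not in seen:
            seen.add(node)
            node = TREE[node][0]
    return seen


def faith_pd(tips):
    """Faith's PD: total length of the branches spanned by a sample's tips."""
    return sum(TREE[n][1] for n in observed_branches(tips))


def unweighted_unifrac(tips_a, tips_b):
    """Unweighted UniFrac: branch length unique to one sample / all observed."""
    a, b = observed_branches(tips_a), observed_branches(tips_b)
    total = sum(TREE[n][1] for n in a | b)
    unique = sum(TREE[n][1] for n in a ^ b)
    return unique / total if total else 0.0


sample_1 = {"t1", "t2"}
sample_2 = {"t2", "t3"}
print(faith_pd(sample_1))                      # branches t1, t2, A
print(unweighted_unifrac(sample_1, sample_2))  # unique: t1, t3, B
```

This brute-force version revisits the tree for every pair of samples; it is only meant to make the definitions concrete.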
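Citation
A detailed description of the Strided State UniFrac algorithm can be found in McDonald et al. 2018 Nature Methods. Please note that this package implements multiple UniFrac variants, which may have their own citations. Details can be found in the citations section of the command line interface's help output, and are included immediately below: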
ssu
For UniFrac, please see:
McDonald et al. Nature Methods 2018; DOI: 10.1038/s41592-018-0187-8
Lozupone and Knight Appl Environ Microbiol 2005; DOI: 10.1128/AEM.71.12.8228-8235.2005
Lozupone et al. Appl Environ Microbiol 2007; DOI: 10.1128/AEM.01996-06
Hamady et al. ISME 2010; DOI: 10.1038/ismej.2009.97
Lozupone et al. ISME 2011; DOI: 10.1038/ismej.2010.133
For Generalized UniFrac, please see:
Chen et al. Bioinformatics 2012; DOI: 10.1093/bioinformatics/bts342
For Variance Adjusted UniFrac, please see:
Chang et al. BMC Bioinformatics 2011; DOI: 10.1186/1471-2105-12-118
faithpd
For Faith's PD, please see:
Faith Biological Conservation 1992; DOI: 10.1016/0006-3207(92)91201-3
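Install
At this time, there are two primary ways to install the library: through QIIME 2 or via pip. It is also possible to clone the repository and install using either sucpp/Makefile or setup.py.
Compilation has been performed with LLVM 9.0.0 (OS X >= 10.12) and GCC 4.9.2 (CentOS >= 6), with HDF5 >= 1.8.17. The Python installation requires Python >= 3.5, NumPy >= 1.12.1, scikit-bio >= 0.5.1, and Cython >= 0.28.3.
Installation should take a few minutes at most.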
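To check an environment against the minimum versions listed above, a small sketch like the following may help. The helper names are hypothetical, and the comparison only looks at the leading digits of each dot-separated component, so pre-release suffixes are ignored.

```python
import re

# Minimum versions as stated in this README.
MINIMUMS = {"numpy": "1.12.1", "scikit-bio": "0.5.1", "cython": "0.28.3"}


def parse_version(text):
    """Turn '1.12.1' into (1, 12, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions numerically, so that 1.9.3 sorts below 1.12.1."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("1.16.0", MINIMUMS["numpy"]))
print(meets_minimum("1.9.3", MINIMUMS["numpy"]))
```

On Python 3.8 or newer, the installed version string could be read with importlib.metadata.version("numpy") and passed to meets_minimum.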
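Install (QIIME 2)
The easiest way to use this library is through QIIME 2. The implementation of this algorithm is installed by default and is available under qiime diversity beta-phylogenetic-alt.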
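Install (native)
To install, the binaries first need to be compiled. This assumes that the HDF5 toolchain and libraries are available. More information about how to set up the stack can be found here.
Assuming h5c++ is in your path, the following should work: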
pip install -e .
Note: if you are using conda, we recommend installing HDF5 from the conda-forge channel, for example:
conda install -c conda-forge hdf5
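Examples of use
Below are a few light examples of different ways to use this library.
QIIME 2
To use Strided State UniFrac through QIIME 2, you need to provide a FeatureTable[Frequency] artifact and a Phylogeny[Rooted] artifact. For example: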
qiime diversity beta-phylogenetic --i-table table-evenly-samples.qza \
--i-phylogeny a-tree.qza \
--o-distance-matrix resulting-distance-matrix.qza \
--p-metric unweighted_unifrac
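Python
The library can be accessed directly from within Python. In this mode, the API methods expect a filepath to a BIOM-Format v2.1.0 table and a filepath to a Newick formatted phylogeny.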
$ python
Python 3.5.4 | packaged by conda-forge | (default, Aug 10 2017, 01:41:15)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unifrac
>>> dir(unifrac)
['__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_api', '_meta', '_methods', 'generalized', 'meta', 'pkg_resources', 'ssu', 'stacked_faith', 'unweighted', 'weighted_normalized', 'weighted_unnormalized']
>>> print(unifrac.unweighted.__doc__)
Compute Unweighted UniFrac
Parameters
----------
table : str
A filepath to a BIOM-Format 2.1 file.
phylogeny : str
A filepath to a Newick formatted tree.
threads : int, optional
The number of threads to use. Default of 1.
variance_adjusted : bool, optional
Adjust for varianace or not. Default is False.
bypass_tips : bool
Bypass the tips of the tree in the computation. This reduces compute
by about 50%, but is an approximation.
Returns
-------
skbio.DistanceMatrix
The resulting distance matrix.
Raises
------
IOError
If the tree file is not found
If the table is not found
ValueError
If the table does not appear to be BIOM-Format v2.1.
If the phylogeny does not appear to be in Newick format.
Notes
-----
Unweighted UniFrac was originally described in [1]_. Variance Adjusted
UniFrac was originally described in [2]_, and while its application to
Unweighted UniFrac was not described, factoring in the variance adjustment
is still feasible and so it is exposed.
References
----------
.. [1] Lozupone, C. & Knight, R. UniFrac: a new phylogenetic method for
comparing microbial communities. Appl. Environ. Microbiol. 71, 8228-8235
(2005).
.. [2] Chang, Q., Luan, Y. & Sun, F. Variance adjusted weighted UniFrac: a
powerful beta diversity measure for comparing communities based on
phylogeny. BMC Bioinformatics 12:118 (2011).
>>> print(unifrac.faith_pd.__doc__)
Execute a call to the Stacked Faith API in the UniFrac package
Parameters
----------
biom_filename : str
A filepath to a BIOM 2.1 formatted table (HDF5)
tree_filename : str
A filepath to a Newick formatted tree
Returns
-------
pd.Series
Series of Faith's PD for each sample in `biom_filename`
Raises
------
IOError
If the tree file is not found
If the table is not found
If the table is empty
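Outside an interactive session, the same call can be wrapped in a small script. The sketch below relies only on the unifrac.unweighted parameters shown in the docstring above (table, phylogeny, threads); the wrapper name run_unweighted and the up-front path check are assumptions added for illustration, not part of the package.

```python
import os


def run_unweighted(table_path, tree_path, threads=1):
    """Check that both inputs exist, then call unifrac.unweighted."""
    for path in (table_path, tree_path):
        if not os.path.isfile(path):
            raise IOError("input file not found: %s" % path)
    # Imported lazily so that a missing input is reported before the
    # compiled extension is loaded.
    import unifrac
    return unifrac.unweighted(table_path, tree_path, threads=threads)


# Example call (requires the package and the test files named in this README):
#   dm = run_unweighted("sucpp/test.biom", "sucpp/test.tre")
#   print(dm.ids)    # sample identifiers of the skbio.DistanceMatrix
#   print(dm.data)   # square array of pairwise distances
```

Since the documented return type is skbio.DistanceMatrix, the ids and data attributes used in the comments come from scikit-bio rather than from this package.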
Command line
The methods can also be used directly through the command line after installation:
$ which ssu
/Users/<username>/miniconda3/envs/qiime2-20xx.x/bin/ssu
$ ssu --help
usage: ssu -i <biom> -o <out.dm> -m [METHOD] -t <newick> [-n threads] [-a alpha] [--vaw]
-i The input BIOM table.
-t The input phylogeny in newick.
-m The method, [unweighted | weighted_normalized | weighted_unnormalized | generalized].
-o The output distance matrix.
-n [OPTIONAL] The number of threads, default is 1.
-a [OPTIONAL] Generalized UniFrac alpha, default is 1.
-f [OPTIONAL] Bypass tips, reduces compute by about 50%.
--vaw [OPTIONAL] Variance adjusted, default is to not adjust for variance.
Citations:
For UniFrac, please see:
Lozupone and Knight Appl Environ Microbiol 2005; DOI: 10.1128/AEM.71.12.8228-8235.2005
Lozupone et al. Appl Environ Microbiol 2007; DOI: 10.1128/AEM.01996-06
Hamady et al. ISME 2010; DOI: 10.1038/ismej.2009.97
Lozupone et al. ISME 2011; DOI: 10.1038/ismej.2010.133
For Generalized UniFrac, please see:
Chen et al. Bioinformatics 2012; DOI: 10.1093/bioinformatics/bts342
For Variance Adjusted UniFrac, please see:
Chang et al. BMC Bioinformatics 2011; DOI: 10.1186/1471-2105-12-118
$ which faithpd
/Users/<username>/miniconda3/envs/qiime2-20xx.x/bin/faithpd
$ faithpd --help
usage: faithpd -i <biom> -t <newick> -o <out.txt>
-i The input BIOM table.
-t The input phylogeny in newick.
-o The output series.
Citations:
For Faith's PD, please see:
Faith Biological Conservation 1992; DOI: 10.1016/0006-3207(92)91201-3
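When ssu is driven from a pipeline, it can help to assemble the argument list in one place. The following is a minimal sketch that uses only the flags printed in the help output above; the function names build_ssu_command and run_ssu are hypothetical, and run_ssu assumes the ssu executable is on the PATH.

```python
import subprocess

# Method names as listed under -m in the ssu help output.
METHODS = ("unweighted", "weighted_normalized",
           "weighted_unnormalized", "generalized")


def build_ssu_command(table, tree, output, method="unweighted", threads=1,
                      alpha=None, bypass_tips=False, variance_adjusted=False):
    """Return the argv list for one ssu invocation."""
    if method not in METHODS:
        raise ValueError("unknown method: %s" % method)
    cmd = ["ssu", "-i", table, "-t", tree, "-m", method,
           "-o", output, "-n", str(threads)]
    if alpha is not None:
        cmd += ["-a", str(alpha)]   # Generalized UniFrac alpha
    if bypass_tips:
        cmd.append("-f")            # approximation that skips the tips
    if variance_adjusted:
        cmd.append("--vaw")
    return cmd


def run_ssu(**kwargs):
    """Run ssu and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_ssu_command(**kwargs), check=True)


print(" ".join(build_ssu_command("sucpp/test.biom", "sucpp/test.tre", "test.out")))
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.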
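Shared library access
In addition to the above methods of accessing UniFrac, it is also possible to link against the shared library. The C API is described in sucpp/api.hpp, and examples of linking against this API can be found in examples/.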
Minor test dataset
A small test .biom and .tre can be found in sucpp/. An example with expected output is shown below; it should execute in tens of milliseconds:
$ ssu -i sucpp/test.biom -t sucpp/test.tre -m unweighted -o test.out
$ cat test.out
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
Sample1 0 0.2 0.5714285714285714 0.6 0.5 0.2
Sample2 0.2 0 0.4285714285714285 0.6666666666666666 0.6 0.3333333333333333
Sample3 0.5714285714285714 0.4285714285714285 0 0.7142857142857143 0.8571428571428571 0.4285714285714285
Sample4 0.6 0.6666666666666666 0.7142857142857143 0 0.3333333333333333 0.4
Sample5 0.5 0.6 0.8571428571428571 0.3333333333333333 0 0.6
Sample6 0.2 0.3333333333333333 0.4285714285714285 0.4 0.6 0
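To compare a run against the expected output programmatically, the text file can be parsed with the standard library. This sketch assumes the layout shown above: a header line of sample IDs followed by one whitespace-separated row per sample. The function names are hypothetical, and the embedded text is an excerpt (the first three samples) of the output above.

```python
EXCERPT = """\
Sample1 Sample2 Sample3
Sample1 0 0.2 0.5714285714285714
Sample2 0.2 0 0.4285714285714285
Sample3 0.5714285714285714 0.4285714285714285 0
"""


def parse_distance_matrix(text):
    """Parse the ssu text output into (ids, nested dict of distances)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ids = lines[0].split()
    matrix = {}
    for line in lines[1:]:
        fields = line.split()
        row_id, values = fields[0], [float(v) for v in fields[1:]]
        if len(values) != len(ids):
            raise ValueError("row %s has %d values, expected %d"
                             % (row_id, len(values), len(ids)))
        matrix[row_id] = dict(zip(ids, values))
    return ids, matrix


def is_symmetric_hollow(ids, matrix, tol=1e-12):
    """A distance matrix should be symmetric with zeros on the diagonal."""
    symmetric = all(abs(matrix[a][b] - matrix[b][a]) <= tol
                    for a in ids for b in ids)
    hollow = all(matrix[a][a] == 0.0 for a in ids)
    return symmetric and hollow


ids, matrix = parse_distance_matrix(EXCERPT)
print(matrix["Sample1"]["Sample3"])
print(is_symmetric_hollow(ids, matrix))
```

The same check could be applied to the full test.out by reading the file and passing its contents to parse_distance_matrix.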