NanoSim-H
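About
NanoSim-H is a simulator of Oxford Nanopore reads that captures the technology-specific features of ONT data and allows for adjustments as Nanopore sequencing technology improves. NanoSim-H is derived from NanoSim, a software package developed by Chen Yang at Canada's Michael Smith Genome Sciences Centre. The fork was created from version 1.0.1, and the versions of NanoSim-H and NanoSim are kept synchronized.
NanoSim-H is implemented in Python, with R used for model fitting. Reads can be simulated in silico from a given reference genome using nanosim-h. The NanoSim-H package is distributed with several precomputed error profiles, but additional profiles can be computed using nanosim-h-train.
The main improvements compared to NanoSim are: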
Quick example
Simulation of 100 reads from an E. coli genome.
pip install --upgrade nanosim-h
curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&val=545778205&sendto=on" | \
    nanosim-h -n 100 -
Installation
From BioConda (recommended):
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install -y nanosim-h
From PyPI:
pip install --upgrade nanosim-h
From GitHub:
git clone https://github.com/karel-brinda/nanosim-h
cd nanosim-h
pip install --upgrade .
or
git clone https://github.com/karel-brinda/nanosim-h
cd nanosim-h
python setup.py install
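Dependencies:
For read simulation:
For computing new error profiles:
When installed using BioConda, all NanoSim-H dependencies are installed automatically. When installed using pip, all dependencies for read simulation are installed automatically.
Read simulation
The simulation stage takes a reference genome and, optionally, an error profile as input, and outputs simulated reads in FASTA format.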
$ nanosim-h --help
usage: nanosim-h [-h] [-v] [-p str] [-o str] [-n int] [-u float] [-m float]
                 [-i float] [-d float] [-s int] [--circular] [--perfect]
                 [--merge-contigs] [--rnf] [--rnf-add-cigar] [--max-len int]
                 [--min-len int] [--kmer-bias int]
                 <reference.fa>

Program: NanoSim-H - a simulator of Oxford Nanopore reads.
Version: 1.1.0.4
Authors: Chen Yang <cheny@bcgsc.ca> - author of the original software
           package (NanoSim)
         Karel Brinda <kbrinda@hsph.harvard.edu> - author of the NanoSim-H
           fork

positional arguments:
  <reference.fa>        reference genome (- for standard input)

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -p str, --profile str
                        error profile - one of precomputed profiles
                        ('ecoli_R7.3', 'ecoli_R7', 'ecoli_R9_1D',
                        'ecoli_R9_2D', 'yeast', 'ecoli_UCSC1b') or own
                        directory with an error profile [ecoli_R9_2D]
  -o str, --out-pref str
                        prefix of output file [simulated]
  -n int, --number int  number of generated reads [10000]
  -u float, --unalign-rate float
                        rate of unaligned reads [detect from the error
                        profile]
  -m float, --mis-rate float
                        mismatch rate (weight tuning) [1.0]
  -i float, --ins-rate float
                        insertion rate (weight tuning) [1.0]
  -d float, --del-rate float
                        deletion rate (weight tuning) [1.0]
  -s int, --seed int    initial seed for the pseudorandom number generator (0
                        for random) [42]
  --circular            circular simulation (linear otherwise)
  --perfect             output perfect reads, no mutations
  --merge-contigs       merge contigs from the reference
  --rnf                 use RNF format for read names
  --rnf-add-cigar       add cigar to RNF names (not fully debugged, yet)
  --max-len int         maximum read length [inf]
  --min-len int         minimum read length [50]
  --kmer-bias int       prohibits homopolymers with length >= n bases in
                        output reads [6]

Examples: nanosim-h --circular ecoli_ref.fasta
          nanosim-h --circular --perfect ecoli_ref.fasta
          nanosim-h -p yeast --kmer-bias 0 yeast_ref.fasta

Notice: the use of `max-len` and `min-len` will affect the read length
        distributions. If the range between `max-len` and `min-len` is too
        small, the program will run slowlier accordingly.
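When nanosim-h is driven from a Python script, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only flags listed in the help text above; the function name `build_nanosim_h_command` and its keyword arguments are hypothetical and are not part of NanoSim-H.

```python
def build_nanosim_h_command(reference, number=None, profile=None,
                            out_prefix=None, seed=None,
                            circular=False, perfect=False):
    """Assemble an argv list for nanosim-h (hypothetical helper).

    Only flags documented in `nanosim-h --help` are emitted. Options left
    as None are omitted so that nanosim-h applies its own defaults.
    """
    cmd = ["nanosim-h"]
    if profile is not None:
        cmd += ["-p", str(profile)]      # precomputed profile name or directory
    if out_prefix is not None:
        cmd += ["-o", str(out_prefix)]   # prefix of output files
    if number is not None:
        cmd += ["-n", str(number)]       # number of generated reads
    if seed is not None:
        cmd += ["-s", str(seed)]         # PRNG seed (0 for random)
    if circular:
        cmd.append("--circular")
    if perfect:
        cmd.append("--perfect")
    cmd.append(str(reference))           # positional <reference.fa>, "-" for stdin
    return cmd


cmd = build_nanosim_h_command("ecoli_ref.fasta", number=100, circular=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where nanosim-h is installed and on the PATH.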
Examples:
If you want to simulate reads from an E. coli genome, circular mode should be used because the genome is circular.
nanosim-h --circular ecoli_ref.fasta
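To illustrate what the difference between circular and linear simulation means for a single read, here is a minimal sketch of extracting a segment from a reference. It is not NanoSim-H's actual code; the function `extract_segment` is hypothetical and only shows the wrap-around idea.

```python
def extract_segment(genome, start, length, circular=False):
    """Return up to `length` bases of `genome` starting at 0-based `start`.

    In circular mode the segment wraps around the end of the sequence.
    In linear mode it is truncated at the end of the sequence.
    """
    if not genome:
        raise ValueError("genome must not be empty")
    if not 0 <= start < len(genome):
        raise ValueError("start is outside the genome")
    if circular:
        # Index modulo the genome length so the read continues at position 0.
        return "".join(genome[(start + i) % len(genome)] for i in range(length))
    return genome[start:start + length]


genome = "ACGTAC"
print(extract_segment(genome, 4, 4, circular=True))   # wraps around the end
print(extract_segment(genome, 4, 4, circular=False))  # truncated at the end
```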
If you want to simulate only perfect reads, i.e., reads without SNPs or indels, so that only the read length distribution is simulated:
nanosim-h --circular --perfect ecoli_ref.fasta
If you want to simulate reads from an S. cerevisiae genome with no k-mer bias, linear mode should be chosen because the genome is linear.
nanosim-h -p yeast --kmer-bias 0 yeast_ref.fasta
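The help text describes `--kmer-bias` as prohibiting homopolymers of length >= n bases in output reads. The sketch below shows one way to check a read against such a limit after simulation. The helper names are hypothetical, and treating n = 0 as "no limit" is an assumption based on the example above, where `--kmer-bias 0` is used to obtain reads without k-mer bias.

```python
from itertools import groupby


def longest_homopolymer(seq):
    """Length of the longest run of identical bases in `seq` (0 if empty)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq.upper())),
               default=0)


def violates_kmer_bias(seq, n):
    """True if `seq` contains a homopolymer of length >= n.

    Assumption: n == 0 disables the check entirely.
    """
    return n > 0 and longest_homopolymer(seq) >= n


print(longest_homopolymer("ACGGGGTTA"))
print(violates_kmer_bias("AAAAAAC", 6))
```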
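Output files:
simulated.log – Log file of the simulation process.
simulated.fa – FASTA file of simulated reads. The read names can contain information about how the reads were created, either in RNF or in the original NanoSim naming convention.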
RNF naming convention
See the associated RNF paper and RNF specification.
NanoSim naming convention
Each read has “unaligned”, “aligned”, or “perfect” in its header, which determines its error rate. “unaligned” means that the read has an error rate over 90% and cannot be aligned. “aligned” reads have the same error rate as the training reads. “perfect” reads have no errors.
To explain the information in the header, we have two examples:
- ^{tt8}$
- All information before the first ^{tt9}$ are chromosome information. ^{tt10}$ is the start position and unaligned suggesting it should be unaligned to the reference. The first ^{tt11}$ is the sequence index. ^{tt12}$ represents a forward strand. ^{tt13}$ means that sequence length extracted from the reference is 3236 bases.
- ^{tt14}$
- This is an aligned read coming from chromosome XI at position 115406. ^{tt15}$ is the sequence index. R represents a reverse complement strand. ^{tt16}$ means that this read has 92-base head region (cannot be aligned), followed by 12710 bases of middle region, and then 2-base tail region.
The information in the header can help users to locate the read easily.
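The header fields described above can be pulled apart with a few lines of Python. This is a minimal sketch under stated assumptions: fields are taken to be separated by underscores (consistent with the head/middle/tail notation such as 92_12710_2), and the sample header, including the chromosome label `chrXI` and the sequence index 7, is made up for illustration. The function `parse_nanosim_header` is hypothetical and not part of NanoSim-H.

```python
from collections import namedtuple

ReadInfo = namedtuple(
    "ReadInfo", "chrom position status index strand head middle tail")


def parse_nanosim_header(header):
    """Parse a NanoSim-style read name into its fields (assumed layout).

    Assumed layout: <chrom>_<position>_<status>_<index>_<strand>_<head>_<middle>_<tail>
    """
    name = header.lstrip(">").split()[0]
    # Split from the right so that underscores inside the chromosome
    # information do not shift the remaining seven fields.
    fields = name.rsplit("_", 7)
    if len(fields) != 8:
        raise ValueError("unexpected header layout: %r" % header)
    chrom, position, status, index, strand, head, middle, tail = fields
    if status not in ("aligned", "unaligned", "perfect"):
        raise ValueError("unknown read status: %r" % status)
    if strand not in ("F", "R"):
        raise ValueError("unknown strand: %r" % strand)
    return ReadInfo(chrom, int(position), status, int(index), strand,
                    int(head), int(middle), int(tail))


# Hypothetical header modelled on the second example in the text.
info = parse_nanosim_header(">chrXI_115406_aligned_7_R_92_12710_2")
print(info.position, info.strand, info.head + info.middle + info.tail)
```

Splitting from the right is a design choice that keeps the parser working when the chromosome information itself contains underscores.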
simulated.errors.txt – List of introduced errors.
The output contains error type, position, original bases and current bases.
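Error profiles
The characterization stage takes a reference and a training read set in FASTA format as input. Users can also provide their own alignment file in MAF format.
Profiles distributed with NanoSim-H: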
- ecoli_R7
- ecoli_R7.3
- ecoli_R9_1D
- ecoli_R9_2D (default error profile for read simulation)
- ecoli_UCSC1b
- yeast
New error profiles:
A new error profile can be obtained using the nanosim-h-train command.
$ nanosim-h-train --help
usage: nanosim-h-train [-h] [-v] [-i str] [-m str] [-b int] [--no-model-fit]
                       <reference.fa> <profile.dir>

Program: NanoSim-H-Train - compute an error profile for NanoSim-H.
Version: 1.1.0.4
Authors: Chen Yang <cheny@bcgsc.ca> - author of the original software
           package (NanoSim)
         Karel Brinda <kbrinda@hsph.harvard.edu> - author of the NanoSim-H
           fork

positional arguments:
  <reference.fa>        reference genome of the training reads
  <profile.dir>         error profile dir

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i str, --infile str  training ONT real reads, must be fasta files
  -m str, --maf str     user can provide their own alignment file, with maf
                        extension
  -b int, --num-bins int
                        number of bins (for development) [20]
  --no-model-fit        no model fitting
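Files associated with an error profile:
- aligned_length_ecdf – Length distribution of aligned regions on aligned reads.
- aligned_reads_ecdf – Length distribution of aligned reads.
- align_ratio – Empirical distribution of the align ratio of each read.
- besthit.maf – The best alignment of each read, based on length.
- match.hist, mis.hist, ins.hist, del.hist – Histograms of matches, mismatches, insertions, and deletions.
- first_match.hist – Histogram of the length of the first match in each alignment.
- error_markov_model – Markov model of error types.
- ht_ratio – Empirical distribution of the head region vs. the total unaligned region.
- training.maf – The output of LAST, an alignment file in MAF format.
- match_markov_model – Markov model of the length of matches (stretches of correct base calls).
- model_profile – Fitted model for errors.
- processed.maf – A re-formatted MAF file for the user-provided alignment file.
- unaligned_length_ecdf – Length distribution of unaligned reads.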
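After running nanosim-h-train, it can be useful to see which of the files listed above ended up in the profile directory. The sketch below assumes that the files are stored directly in the directory under exactly the names listed; this is an assumption about the on-disk layout, and the helper `check_profile_dir` is hypothetical. Not every file is necessarily expected in every profile (for example, training.maf and processed.maf correspond to different kinds of input), so an absent file is not automatically an error.

```python
import os

# File names taken from the list above.
PROFILE_FILES = [
    "aligned_length_ecdf", "aligned_reads_ecdf", "align_ratio",
    "besthit.maf", "match.hist", "mis.hist", "ins.hist", "del.hist",
    "first_match.hist", "error_markov_model", "ht_ratio", "training.maf",
    "match_markov_model", "model_profile", "processed.maf",
    "unaligned_length_ecdf",
]


def check_profile_dir(profile_dir):
    """Return (present, absent) lists of the documented profile files."""
    present = [name for name in PROFILE_FILES
               if os.path.isfile(os.path.join(profile_dir, name))]
    absent = [name for name in PROFILE_FILES if name not in present]
    return present, absent
```

A call such as `check_profile_dir("my_profile")` (with a hypothetical directory name) returns the two lists, which can then be printed or logged.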