NanoSim-H

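About

NanoSim-H is a simulator of Oxford Nanopore reads that captures the technology-specific features of ONT data and allows for adjustments as nanopore sequencing technology improves. NanoSim-H was created as a fork of NanoSim, a software package developed by Chen Yang at Canada's Michael Smith Genome Sciences Centre. The fork was created from version 1.0.1, and the versions of NanoSim-H and NanoSim are kept synchronized.

NanoSim-H is implemented in Python and uses R for model fitting. In silico reads can be simulated from a given reference genome using nanosim-h. The NanoSim-H package is distributed with several precomputed error profiles, and additional profiles can be computed using nanosim-h-train.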

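The main improvements compared to NanoSim are:

Support for Python 3
Support for RNF read names
Installation from PyPI
Error profiles distributed with the main package
Automatic testing using Travis
Reproducible simulations (setting a seed for the PRG)
Improved interface with new parameters (e.g., for merging all contigs) and a progress bar
Several minor bugs fixed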
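
The effect of a fixed seed can be illustrated with a minimal sketch. It is not NanoSim-H's internal code: the function name `sample_read_starts` and its parameters are hypothetical, and only Python's standard `random` module is used. Two runs with the same seed draw the same read start positions, which is what makes a simulation reproducible.

```python
import random


def sample_read_starts(genome_length, n_reads, seed):
    """Draw n_reads start positions with a dedicated, seeded generator."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.randrange(genome_length) for _ in range(n_reads)]


if __name__ == "__main__":
    first = sample_read_starts(4600000, 5, seed=42)
    second = sample_read_starts(4600000, 5, seed=42)
    print(first == second)  # same seed, same positions
```

Using a private `random.Random` instance rather than the module-level functions keeps the result unaffected by other code that also draws random numbers.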

Quick example

Simulation of 100 reads from an E. coli genome.

pip install --upgrade nanosim-h
curl "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&val=545778205&sendto=on"|\
        nanosim-h -n 100 -

Installation

From Bioconda (recommended):

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install -y nanosim-h

From PyPI:

pip install --upgrade nanosim-h

From GitHub:

git clone https://github.com/karel-brinda/nanosim-h
cd nanosim-h
pip install --upgrade .

or

git clone https://github.com/karel-brinda/nanosim-h
cd nanosim-h
python setup.py install

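Dependencies:

For read simulation:

Python (2.7, 3.2-3.6)
NumPy

For computing new error profiles:

LAST (tested with version 847)
R

When installed using Bioconda, all NanoSim-H dependencies are installed automatically. When installed using pip, all dependencies for read simulation are installed automatically.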
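
Whether the dependencies listed above are available can be checked with a short script. This is a sketch under assumptions: the helper name `find_missing` is hypothetical, and the executable names `lastal` and `lastdb` (from LAST) and `Rscript` (from R) are the usual names on a standard installation, not something stated in this document.

```python
import importlib.util
import shutil


def find_missing(modules=("numpy",), executables=("lastal", "lastdb", "Rscript")):
    """Return the names of Python modules and executables that cannot be found."""
    missing = []
    for module in modules:
        # find_spec returns None when a top-level module is not importable
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(exe) is None:
            missing.append(exe)
    return missing


if __name__ == "__main__":
    missing = find_missing()
    print("missing:", ", ".join(missing) if missing else "nothing")
```

For read simulation alone, calling `find_missing(executables=())` restricts the check to NumPy.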

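Read simulation

The simulation stage takes a reference genome and, optionally, an error profile as input, and outputs simulated reads in FASTA format.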

$ nanosim-h --help
usage: nanosim-h [-h] [-v] [-p str] [-o str] [-n int] [-u float] [-m float]
                 [-i float] [-d float] [-s int] [--circular] [--perfect]
                 [--merge-contigs] [--rnf] [--rnf-add-cigar] [--max-len int]
                 [--min-len int] [--kmer-bias int]
                 <reference.fa>

Program:  NanoSim-H - a simulator of Oxford Nanopore reads.
Version:  1.1.0.4
Authors:  Chen Yang <cheny@bcgsc.ca> - author of the original software package (NanoSim)
          Karel Brinda <kbrinda@hsph.harvard.edu> - author of the NanoSim-H fork

positional arguments:
  <reference.fa>        reference genome (- for standard input)

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -p str, --profile str
                        error profile - one of precomputed profiles
                        ('ecoli_R7.3', 'ecoli_R7', 'ecoli_R9_1D',
                        'ecoli_R9_2D', 'yeast', 'ecoli_UCSC1b') or own
                        directory with an error profile [ecoli_R9_2D]
  -o str, --out-pref str
                        prefix of output file [simulated]
  -n int, --number int  number of generated reads [10000]
  -u float, --unalign-rate float
                        rate of unaligned reads [detect from the error
                        profile]
  -m float, --mis-rate float
                        mismatch rate (weight tuning) [1.0]
  -i float, --ins-rate float
                        insertion rate (weight tuning) [1.0]
  -d float, --del-rate float
                        deletion rate (weight tuning) [1.0]
  -s int, --seed int    initial seed for the pseudorandom number generator (0
                        for random) [42]
  --circular            circular simulation (linear otherwise)
  --perfect             output perfect reads, no mutations
  --merge-contigs       merge contigs from the reference
  --rnf                 use RNF format for read names
  --rnf-add-cigar       add cigar to RNF names (not fully debugged, yet)
  --max-len int         maximum read length [inf]
  --min-len int         minimum read length [50]
  --kmer-bias int       prohibits homopolymers with length >= n bases in
                        output reads [6]

Examples: nanosim-h --circular ecoli_ref.fasta
          nanosim-h --circular --perfect ecoli_ref.fasta
          nanosim-h -p yeast --kmer-bias 0 yeast_ref.fasta

Notice: the use of `max-len` and `min-len` will affect the read length distributions. If
the range between `max-len` and `min-len` is too small, the program will run slowlier accordingly.
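
For scripted use, the command line can be assembled in Python and passed to `subprocess`. The sketch below is one assumed way to wrap the tool, not part of NanoSim-H; the helper `build_command` is hypothetical and uses only options listed in the help message above.

```python
import subprocess


def build_command(reference, profile=None, out_prefix=None, number=None,
                  seed=None, circular=False, perfect=False):
    """Assemble a nanosim-h argument list from options in the help text."""
    cmd = ["nanosim-h"]
    if profile is not None:
        cmd += ["-p", profile]
    if out_prefix is not None:
        cmd += ["-o", out_prefix]
    if number is not None:
        cmd += ["-n", str(number)]
    if seed is not None:
        cmd += ["-s", str(seed)]
    if circular:
        cmd.append("--circular")
    if perfect:
        cmd.append("--perfect")
    cmd.append(reference)  # the positional argument goes last
    return cmd


if __name__ == "__main__":
    cmd = build_command("ecoli_ref.fasta", number=100, seed=1, circular=True)
    print(" ".join(cmd))
    # Uncomment to run the simulation (requires nanosim-h on PATH):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.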

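Examples:

If you want to simulate reads from an E. coli genome, use circular mode, because the genome is circular.

nanosim-h --circular ecoli_ref.fasta

If you only want perfect reads, i.e., reads without SNPs or indels, so that only the read length distribution is simulated, add --perfect.

nanosim-h --circular --perfect ecoli_ref.fasta

If you want to simulate reads from an S. cerevisiae genome without k-mer bias, use linear mode, because the genome is linear.

nanosim-h -p yeast --kmer-bias 0 yeast_ref.fasta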
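
What circular mode means for read extraction can be shown with a minimal sketch. It illustrates the general idea only and is not taken from the NanoSim-H source; the function name `extract_read` is hypothetical. On a circular genome a read may start near the end of the sequence and wrap around to the beginning, whereas on a linear genome it must fit before the end.

```python
def extract_read(genome, start, length, circular):
    """Return `length` bases starting at `start`, wrapping if circular."""
    if not 0 <= start < len(genome):
        raise ValueError("start position is outside the genome")
    if length > len(genome):
        raise ValueError("read is longer than the genome")
    end = start + length
    if end <= len(genome):
        return genome[start:end]
    if not circular:
        raise ValueError("read runs past the end of a linear genome")
    # Wrap around the origin: tail of the sequence plus its beginning
    return genome[start:] + genome[:end - len(genome)]


if __name__ == "__main__":
    genome = "ACGTACGGTT"
    print(extract_read(genome, 8, 4, circular=True))
```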

Output files:

simulated.log – log file of the simulation process.
simulated.fa – FASTA file of the simulated reads. The read names can carry information about how the reads were created, either in RNF or in the original NanoSim naming convention.
RNF naming convention
See the associated RNF paper and RNF specification.
NanoSim naming convention
Each read has "unaligned", "aligned", or "perfect" in the header, which determines its error rate. "unaligned" means that the read has an error rate over 90% and cannot be aligned. "aligned" reads have the same error rate as the training reads. "perfect" reads have no errors.
The information in the header is explained with two examples:
- An unaligned read on the forward strand.
  All information before the first `_` is chromosome information. The next field is the start position, and unaligned indicates that the read should not align to the reference. The following field is the sequence index. F represents the forward strand. `0_3236_0` means that the sequence extracted from the reference is 3236 bases long.
- An aligned read on the reverse strand.
  This is an aligned read coming from chromosome XI at position 115406. The field after aligned is the sequence index. R represents the reverse-complement strand. `92_12710_2` means that this read has a 92-base head region (which cannot be aligned), followed by 12710 bases of middle region, and then a 2-base tail region.
The information in the header helps users locate the read easily.
simulated.errors.txt – list of introduced errors.
The output contains the error type, position, original bases, and current bases.
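
A read name in the NanoSim naming convention can be split into its fields as follows. This is a sketch based on the field order described above, not a function provided by NanoSim-H; `parse_nanosim_name` is a hypothetical helper, and the header in the usage example is made up for illustration. Fields are split from the right so that underscores inside the chromosome part do not shift the remaining fields, which is an assumption beyond what the description states.

```python
from collections import namedtuple

ReadName = namedtuple(
    "ReadName", "chromosome start read_type seq_index strand head middle tail"
)


def parse_nanosim_name(header):
    """Split a NanoSim-style read name into its underscore-separated fields."""
    name = header.lstrip(">").split()[0]
    parts = name.rsplit("_", 7)  # seven trailing fields plus the chromosome part
    if len(parts) != 8:
        raise ValueError("not a NanoSim-style read name: %r" % header)
    chromosome, start, read_type, seq_index, strand, head, middle, tail = parts
    if read_type not in ("aligned", "unaligned", "perfect"):
        raise ValueError("unknown read type: %r" % read_type)
    if strand not in ("F", "R"):
        raise ValueError("unknown strand: %r" % strand)
    return ReadName(chromosome, int(start), read_type, int(seq_index), strand,
                    int(head), int(middle), int(tail))


if __name__ == "__main__":
    # Hypothetical header, written to follow the convention described above
    read = parse_nanosim_name(">chrXI_115406_aligned_7_R_92_12710_2")
    print(read.chromosome, read.start, read.strand, read.middle)
```

The head, middle, and tail lengths together give the total read length, so `read.head + read.middle + read.tail` can be compared with the length of the sequence in the FASTA record.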

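Error profiles

The characterization stage takes a reference and a training read set in FASTA format as input. Users can also provide their own alignment file in MAF format.

Profiles distributed with NanoSim-H:

ecoli_R7
ecoli_R7.3
ecoli_R9_1D
ecoli_R9_2D (default error profile for read simulation)
ecoli_UCSC1b
yeast

New error profiles:

New error profiles can be obtained using the nanosim-h-train command.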

$ nanosim-h-train --help
usage: nanosim-h-train [-h] [-v] [-i str] [-m str] [-b int] [--no-model-fit]
                       <reference.fa> <profile.dir>

Program:  NanoSim-H-Train - compute an error profile for NanoSim-H.
Version:  1.1.0.4
Authors:  Chen Yang <cheny@bcgsc.ca> - author of the original software package (NanoSim)
          Karel Brinda <kbrinda@hsph.harvard.edu> - author of the NanoSim-H fork

positional arguments:
  <reference.fa>        reference genome of the training reads
  <profile.dir>         error profile dir

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i str, --infile str  training ONT real reads, must be fasta files
  -m str, --maf str     user can provide their own alignment file, with maf
                        extension
  -b int, --num-bins int
                        number of bins (for development) [20]
  --no-model-fit        no model fitting

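Files associated with an error profile:

aligned_length_ecdf – length distribution of the aligned regions on aligned reads.
aligned_reads_ecdf – length distribution of the aligned reads.
align_ratio – empirical distribution of the align ratio of each read.
besthit.maf – the best alignment of each read, chosen by length.
match.hist, mis.hist, ins.hist, del.hist – histograms of matches, mismatches, insertions, and deletions.
first_match.hist – histogram of the first match length of each alignment.
error_markov_model – Markov model of the error types.
ht_ratio – empirical distribution of the head region relative to the total unaligned region.
training.maf – the output of LAST, an alignment file in MAF format.
match_markov_model – Markov model of the match lengths (stretches of correct base calls).
model_profile – fitted model for the errors.
processed.maf – a reformatted MAF file for a user-provided alignment file.
unaligned_length_ecdf – length distribution of the unaligned reads.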
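
A profile directory can be checked against the file list above with a few lines of Python. This is a sketch, not part of NanoSim-H: the helper `missing_profile_files` is hypothetical, and treating the listed names as literal file names that must all be present is an assumption. The alignment files (`*.maf`) are left out of the default list because which of them exists depends on how the profile was trained.

```python
import sys
from pathlib import Path

# Names taken from the list above; which of them are mandatory is an assumption
PROFILE_FILES = (
    "aligned_length_ecdf", "aligned_reads_ecdf", "align_ratio",
    "match.hist", "mis.hist", "ins.hist", "del.hist", "first_match.hist",
    "error_markov_model", "ht_ratio", "match_markov_model",
    "model_profile", "unaligned_length_ecdf",
)


def missing_profile_files(profile_dir, expected=PROFILE_FILES):
    """Return the expected file names that are absent from profile_dir."""
    directory = Path(profile_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in missing_profile_files(target):
        print("missing:", name)
```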
