Snakemake:不存在的通配符出错

2024-10-02 12:29:02 发布

您现在位置:Python中文网/ 问答频道 /正文

Edit 2 : I figure it out. I posted my answer as reply.

Edit 1 : I added a beginning of solution at the end of the question following @bli advices and https://stackoverflow.com/a/41185568/1025741

我正在编写一个snakemake文件,在这个文件中我解析了一个示例表文件(在yaml配置文件中定义),以便连接这个示例表中列出的文件。在

示例表如下:

sample  unit    fq1 fq2
A   lane1   A.l1.1.R1.txt   A.l1.1.R2.txt
A   lane1   A.l1.2.R1.txt   A.l1.2.R2.txt
A   lane2   A.l2.R1.txt A.l2.R2.txt
B   lane1   B.l1.R1.txt B.l1.R2.txt

其思想是连接来自同一样本和样本单元的文件(列在fq1和fq2中)。在这种情况下:

  • A.l1.1.R1.txtA.l2.2.R1.txt将被连接
  • A.l1.1.R2.txt和{}将被串联

其他文件将不会串联,但也会在该目录结构中报告:

^{pr2}$

所以在这个例子的最后,我应该:

A/
  A_lane1_merged_R1.txt
  A_lane1_merged_R2.txt
  A_lane2_merged_R1.txt
  A_lane2_merged_R2.txt
B/
  B_lane1_merged_R1.txt
  B_lane1_merged_R2.txt

下面是我的snakemake文件来执行这样的任务:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

# open samplesheet
units = pd.read_table(config["units"], dtype=str)
units = units.set_index(["sample", "unit"])


rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            sample=units.index.get_level_values('sample').unique(),
            unit=units.index.get_level_values('unit').unique()),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            sample=units.index.get_level_values('sample').unique(),
            unit=units.index.get_level_values('unit').unique())


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()


rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        "{sample}/{sample}_{unit}_merge_R1.txt",
        "{sample}/{sample}_{unit}_merge_R2.txt"
    shell:
        """
        echo {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
        echo {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
        """

以及配置.yaml公司名称:

units: units.tsv

但是我有一个错误,因为我没有一个单元为lane2的示例B

InputFunctionException in line 29 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
KeyError: ('B', 'lane2')
Wildcards:
sample=B
unit=lane2

有没有避免这种错误的方法/技巧? 谢谢

Beginning of solution

已使用的@I建议的bli版本itertools.product通过将其包装到高阶生成器中,该生成器检查生成的通配符组合是否在预先建立的列表中:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

### 
from itertools import product

def filter_combinator(combinator, inlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accomodate
            # unpredictable wildcard order
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of pair sample-unit included in the samplesheet
inList={
    frozenset({("sample", "A"), ("unit", "lane1")}),
    frozenset({("sample", "A"), ("unit", "lane2")}),
    frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units = units.set_index(["sample", "unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values)


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        "{sample}/{sample}_{unit}_merge_R1.txt",
        "{sample}/{sample}_{unit}_merge_R2.txt"
    message:
        "test"
    shell:
        """
        cat {input.r1} > {sample}/{sample}_{unit}_merge_R1.txt
        cat {input.r2} > {sample}/{sample}_{unit}_merge_R2.txt
        """

但在运行snakemake -n时,它会返回一个错误:

Job 1: test

RuleException in line 53 of /home/nrosewick/Documents/analysis/pilot_data_ADX17009/workflow/test_snakemake/Snakefile:
NameError: The name 'sample' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}

有什么线索吗?在


Tags: sampletxtinputgetindexunitmergelevel
1条回答
网友
1楼 · 发布于 2024-10-02 12:29:02

以下是我基于https://stackoverflow.com/a/41185568/1025741找到的解决方案:

import pandas as pd
shell.executable("bash")

configfile: "config.yaml"

### 
from itertools import product

def filter_combinator(combinator, inlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            # Use frozenset instead of tuple
            # in order to accomodate
            # unpredictable wildcard order
            if frozenset(wc_comb) in inlist:
                yield wc_comb
    return filtered_combinator

# open samplesheet
units = pd.read_table(config["units"], dtype=str)

# list of pair sample-unit
#inList=units[["sample","unit"]].drop_duplicates().to_dict('r')
inList={
    frozenset({("sample", "A"), ("unit", "lane1")}),
    frozenset({("sample", "A"), ("unit", "lane2")}),
    frozenset({("sample", "B"), ("unit", "lane1")})}

# set df index
units=units.set_index(["sample","unit"])

# build new iterator
filtered_product = filter_combinator(product, inList)

rule all:
    input:
        expand("{sample}/{sample}_{unit}_merge_R1.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values),
        expand("{sample}/{sample}_{unit}_merge_R2.txt",
            filtered_product,
            sample=units.index.get_level_values('sample').unique().values,
            unit=units.index.get_level_values('unit').unique().values)


def get_fastq_r1(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq1"]].dropna().values.flatten()

def get_fastq_r2(wildcards):
    return units.loc[(wildcards.sample, wildcards.unit), ["fq2"]].dropna().values.flatten()

rule merge:
    input:
        r1=get_fastq_r1,
        r2=get_fastq_r2
    output:
        r1_o="{sample}/{sample}_{unit}_merge_R1.txt",
        r2_o="{sample}/{sample}_{unit}_merge_R2.txt"
    message:
        "test"
    shell:
        """
        cat {input.r1} > {output.r1_o}
        cat {input.r2} > {output.r2_o}
        """

相关问题 更多 >

    热门问题