Renaming files with Snakemake wildcards

Posted 2024-06-26 13:32:48


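I'm having a hard time solving a problem in Snakemake. My samples are currently named like "1-2-Brain_187_006_S77_L002_R1_001.fastq.gz". I'd like to end up with a shorter name such as "1_2_Brain_S77_L002_R1", followed by the extension "_trim.fastq.gz", as the output of my rule. I'm trimming with bbduk. For my input, I want to go through allSamples, the list of sample keys in my config, and then access the values in each sample's dictionary, specifically the "shortName1" and "shortName2" values. My problem is that the dry run shows the entire list as the input of a single job. I can't figure out how to make each element register as its own job. I'm using 3 file names as an example; in reality I have 114, so I expect the dry run to show a count of 114 for the trim job.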

config.json

{
   "allSamples" : ["1_2_Brain_S77_L002", "10_4_Kidney_S82_L002", "11_4_BB_S105_L002" ......],

   "1_2_Brain_S77_L002":{
        "sampleName1": "1-2-Brain_187_006_S77_L002_R1_001.fastq.gz",
        "sampleName2": "1-2-Brain_187_006_S77_L002_R2_001.fastq.gz",
        "shortName1": "1_2_Brain_S77_L002_R1",
        "shortName2": "1_2_Brain_S77_L002_R2",
        "stemName": "1_2_Brain_S77_L002"
        }, ....
}

I'm taking the files located in rawReads/ and storing the newly trimmed files in trimmedReads/.

Snakefile

configfile: "refs/config.json"

# variables
sampleDict = config["allSamples"]
sampleNames1 = [config[i]["sampleName1"] for i in sampleDict]
sampleNames2 = [config[i]["sampleName2"] for i in sampleDict]
shortNames1 = [config[i]["shortName1"] for i in sampleDict]
shortNames2 = [config[i]["shortName2"] for i in sampleDict]

rule all:
    input: 
        expand("trimmedReads/{trim1}_trim.fastq.gz", trim1 = shortNames1),
        expand("trimmedReads/{trim2}_trim.fastq.gz", trim2 = shortNames2)

rule trim:
    input:
        R1 = expand("rawReads/{sample1}", sample1 = sampleNames1),
        R2 = expand("rawReads/{sample2}", sample2 = sampleNames2)
    output:
        trim1 = expand("trimmedReads/{trim1}_trim.fastq.gz", trim1 = shortNames1),
        trim2 = expand("trimmedReads/{trim2}_trim.fastq.gz", trim2 = shortNames2)
    shell:
        """
        bbduk.sh in1={input.R1} in2={input.R2} out1={output.trim1} out2={output.trim2} ref=ref/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
        """

When I do a dry run, I get this:

Building DAG of jobs...
Job counts:
    count   jobs
    1   all
    1   trim
    2

[Mon May 24 22:42:36 2021]
rule trim:
    input: rawReads/1-2-Brain_187_006_S77_L002_R1_001.fastq.gz, rawReads/10-4-Kidney_127_066_S82_L002_R1_001.fastq.gz, rawReads/11-4_BB_041_152_S105_L002_R1_001.fastq.gz, ...
    output: trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz, trimmedReads/10_4_Kidney_S82_L002_R1_trim.fastq.gz, trimmedReads/11_4_BB_S105_L002_R1_trim.fastq.gz, ...
    jobid: 1


bbduk.sh in1=rawReads/1-2-Brain_187_006_S77_L002_R1_001.fastq.gz rawReads/10-4-Kidney_127_066_S82_L002_R1_001.fastq.gz rawReads/11-4_BB_041_152_S105_L002_R1_001.fastq.gz ... out1=trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz trimmedReads/10_4_Kidney_S82_L002_R1_trim.fastq.gz trimmedReads/11_4_BB_S105_L002_R1_trim.fastq.gz ... ref=ref/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
        

[Mon May 24 22:42:36 2021]
localrule all:
    input: trimmedReads/1_2_Brain_S77_L002_R1_trim.fastq.gz, trimmedReads/10_4_Kidney_S82_L002_R1_trim.fastq.gz, trimmedReads/11_4_BB_S105_L002_R1_trim.fastq.gz, ...
    jobid: 0

Job counts:
    count   jobs
    1   all
    1   trim
    2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

1 answer
User
#1 · posted 2024-06-26 13:32:48

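The wildcards of rule test: are an empty dict. No value for wildcards.sample is defined in this rule. Every wildcard has to appear in the output: section, and for this rule that section is empty. In fact, unless you explicitly request rule test: as the target, it has no effect at all: with no output declared, Snakemake simply ignores a rule that produces nothing.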

My guess is that the files ["rawReads/1_2_Brain_S77_L002", "rawReads/17_6_Brain_S83_L002"] already exist, so Snakemake finds the targets on disk and does nothing, which gives you "no output".

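I don't understand what you mean by "ultimately rename them to a shorter name", but here is a way to copy files. Use it as a pattern for how to access your sample names with wildcards: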

rule all:
    input: ["path_to_target/foo_SampleName1_bar", "path_to_target/foo_SampleName2_bar"]
    # List the files you expect to get as a target

rule copy:
    input:
        "path_to_source/blablabla_{sample}_bazz"
    output:
        "path_to_target/foo_{sample}_bar"
    shell:
        "echo {input}; cp {input} {output}"

How it works:

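  1. Snakemake sees that it needs to produce some files (in this case "path_to_target/foo_SampleName1_bar" and "path_to_target/foo_SampleName2_bar").
  2. Snakemake sees that the output declared by rule copy: matches the file name "path_to_target/foo_SampleName1_bar" if {sample} is replaced with the value "SampleName1".
  3. If the file "path_to_source/blablabla_SampleName1_bazz" exists, the requirement is satisfied and Snakemake knows how to produce the file "path_to_target/foo_SampleName1_bar".
  4. Steps 2 and 3 are repeated with {sample} set to "SampleName2".
  5. Now it knows that rule copy: has to run twice: once per file.
  6. All dependencies are resolved, and Snakemake can start the pipeline.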
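
The steps above can be imitated in a few lines of plain Python. This is only an illustrative sketch, not Snakemake's actual implementation: the helper names (pattern_to_regex, match_wildcards, plan_jobs) are made up for this example, each wildcard is assumed to match ".+", and the check that the input file exists on disk (step 3) is left out.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'dir/foo_{sample}_bar' into a regex with one named group per wildcard.

    Simplification: each wildcard name may appear only once in the pattern.
    """
    # Splitting on a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal piece of the file name
        else:
            regex += f"(?P<{part}>.+)"      # wildcard, assumed to match '.+'
    return re.compile(regex)


def match_wildcards(pattern, target):
    """Return the wildcard values that make `pattern` equal `target`, or None."""
    m = pattern_to_regex(pattern).fullmatch(target)
    return m.groupdict() if m else None


def plan_jobs(targets, output_pattern, input_pattern):
    """One job per target that the output pattern can produce (steps 2 to 5)."""
    jobs = []
    for target in targets:
        wildcards = match_wildcards(output_pattern, target)
        if wildcards is None:
            continue  # this rule cannot produce the target
        jobs.append({
            "wildcards": wildcards,
            "input": input_pattern.format(**wildcards),
            "output": target,
        })
    return jobs


# Same names as in rule all: and rule copy: above.
targets = [
    "path_to_target/foo_SampleName1_bar",
    "path_to_target/foo_SampleName2_bar",
]
jobs = plan_jobs(
    targets,
    output_pattern="path_to_target/foo_{sample}_bar",
    input_pattern="path_to_source/blablabla_{sample}_bazz",
)
for job in jobs:
    print(job)
```

The point of the sketch is that the number of jobs follows from the number of requested targets, because the wildcard value is inferred separately for each target from the output pattern.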

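
Applied to the trimming rule from the question, one possible approach is to keep a single {sample} wildcard in output: and let an input function look up the long raw file names in the config. The sketch below is untested against a real Snakemake run and rests on assumptions: raw_reads and all_targets are hypothetical helper names, the config is cut down to two samples, and the Kidney R2 file name is inferred by analogy with the R1 name shown in the dry-run log. The helpers are plain Python so they can be tried outside Snakemake; the rule wiring is shown only as comments.

```python
from types import SimpleNamespace

# Reduced stand-in for refs/config.json from the question (two samples only).
# The Kidney R2 file name is an assumption; only its R1 name appears in the log.
config = {
    "allSamples": ["1_2_Brain_S77_L002", "10_4_Kidney_S82_L002"],
    "1_2_Brain_S77_L002": {
        "sampleName1": "1-2-Brain_187_006_S77_L002_R1_001.fastq.gz",
        "sampleName2": "1-2-Brain_187_006_S77_L002_R2_001.fastq.gz",
    },
    "10_4_Kidney_S82_L002": {
        "sampleName1": "10-4-Kidney_127_066_S82_L002_R1_001.fastq.gz",
        "sampleName2": "10-4-Kidney_127_066_S82_L002_R2_001.fastq.gz",
    },
}


def raw_reads(wildcards):
    """Input function: map the short sample key to its two raw read files."""
    entry = config[wildcards.sample]
    return {
        "R1": "rawReads/" + entry["sampleName1"],
        "R2": "rawReads/" + entry["sampleName2"],
    }


def all_targets():
    """Every trimmed file that rule all: should request."""
    return [
        f"trimmedReads/{stem}_{read}_trim.fastq.gz"
        for stem in config["allSamples"]
        for read in ("R1", "R2")
    ]


# In the Snakefile the helpers could be wired up roughly like this
# (no expand() inside rule trim, so each sample becomes its own job):
#
# rule all:
#     input: all_targets()
#
# rule trim:
#     input: unpack(raw_reads)
#     output:
#         trim1 = "trimmedReads/{sample}_R1_trim.fastq.gz",
#         trim2 = "trimmedReads/{sample}_R2_trim.fastq.gz"
#     shell:
#         "bbduk.sh in1={input.R1} in2={input.R2} "
#         "out1={output.trim1} out2={output.trim2} "
#         "ref=ref/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo"

# Quick check outside Snakemake, with a stand-in for the wildcards object.
example = raw_reads(SimpleNamespace(sample="1_2_Brain_S77_L002"))
print(example["R1"])
print(all_targets())
```

With this layout the dry run should list one trim job per entry in allSamples, since each pair of trimmed files is matched against the output pattern with its own value of {sample}.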