Merging files based on a column

Posted 2024-09-30 10:40:29

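I have 100 files that I want to merge based on the mir_seq column in each file. The output should be a single file containing the mir_seq column plus the freq column from each of the original files.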

The files look like this:

File 1:

 mir_seq                                    seq                      name                   freq    mir start   end mism    add t5  t3  s5  s3  DB  ambiguity
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT  TGAGAAGAAGCACTGTAGCTCTT seq_100006_x0     0 hsa-miR-143-3p  61  81  6AT u-TT    0   0   AGTCTGAG    GCTCAGGA    miRNA   1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA  GACCCTGTAGATCCGAATTTGTA seq_100012_x1   1   hsa-miR-10a-5p  22  43  1GT u-A 0   u-G TATATACC    TGTGTAAG    miRNA   1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG  GACCCTGTAGATCCGAATTTGTG seq_100013_x54  54  hsa-miR-10a-5p  22  44  1GT 0   0   0   TATATACC    TGTGTAAG    miRNA   1

File 2:

mir_seq                                  seq    name    freq    mir start   end mism    add t5  t3  s5  s3  DB    ambiguity
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT  TGAGAAGAAGCACTGTAGCTCTT seq_100006_x1   1   hsa-miR-143-3p  61  81  6AT u-TT    0   0   AGTCTGAG    GCTCAGGA    miRNA   1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA  GACCCTGTAGATCCGAATTTGTA seq_100012_x0   0   hsa-miR-10a-5p  22  43  1GT u-A 0   u-G TATATACC    TGTGTAAG    miRNA   1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG  GACCCTGTAGATCCGAATTTGTG seq_100013_x24  24  hsa-miR-10a-5p  22  44  1GT 0   0   0   TATATACC    TGTGTAAG    miRNA   1
hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT    TTAGGGCCCTGGCTCCATCT    seq_100019_x17  17  hsa-miR-1296-5p 16  35  0   0   0   u-CC    TGGGTTAG    CTCCTTTA    miRNA   1

The files are tab-delimited and are named as follows, differing only in the part between "_" and ".txt.mirna":

Miraligner_94G.txt.mirna
Miraligner_944G.txt.mirna

The output file should look like this:

mir_seq                                  freq_94G     freq_944G     freq_912R
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT   0            12            55

2 Answers

OK, assuming you are working with the files:

Miraligner_94G.txt.mirna
Miraligner_944G.txt.mirna

It looks like you are just picking one column out of each file.

So:

#!/usr/bin/env perl
use strict;
use warnings;

my %data;
my %seen;

foreach my $file ( glob("Miraligner_*") ) {
    # sample ID is the part of the file name between "_" and ".txt"
    my ($freq_id) = ( $file =~ m/_(\w+)\.txt/ );
    $freq_id = "freq_$freq_id";
    $seen{$freq_id}++;
    open( my $input, "<", $file ) or die $!;
    my @headers = split( ' ', <$input> );
    while (<$input>) {
        my %line;
        @line{@headers} = split;
        my $key = $line{'mir_seq'};
        $data{$key}{$freq_id} = $line{'freq'};
    }
    close($input);
}

my @cols = sort keys %seen;
print join( "\t", "mir_seq", @cols ), "\n";
foreach my $mir_seq ( sort keys %data ) {
    my @output_cols = map { $_ // 0 } @{ $data{$mir_seq} }{@cols};
    print join( "\t", $mir_seq, @output_cols ), "\n";
}

Given your data set, this outputs (tab-separated):

mir_seq freq_944G   freq_94G
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA  1   0
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG  54  24
hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT    0   17
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT  0   1

Note: if a value is undefined, this currently prints a zero. If you want to print something else, you will need to modify that map.
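For readers working in Python, the same "fill in a default for missing samples" step might look roughly like the sketch below. The helper name format_row and its missing parameter are hypothetical and not part of the Perl script; the dictionary lookup with a default plays the role of the // operator in the map.

```python
def format_row(mir_seq, freqs, columns, missing="0"):
    """Build one tab-separated output line.

    freqs maps a column name (e.g. "freq_94G") to its value; any column
    absent from freqs is filled with `missing`.
    """
    values = [str(freqs.get(col, missing)) for col in columns]
    return "\t".join([mir_seq] + values)


columns = ["freq_944G", "freq_94G"]
freqs = {"freq_94G": "17"}  # this sequence occurs in only one sample

print(format_row("hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT", freqs, columns))
print(format_row("hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT", freqs, columns, missing="NA"))
```

Passing missing="NA" (or an empty string) changes only the placeholder; real zero counts read from a file are left alone because they are present in the dictionary.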

It also sorts most of this alphabetically, which may not be what you want either, but there are plenty of sorting examples you can refer to.
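As one possible alternative ordering, a numeric-aware ("natural") sort would put freq_94G before freq_944G. The following is a minimal Python sketch, assuming the sample IDs are a number followed by letters as in the question; natural_key is a hypothetical helper name.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 94 sorts before 944."""
    # re.split with a capturing group keeps the digit runs in the result,
    # e.g. "freq_944G" -> ["freq_", "944", "G"]
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


cols = ["freq_944G", "freq_94G", "freq_912R"]
print(sorted(cols))                   # plain string sort
print(sorted(cols, key=natural_key))  # numeric-aware sort
```

The plain sort compares character by character, which is why freq_944G lands before freq_94G in the output above; the keyed sort compares the numeric chunk as an integer instead.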

You provided only one sample input file, so this is obviously untested, since you can't test a "merge" with just one file:

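awk '
BEGIN { OFS = "\t" }    # write tab-separated output
FNR==1 {
    # first line of each file: derive the sample ID from the file name
    split(FILENAME,tmp,/[_.]/)
    sfx = tmp[2]
    sfxs[sfx]
    next                # skip the header row so it is not stored as data
}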
{
    keys[$1]
    val[$1,sfx] = $4
}
END {
    printf "mir_seq"
    for (sfx in sfxs) {
        printf "%sfreq_%s", OFS, sfx
    }
    print ""

    for (key in keys) {
        printf "%s", key
        for (sfx in sfxs) {
            printf "%s%d", OFS, val[key,sfx]
        }
        print ""
    }
}
' Miraligner_*
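The same idea can be sketched in Python using only the standard library. This is a rough equivalent under stated assumptions, not a tested solution for the real 100 files: merge_freq is a hypothetical function name, the files are assumed to follow the Miraligner_<id>.txt.mirna naming scheme, and fields are assumed to be whitespace-separated with no empty values. Unlike the awk loop over array keys, it keeps rows and columns in first-seen order.

```python
import re
from pathlib import Path


def merge_freq(paths):
    """Collect the freq column of every file, keyed by mir_seq.

    Returns (header, rows), where each row is a list of strings.
    """
    data = {}     # mir_seq -> {column name -> freq}
    columns = []  # one "freq_<id>" column per input file, in input order
    for path in paths:
        # Assumed naming scheme: Miraligner_<id>.txt.mirna
        match = re.search(r"_(\w+)\.txt", Path(path).name)
        if match is None:
            raise ValueError(f"unexpected file name: {path}")
        col = f"freq_{match.group(1)}"
        columns.append(col)
        with open(path) as handle:
            # Locate the two columns by name rather than by position
            header = handle.readline().split()
            key_idx = header.index("mir_seq")
            freq_idx = header.index("freq")
            for line in handle:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines
                data.setdefault(fields[key_idx], {})[col] = fields[freq_idx]
    # Sequences missing from a file get "0", like the answers above
    rows = [[key] + [data[key].get(col, "0") for col in columns]
            for key in data]
    return ["mir_seq"] + columns, rows
```

A typical call would pass sorted(glob.glob("Miraligner_*")) as paths and then print the header and each row joined with tabs.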
