Compare text in two files and append text to a field

ProbeID        rsID        chr  bp         strand  alleleA  alleleB
SNP_A-1780270  rs987435    7    78599583   -       C        G
SNP_A-1780271  rs345783    15   33395779   -       C        G
SNP_A-1780272  rs955894    1    189807684  -       G        T
SNP_A-1780274  rs6088791   20   33907909   -       A        G
SNP_A-1780277  rs11180435  12   75664046   +       C        T
SNP_A-1780278  rs17571465  1    218890658  -       A        T
SNP_A-1780283  rs17011450  4    127630276  -       C        T
SNP_A-1780285  rs6919430   6    90919465   +       A        C
SNP_A-1780286  rs41528453  ---  ---        ---     A        G
SNP_A-1780287  rs2342723   16   5748791    +       C        T

ProbeID        call  genotype
SNP_A-1780270  2     G G
SNP_A-1780271  0     C C
SNP_A-1780272  2     T T
SNP_A-1780274  1     A G
SNP_A-1780277  0     C C
SNP_A-1780278  2     T T
SNP_A-1780283  2     T T
SNP_A-1780285  2     C C
SNP_A-1780286  0     A A
SNP_A-1780287  0     C C

3 answers

User

Answer 1 · edited 2024-09-28 22:14:17

Using pandas:

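import pandas as pd

# Read both whitespace-delimited files and index them by ProbeID
A = pd.read_csv('FileA', sep=r'\s+')
B = pd.read_csv('FileB', sep=r'\s+')
A = A.set_index('ProbeID')
B = B.set_index('ProbeID')
# Align the two tables side by side on the shared ProbeID index
C = pd.concat([A, B], axis=1)

# call == 0: both alleles are alleleA, so copy alleleA into alleleB
idx = C['call'] == 0
C.loc[idx, 'alleleB'] = C.loc[idx, 'alleleA']
# call == 2: both alleles are alleleB, so copy alleleB into alleleA
idx = C['call'] == 2
C.loc[idx, 'alleleA'] = C.loc[idx, 'alleleB']
print(C[['call', 'alleleA', 'alleleB']])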

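This prints the call, alleleA and alleleB columns for every probe. Rows with call 0 show alleleA in both allele columns, and rows with call 2 show alleleB in both.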

If you have many B files, you can use something like this:

import pandas as pd

A = pd.read_csv('FileA', sep=r'\s+')
A = A.set_index('ProbeID')

BFiles = ['FileB1', 'FileB2', 'FileB3']
for i, bfile in enumerate(BFiles):
    # Read the current B file and align it with A on ProbeID
    B = pd.read_csv(bfile, sep=r'\s+')
    B = B.set_index('ProbeID')
    C = pd.concat([A, B], axis=1)

    idx = C['call'] == 0
    C.loc[idx, 'alleleB'] = C.loc[idx, 'alleleA']
    idx = C['call'] == 2
    C.loc[idx, 'alleleA'] = C.loc[idx, 'alleleB']
    cfile = 'FileC{i}'.format(i=i)
    # Write the selected columns as space-separated text
    C[['call', 'alleleA', 'alleleB']].to_csv(cfile, sep=' ')

Change cfile to an appropriate value.
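The pandas approach can also produce a single genotype column directly. The following is only a minimal sketch under stated assumptions: the two files are replaced by small in-memory samples (`file_a`, `file_b` are hypothetical stand-ins), FileB is assumed to contain just `ProbeID` and `call`, and `call` is assumed to take only the values 0, 1 or 2.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for FileA and FileB (three sample probes).
file_a = io.StringIO(
    "ProbeID rsID chr bp strand alleleA alleleB\n"
    "SNP_A-1780270 rs987435 7 78599583 - C G\n"
    "SNP_A-1780271 rs345783 15 33395779 - C G\n"
    "SNP_A-1780274 rs6088791 20 33907909 - A G\n"
)
file_b = io.StringIO(
    "ProbeID call\n"
    "SNP_A-1780270 2\n"
    "SNP_A-1780271 0\n"
    "SNP_A-1780274 1\n"
)

A = pd.read_csv(file_a, sep=r"\s+")
B = pd.read_csv(file_b, sep=r"\s+")

# Join the two tables on the shared ProbeID column.
C = A.merge(B, on="ProbeID", how="inner")

# Assumption: call is 0, 1 or 2.
# First allele: alleleA unless call == 2, in which case alleleB.
first = C["alleleA"].where(C["call"] < 2, C["alleleB"])
# Second allele: alleleB unless call == 0, in which case alleleA.
second = C["alleleB"].where(C["call"] > 0, C["alleleA"])
C["genotype"] = first + " " + second

print(C[["ProbeID", "call", "genotype"]].to_string(index=False))
```

An inner join keeps only probes present in both files; any call value outside 0 to 2 would need separate handling before building the genotype.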

User

Answer 2 · edited 2024-09-28 22:14:17

This is easy to do with a nested dictionary:

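fileA = 'FileA'  # paths to the two input files
fileB = 'FileB'

data = {}
with open(fileA) as fA:
    header = next(fA).split()
    attributes = header[1:]
    for line in fA:
        lst = line.split()
        # Map each ProbeID to a dict of its remaining columns
        data[lst[0]] = dict(zip(attributes, lst[1:]))

with open(fileB) as fB:
    header = next(fB).split()
    for line in fB:
        # Only the first two columns (ProbeID, call) are needed
        ID, call = line.split()[:2]
        data[ID]['call'] = int(call)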

Now you only need to iterate over data and print what you want.
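As an illustration of that final iteration step, here is a minimal sketch. The `data` literal is a hypothetical in-memory sample shaped like the nested dictionary built above, `genotype` is a helper name introduced only for this example, and reporting unexpected call values as "NA" is an assumption.

```python
# Hypothetical sample shaped like the nested dictionary built from the files.
data = {
    "SNP_A-1780270": {"alleleA": "C", "alleleB": "G", "call": 2},
    "SNP_A-1780271": {"alleleA": "C", "alleleB": "G", "call": 0},
    "SNP_A-1780274": {"alleleA": "A", "alleleB": "G", "call": 1},
}


def genotype(record):
    """Return the genotype string for one probe record (hypothetical helper)."""
    a, b, call = record["alleleA"], record["alleleB"], record["call"]
    if call == 0:
        return f"{a} {a}"  # both alleles are alleleA
    if call == 1:
        return f"{a} {b}"  # one of each
    if call == 2:
        return f"{b} {b}"  # both alleles are alleleB
    return "NA"  # assumption: any other call value is reported as missing


# Collect (ProbeID, call, genotype) rows, then print them.
rows = [(probe_id, rec["call"], genotype(rec)) for probe_id, rec in data.items()]
print("ProbeID", "call", "genotype")
for probe_id, call, gt in rows:
    print(probe_id, call, gt)
```

Dictionaries keep insertion order on Python 3.7 and later, so the rows come out in the order the probes were read.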

Alternatively, if the lines of the two files correspond exactly, you can process them one line at a time with itertools.izip (or the built-in zip on Python 3).
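The original code for the line-by-line approach did not survive on this page, so the following is only a sketch of what it might look like on Python 3, not the answerer's code. The in-memory `file_a` and `file_b` objects and the `merge_lines` function are hypothetical, and the sketch assumes the allele columns are the last two fields of each FileA line and that call is 0, 1 or 2.

```python
import io

# Hypothetical stand-ins for the two files; real code would use open(...).
file_a = io.StringIO(
    "ProbeID rsID chr bp strand alleleA alleleB\n"
    "SNP_A-1780270 rs987435 7 78599583 - C G\n"
    "SNP_A-1780271 rs345783 15 33395779 - C G\n"
    "SNP_A-1780274 rs6088791 20 33907909 - A G\n"
)
file_b = io.StringIO(
    "ProbeID call\n"
    "SNP_A-1780270 2\n"
    "SNP_A-1780271 0\n"
    "SNP_A-1780274 1\n"
)


def merge_lines(fa, fb):
    """Yield (ProbeID, call, genotype) by walking both files in lockstep."""
    next(fa)  # skip the FileA header
    next(fb)  # skip the FileB header
    for line_a, line_b in zip(fa, fb):
        fields_a = line_a.split()
        fields_b = line_b.split()
        probe_a, allele_a, allele_b = fields_a[0], fields_a[-2], fields_a[-1]
        probe_b, call = fields_b[0], int(fields_b[1])
        # Guard against the files drifting out of step.
        if probe_a != probe_b:
            raise ValueError(f"ProbeID mismatch: {probe_a} vs {probe_b}")
        # Assumption: call is 0, 1 or 2; anything else raises KeyError.
        pairs = {
            0: (allele_a, allele_a),
            1: (allele_a, allele_b),
            2: (allele_b, allele_b),
        }
        yield probe_a, call, " ".join(pairs[call])


result = list(merge_lines(file_a, file_b))
for row in result:
    print(*row)
```

Because only one line from each file is held at a time, this variant uses little memory, but it depends entirely on both files listing the probes in the same order.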

User

Answer 3 · edited 2024-09-28 22:14:17

Here is an R solution.

my.data <- merge(df1, df2, by = "ProbeID")

# select rows based on call
zero <- my.data$call == 0
one <- my.data$call == 1
two <- my.data$call == 2

# subset rows based on previous condition and calculate genotype
my.data[zero, "genotype"] <- paste(my.data$alleleA[zero], my.data$alleleA[zero], sep = " ")
my.data[one, "genotype"] <- paste(my.data$alleleA[one], my.data$alleleB[one], sep = " ")
my.data[two, "genotype"] <- paste(my.data$alleleB[two], my.data$alleleB[two], sep = " ")

my.data[, c("ProbeID", "call", "genotype")]


        ProbeID call genotype
1  SNP_A-1780270    2      G G
2  SNP_A-1780271    0      C C
3  SNP_A-1780272    2      T T
4  SNP_A-1780274    1      A G
5  SNP_A-1780277    0      C C
6  SNP_A-1780278    2      T T
7  SNP_A-1780283    2      T T
8  SNP_A-1780285    2      C C
9  SNP_A-1780286    0      A A
10 SNP_A-1780287    0      C C
