How do I read this dataset with pandas?

Posted 2024-09-21 05:53:00

I'm trying to write a Python script that filters some information out of the following dataset:

>Feature NC_000913<
190 255 CDS
            gene    thrL
            inference   NCBI RefSeq Database
            inference   UniProtKB/Swiss-Prot:P0AD86
            locus_tag   16127995
            product thr operon leader peptide
337 2799    CDS
            gene    thrA
            inference   NCBI RefSeq Database
            inference   UniProtKB/Swiss-Prot:P00561
            locus_tag   16127996
            product Bifunctional aspartokinase/homoserine dehydrogenase 1
2801    3733    CDS
            gene    thrB
            inference   NCBI RefSeq Database
            inference   UniProtKB/Swiss-Prot:P00547
            locus_tag   16127997
            product homoserine kinase
3734    5020    CDS
            gene    thrC
            inference   NCBI RefSeq Database
            inference   UniProtKB/Swiss-Prot:P00934
            locus_tag   16127998
            product L-threonine synthase
5234    5530    CDS
            gene    yaaX
            inference   NCBI RefSeq Database
            inference   UniProtKB/Swiss-Prot:P75616
            locus_tag   16127999
            product DUF2502 family putative periplasmic protein

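The task is to search for gaps longer than 20 numbers between one feature's end and the next feature's start, for example 255-337. The script should then write the gap and the Swiss-Prot IDs on either side of it, for example P0AD86-P00561, to a text file. I tried pandas because I thought it would suit this task. My attempt: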

import pandas as pd

df = pd.read_csv("K12.tbl", error_bad_lines=False, header=(0), engine='python')

print(df.head(21))

I was trying to put the contents of the .tbl file into a structured table. This is my output:

>Feature NC_000913<
0                                       190\t255\tCDS
1                                    \t\t\tgene\tthrL
2               \t\t\tinference\tNCBI RefSeq Database
3        \t\t\tinference\tUniProtKB/Swiss-Prot:P0AD86
4                           \t\t\tlocus_tag\t16127995
5            \t\t\tproduct\tthr operon leader peptide
6                                      337\t2799\tCDS
7                                    \t\t\tgene\tthrA
8               \t\t\tinference\tNCBI RefSeq Database
9        \t\t\tinference\tUniProtKB/Swiss-Prot:P00561
10                          \t\t\tlocus_tag\t16127996
11  \t\t\tproduct\tBifunctional aspartokinase/homo...
12                                    2801\t3733\tCDS
13                                   \t\t\tgene\tthrB
14              \t\t\tinference\tNCBI RefSeq Database
15       \t\t\tinference\tUniProtKB/Swiss-Prot:P00547
16                          \t\t\tlocus_tag\t16127997
17                   \t\t\tproduct\thomoserine kinase
18                                    3734\t5020\tCDS
19                                   \t\t\tgene\tthrC
20              \t\t\tinference\tNCBI RefSeq Database
Skipping line 37: Expected 1 fields in line 37, saw 2
Skipping line 79: Expected 1 fields in line 79, saw 2
Skipping line 85: Expected 1 fields in line 85, saw 2

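(See https://pastebin.com/N1z9mpqb.) I don't know how to get a proper table, or how to compare the numbers to find the gaps. This is my first time doing data analysis. I hope someone can help me, and thanks for any ideas :D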


2 answers

Your file is tab-separated, but the header line doesn't show the number of fields. You can use the names parameter as a hint:

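import pandas as pd

# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas 1.3+) replaces it
df = pd.read_csv("K12.tbl", sep='\t', names=['A', 'B', 'C', 'D', 'E'], on_bad_lines='skip')
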
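To illustrate where the names hint can lead, here is a minimal sketch that turns the five columns into one row per feature and then looks for gaps. It runs on a three-record excerpt of the question's data held in memory rather than on K12.tbl, and the column names (start, stop, kind, key, value) are assumptions chosen for readability, not part of the file format.

```python
import io
import pandas as pd

# Three records from the question, written with real tab characters
sample = (
    ">Feature NC_000913<\n"
    "190\t255\tCDS\n"
    "\t\t\tgene\tthrL\n"
    "\t\t\tinference\tNCBI RefSeq Database\n"
    "\t\t\tinference\tUniProtKB/Swiss-Prot:P0AD86\n"
    "337\t2799\tCDS\n"
    "\t\t\tgene\tthrA\n"
    "\t\t\tinference\tUniProtKB/Swiss-Prot:P00561\n"
    "2801\t3733\tCDS\n"
    "\t\t\tgene\tthrB\n"
    "\t\t\tinference\tUniProtKB/Swiss-Prot:P00547\n"
)

# Hypothetical column names; skiprows=1 drops the ">Feature" header line
cols = ["start", "stop", "kind", "key", "value"]
raw = pd.read_csv(io.StringIO(sample), sep="\t", names=cols, skiprows=1, dtype=str)

# Rows with a start value open a new record; qualifier rows belong to the latest one
is_feature = raw["start"].notna()
raw["record"] = is_feature.cumsum()

# One row per feature, with integer coordinates
features = raw[is_feature].set_index("record")[["start", "stop"]].astype(int)

# Attach the Swiss-Prot ID found among each record's qualifier rows
swiss = raw[raw["value"].str.startswith("UniProtKB/Swiss-Prot:", na=False)]
features["swissprot"] = swiss.groupby("record")["value"].first().str.split(":").str[-1]

# Gap between this feature's stop and the next feature's start
features["gap"] = features["start"].shift(-1) - features["stop"]
features["next_swissprot"] = features["swissprot"].shift(-1)

big = features[features["gap"] > 20]
print(big[["stop", "gap", "swissprot", "next_swissprot"]])
```

Grouping by a running record number avoids assuming that every feature has the same number of qualifier lines, which the "Skipping line" messages in the question suggest is not the case in the full file.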
import re
import pandas as pd
with open('K12.tbl', 'r') as f:
    data = f.readlines()

# Get rid of newlines
data = [x.replace('\n','') for x in data]
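# Get rid of row number and leading spaces
data = [re.sub(r'(\d+\s{2,})','',x) for x in data]
# Get rid of leading tabs (the pattern matches the literal characters "\t" as printed in the question's output, not real tab characters)
data = [re.sub(r'(\\t){3}','',x) for x in data]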
# Get rid of footer lines
data = [x for x in data[1:] if 'Skipping' not in x]

# Get every 6th element which contains the number ranges you want
numbers = data[::6]
# Split the numbers into columns (again on the literal characters "\t")
numbers = [x.split(r'\t') for x in numbers]

# Create a dataframe of the start/stop/cds values
df = pd.DataFrame(numbers, columns=['start','stop','cds'])

# Shift the start column back one row to create column that holds the next start number
df['next_start'] = df['start'].shift(-1)

# Fill the last next_start NAN with zero
df = df.fillna(0)

# Create binary map of which rows represent larger than allowed skips
big_diff = df['next_start'].astype(int) - df['stop'].astype(int) > 20

# Get list of indexes where the skips are too big
big_diff_index = big_diff[big_diff].index.values

# The value you want is in the 3rd row after each set of numbers. Each record spans 6 rows,
# so record x starts at data[6*x]; get that row, split on ':' and return the value at the end
print([data[6*x + 3].split(':')[-1] for x in big_diff_index])

Output

['P0AD86']
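The code above works on the printed output and relies on every record having six rows. As an alternative, here is a minimal sketch in plain Python that reads the tab-separated lines directly and reports both IDs around each gap, as the question asks. The function names parse_features and find_gaps are hypothetical, the sample is an abbreviated excerpt of the question's data, and lines whose coordinates are not plain digits are simply skipped, which may not suit every feature table.

```python
SWISS_PREFIX = "UniProtKB/Swiss-Prot:"

# Abbreviated excerpt of the question's data, with real tab characters
SAMPLE = (
    ">Feature NC_000913<\n"
    "190\t255\tCDS\n"
    "\t\t\tinference\tUniProtKB/Swiss-Prot:P0AD86\n"
    "337\t2799\tCDS\n"
    "\t\t\tinference\tUniProtKB/Swiss-Prot:P00561\n"
    "2801\t3733\tCDS\n"
    "\t\t\tinference\tUniProtKB/Swiss-Prot:P00547\n"
    "3734\t5020\tCDS\n"
    "\t\t\tinference\tUniProtKB/Swiss-Prot:P00934\n"
    "5234\t5530\tCDS\n"
    "\t\t\tinference\tUniProtKB/Swiss-Prot:P75616\n"
)


def parse_features(lines):
    """Return a list of (start, stop, swissprot_id) tuples, one per feature."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
            # A coordinate line opens a new record; the ID is filled in later
            records.append([int(fields[0]), int(fields[1]), None])
        elif records and fields[-1].startswith(SWISS_PREFIX):
            # A qualifier line carrying the Swiss-Prot ID of the current record
            records[-1][2] = fields[-1][len(SWISS_PREFIX):]
    return [tuple(r) for r in records]


def find_gaps(records, min_gap=20):
    """Return (stop, next_start, left_id, right_id) for gaps larger than min_gap."""
    gaps = []
    for (_, stop, left_id), (start, _, right_id) in zip(records, records[1:]):
        if start - stop > min_gap:
            gaps.append((stop, start, left_id, right_id))
    return gaps


gaps = find_gaps(parse_features(SAMPLE.splitlines()))
report = "\n".join(f"{stop}-{start}\t{left}-{right}" for stop, start, left, right in gaps)
print(report)
```

On this excerpt the report contains two lines, 255-337 with P0AD86-P00561 and 5020-5234 with P00934-P75616. For the real file, the same functions could be fed an open file handle for K12.tbl, and the report string written to a text file of your choice.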
