How do I read this dataset with Pandas?

>Feature NC_000913<
190	255	CDS
			gene	thrL
			inference	NCBI RefSeq Database
			inference	UniProtKB/Swiss-Prot:P0AD86
			locus_tag	16127995
			product	thr operon leader peptide
337	2799	CDS
			gene	thrA
			inference	NCBI RefSeq Database
			inference	UniProtKB/Swiss-Prot:P00561
			locus_tag	16127996
			product	Bifunctional aspartokinase/homoserine dehydrogenase 1
2801	3733	CDS
			gene	thrB
			inference	NCBI RefSeq Database
			inference	UniProtKB/Swiss-Prot:P00547
			locus_tag	16127997
			product	homoserine kinase
3734	5020	CDS
			gene	thrC
			inference	NCBI RefSeq Database
			inference	UniProtKB/Swiss-Prot:P00934
			locus_tag	16127998
			product	L-threonine synthase
5234	5530	CDS
			gene	yaaX
			inference	NCBI RefSeq Database
			inference	UniProtKB/Swiss-Prot:P75616
			locus_tag	16127999
			product	DUF2502 family putative periplasmic protein

    >Feature NC_000913<
0   190\t255\tCDS
1   \t\t\tgene\tthrL
2   \t\t\tinference\tNCBI RefSeq Database
3   \t\t\tinference\tUniProtKB/Swiss-Prot:P0AD86
4   \t\t\tlocus_tag\t16127995
5   \t\t\tproduct\tthr operon leader peptide
6   337\t2799\tCDS
7   \t\t\tgene\tthrA
8   \t\t\tinference\tNCBI RefSeq Database
9   \t\t\tinference\tUniProtKB/Swiss-Prot:P00561
10  \t\t\tlocus_tag\t16127996
11  \t\t\tproduct\tBifunctional aspartokinase/homo...
12  2801\t3733\tCDS
13  \t\t\tgene\tthrB
14  \t\t\tinference\tNCBI RefSeq Database
15  \t\t\tinference\tUniProtKB/Swiss-Prot:P00547
16  \t\t\tlocus_tag\t16127997
17  \t\t\tproduct\thomoserine kinase
18  3734\t5020\tCDS
19  \t\t\tgene\tthrC
20  \t\t\tinference\tNCBI RefSeq Database
Skipping line 37: Expected 1 fields in line 37, saw 2
Skipping line 79: Expected 1 fields in line 79, saw 2
Skipping line 85: Expected 1 fields in line 85, saw 2

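2 answers

User

#1 · Edited 2024-09-21 05:53:00

Your file is tab-delimited, but the header line does not indicate how many fields there are. You can pass the names parameter as a hint: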

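import pandas as pd
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv("K12.tbl", sep='\t', names=['A', 'B', 'C', 'D', 'E'], on_bad_lines='skip')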
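To see what the names hint does, here is a minimal sketch on an in-memory sample instead of K12.tbl. The sample string, the use of skiprows=1 to drop the ">Feature" header, and the forward-fill step are my own assumptions for illustration and are not part of the answer above.

```python
import io
import pandas as pd

# Hypothetical sample mirroring the first two features from the question
sample = (
    ">Feature NC_000913<\n"
    "190\t255\tCDS\n"
    "\t\t\tgene\tthrL\n"
    "\t\t\tlocus_tag\t16127995\n"
    "337\t2799\tCDS\n"
    "\t\t\tgene\tthrA\n"
    "\t\t\tlocus_tag\t16127996\n"
)

# Five names tell the parser to expect five tab-separated fields per line;
# skiprows=1 (an assumption) drops the one-field header line.
df = pd.read_csv(io.StringIO(sample), sep="\t",
                 names=["A", "B", "C", "D", "E"], skiprows=1)

# Feature rows fill A, B, C; qualifier rows fill D, E.
# Carry each feature's coordinates down onto its qualifier rows.
df[["A", "B", "C"]] = df[["A", "B", "C"]].ffill()
qualifiers = df.dropna(subset=["D"])
print(qualifiers)
```

After the forward fill, every qualifier row carries the start, stop, and feature type it belongs to, so it can be grouped or filtered by feature.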

User

#2 · Edited 2024-09-21 05:53:00

import re
import pandas as pd
with open('K12.tbl', 'r') as f:
    data = f.readlines()

# Get rid of newlines
data = [x.replace('\n','') for x in data]
# Get rid of row number and leading spaces
data = [re.sub(r'(\d+\s{2,})','',x) for x in data]
# Get rid of leading tabs (the literal text \t, as it appears in the printed output)
data = [re.sub(r'(\\t){3}','',x) for x in data]
# Get rid of footer lines
data = [x for x in data[1:] if 'Skipping' not in x]

# Get every 6th element which contains the number ranges you want
numbers = data[::6]
# Split the numbers into columns
numbers = [x.split('\\t') for x in numbers]

# Create a dataframe of the start/stop/cds values
df = pd.DataFrame(numbers, columns=['start','stop','cds'])

# Shift the start column back one row to create column that holds the next start number
df['next_start'] = df['start'].shift(-1)

# Fill the last next_start NAN with zero
df = df.fillna(0)

# Create binary map of which rows represent larger than allowed skips
big_diff = df['next_start'].astype(int) - df['stop'].astype(int) > 20

# Get list of indexes where the skips are too big
big_diff_index = big_diff[big_diff].index.values

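# The value you want is in the 3rd row after each set of numbers; split on ':' and return the value at the end.
# df row x corresponds to data[6*x] (numbers = data[::6]), so scale the index by 6 before adding the offset.
[data[6*x+3].split(':')[-1] for x in big_diff_index]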

Output

['P0AD86']
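The same idea can be sketched against the raw tab-separated lines rather than the printed DataFrame text. This is only a sketch under assumptions taken from the code above: a gap counts as "too big" when the next start minus the current stop exceeds 20, and the value wanted is the UniProtKB accession. The function name parse_feature_table and the inline sample are hypothetical, and coordinates are assumed to be plain integers.

```python
import io
import pandas as pd

def parse_feature_table(lines):
    """Collect one record per feature from tab-separated feature table lines."""
    records = []
    for line in lines:
        if line.startswith(">") or not line.strip():
            continue  # skip the ">Feature" header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[0]:
            # Feature line: start, stop, feature type
            records.append({"start": int(fields[0]),
                            "stop": int(fields[1]),
                            "feature": fields[2]})
        elif len(fields) >= 5 and records:
            # Qualifier line: three empty fields, then key and value
            key, value = fields[3], fields[4]
            if key == "inference" and value.startswith("UniProtKB/Swiss-Prot:"):
                records[-1]["uniprot"] = value.split(":")[-1]
            else:
                records[-1].setdefault(key, value)
    return pd.DataFrame(records)

# The five features shown in the question, reduced to the lines used here
sample = (
    ">Feature NC_000913<\n"
    "190\t255\tCDS\n\t\t\tgene\tthrL\n\t\t\tinference\tUniProtKB/Swiss-Prot:P0AD86\n"
    "337\t2799\tCDS\n\t\t\tgene\tthrA\n\t\t\tinference\tUniProtKB/Swiss-Prot:P00561\n"
    "2801\t3733\tCDS\n\t\t\tgene\tthrB\n\t\t\tinference\tUniProtKB/Swiss-Prot:P00547\n"
    "3734\t5020\tCDS\n\t\t\tgene\tthrC\n\t\t\tinference\tUniProtKB/Swiss-Prot:P00934\n"
    "5234\t5530\tCDS\n\t\t\tgene\tyaaX\n\t\t\tinference\tUniProtKB/Swiss-Prot:P75616\n"
)

features = parse_feature_table(io.StringIO(sample))
# Distance from each feature's stop to the next feature's start (NaN for the last)
gap = features["start"].shift(-1) - features["stop"]
result = features.loc[gap > 20, "uniprot"].tolist()
print(result)  # ['P0AD86', 'P00934']
```

On all five features this also flags thrC (5020 to 5234), which the output above does not show because the printed data it worked from stops partway through thrC.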
