Parsing structured but non-tabular text into pandas using regular expressions

import pandas as pd
import re

# write regular expressions
rx_dict = {
    'genome': re.compile(r'genome (?P<genome>.*)\n'),
    'source': re.compile(r'source (?P<source>.*)\n'),
    'reference': re.compile(r'reference (?P<reference>.*)\n'),
}

# line parser
def parse_line(line):
    for key, rx in rx_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    # if there are no matches
    return None, None

def parse_file(filepath):
    data = []
    # open the file and read through it line by line
    with open(filepath, 'r') as file_object:
        line = file_object.readline()
        while line:
            # at each line check for a match with a regex
            key, match = parse_line(line)
            # extract from each line
            if key == 'genome':
                genome = match.group('genome')
            if key == 'Source':
                Source = match.group('Source')
            if key == 'reference':
                Type = match.group('reference')
            while line.strip():
                row = {
                    'genome': genome,
                    'reference': reference,
                    'Source': Source,
                }
                data.append(row)
    data = pd.DataFrame(data)
    return data

3 Answers

User

Answer 1 · edited 2024-05-17 23:04:55

Using pandas alone, we can use str.split:

df = pd.read_csv('tmp.txt',sep='|',header=None)
s = df[0].str.split(' ',expand=True)

df_new = s.set_index([0,s.groupby(0).cumcount()]).unstack(0)

print(df_new)

                 1                      
0           genome reference      source
0  Bacteroidetes_4      B650  carotenoid
1  Desulfovibrio_3      B123  Polyketide
2              NaN      B839  flexirubin
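The same idea can be tried without a file on disk. The following is a minimal sketch, assuming the input has one "key value" pair per line; the sample text is hypothetical, reconstructed from the output above, and the names sample_text and df_new are only for illustration.

```python
import io
import pandas as pd

# Hypothetical sample input, inferred from the output shown above.
sample_text = """genome Bacteroidetes_4
reference B650
source carotenoid
genome Desulfovibrio_3
reference B123
source Polyketide
reference B839
source flexirubin
"""

# Read every line into a single column; '|' is assumed not to occur in the data.
df = pd.read_csv(io.StringIO(sample_text), sep='|', header=None)

# Split each line once into a key (column 0) and a value (column 1).
s = df[0].str.split(' ', n=1, expand=True)

# Number the repeats of each key (0, 1, 2, ...) and pivot the keys into columns.
df_new = s.set_index([0, s.groupby(0).cumcount()])[1].unstack(0)

print(df_new)
```

Note that this aligns values purely by order of occurrence: the n-th genome is paired with the n-th reference and the n-th source. If a record is missing a key, later rows may no longer line up with the records in the file.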

User

Answer 2 · edited 2024-05-17 23:04:55

Your problem is here, where the file is read:

with open(filepath, 'r') as file_object:        
    line = file_object.readline()        
    while line:

The value of line never changes, so the while loop runs forever.

Change it to:

with open(filepath, 'r') as file_object: 
    lines = file_object.readlines()
    for line in lines:
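Put together with the regular expressions from the question, a corrected parser might look like the sketch below. It is not the asker's final code: it assumes that a 'source' line closes a record and that the most recent genome carries over to later records, and the names parse_lines, current and sample are hypothetical. The key names are also made consistent (lowercase 'source', and 'reference' stored under its own key), since the mismatched names in the question would raise errors once the loop terminates.

```python
import io
import re
import pandas as pd

# Same patterns as in the question, applied to stripped lines.
rx_dict = {
    'genome': re.compile(r'^genome (?P<genome>.*)$'),
    'source': re.compile(r'^source (?P<source>.*)$'),
    'reference': re.compile(r'^reference (?P<reference>.*)$'),
}

def parse_line(line):
    """Return (key, match) for the first matching regex, else (None, None)."""
    for key, rx in rx_dict.items():
        match = rx.search(line)
        if match:
            return key, match
    return None, None

def parse_lines(lines):
    """Build a DataFrame from an iterable of lines, e.g. an open file object."""
    data = []
    current = {'genome': None, 'reference': None, 'source': None}
    for line in lines:  # the for loop moves on to a new line on every pass
        key, match = parse_line(line.strip())
        if key is None:
            continue
        current[key] = match.group(key)
        # Assumption: a 'source' line is the last line of a record.
        if key == 'source':
            data.append(dict(current))
    return pd.DataFrame(data, columns=['genome', 'reference', 'source'])

# Hypothetical sample standing in for open(filepath, 'r').
sample = io.StringIO(
    "genome Bacteroidetes_4\nreference B650\nsource carotenoid\n"
    "genome Desulfovibrio_3\nreference B123\nsource Polyketide\n"
    "reference B839\nsource flexirubin\n"
)
df = parse_lines(sample)
print(df)
```

For a real file, the same function can be called as parse_lines(file_object) inside the with block.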

User

Answer 3 · edited 2024-05-17 23:04:55

Have you tried setting a breakpoint inside the while loop and using a debugger to see what is happening?

You can simply use:

breakpoint()

with Python >= 3.7. For older versions:

import pdb

# your code

# for each part you are
# interested in the while 
# loop:
pdb.set_trace()

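Then run the script with the debugger enabled:

$ python3 -m pdb yourscript.py

Use "c" to continue to the next breakpoint. See the pdb documentation for details on the commands.

If your IDE has an integrated debugger, you can use that instead, which is less cumbersome.

By the way, the cause is probably that you use while line and then never read a new line. As long as the first line is not an empty string, the condition evaluates to True and execution stays in the while loop indefinitely. You could try iterating over the file with a for loop instead.

For example: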

with open('file.suffix', 'r') as fileobj:
    for line in fileobj:
        # your logic
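If you would rather keep the while loop, the usual fix is to read the next line at the end of every iteration. A minimal sketch, using a hypothetical in-memory file (io.StringIO) in place of open(filepath, 'r'); the list seen is only there to show which lines were visited:

```python
import io

# Hypothetical in-memory file standing in for open(filepath, 'r').
file_object = io.StringIO("genome A\nreference B\nsource C\n")

seen = []
line = file_object.readline()
while line:  # readline() returns '' only at end of file
    seen.append(line.strip())
    # Advance to the next line; without this the condition never changes.
    line = file_object.readline()

print(seen)
```

A blank line inside the file is read as '\n', which is still truthy, so the loop only stops at the real end of the file.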
