How to extract data from a text file based on a regular expression pattern

2 Answers

User
Answer 1 · edited 2024-09-28 21:29:31

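This seems to work for your sample text. I don't know whether a file can contain more than one extract, and I'm short on time here, so you will have to extend it if needed: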
    #!python3
    import pprint
    import re

    # Everything extracted from the file is collected here.
    Extract = {}

    # Each match_* function is one parser state. It returns True when it
    # consumed the line, and may switch _State to the next state.

    def match_notes(line):
        """Indented free-text lines after the Pattern: line"""
        global _State
        pattern = r"^\s+(.*)$"
        m = re.match(pattern, line.rstrip())
        if m:
            if 'notes' not in Extract:
                Extract['notes'] = []
            Extract['notes'].append(m.group(1))
            return True
        else:
            _State = match_sp
            return False

    def match_pattern(line):
        global _State
        pattern = r"^\s+Pattern: (.*)$"
        m = re.match(pattern, line.rstrip())
        if m:
            Extract['pattern'] = m.group(1)
            _State = match_notes
            return True
        return False

    def match_sp(line):
        """First line of >sp paragraph"""
        global _State
        pattern = r">sp\|([^|]+)\|(.*)$"
        m = re.match(pattern, line.rstrip())
        if m:
            if 'sp' not in Extract:
                Extract['sp'] = []
            spinfo = {
                'accession code': m.group(1),
                'other code': m.group(2),
            }
            Extract['sp'].append(spinfo)
            _State = match_sp_note
            return True
        return False

    def match_sp_note(line):
        """Second line of >sp paragraph"""
        global _State
        # Note text, followed by the species in square brackets
        pattern = r"^([^[]*)\[([^]]+)\]"
        m = re.match(pattern, line.rstrip())
        if m:
            spinfo = Extract['sp'][-1]
            spinfo['note'] = m.group(1).strip()
            spinfo['species'] = m.group(2).strip()
            spinfo['sequence'] = ''
            _State = match_sp_sequence
            return True
        return False

    def match_sp_range(line):
        """Last line of >sp paragraph"""
        global _State
        pattern = r"^\s+(\d+) - (\d+):\s+(.*)"
        m = re.match(pattern, line.rstrip())
        if m:
            spinfo = Extract['sp'][-1]
            spinfo['range'] = (m.group(1), m.group(2))
            spinfo['flags'] = m.group(3)
            _State = match_sp
            return True
        return False

    def match_sp_sequence(line):
        """Middle block of >sp paragraph"""
        global _State
        spinfo = Extract['sp'][-1]
        if re.match(r"^\s", line):
            # End of sequence. Check for pattern, reset state for sp.
            # re.search is used because the motif can occur anywhere
            # in the sequence, not only at its start.
            spinfo['ag_4gkst'] = bool(
                re.search(r"[AG].{4}GK[ST]", spinfo['sequence']))
            _State = match_sp_range
            return False
        spinfo['sequence'] += line.rstrip()
        return True

    def match_start(line):
        """Start of outer item"""
        global _State
        # \| is a literal pipe between the pattern id and the title
        pattern = r"^Hits for ([A-Z]+\d+)\|([^:]+) : (?:\[occurs (\w+)\])?"
        m = re.match(pattern, line.rstrip())
        if m:
            Extract['pattern_id'] = m.group(1)
            Extract['title'] = m.group(2)
            Extract['occurrence'] = m.group(3)
            _State = match_pattern
            return True
        return False

    _State = match_start

    def process_line(line):
        while True:
            state = _State
            if state(line):
                return True
            if _State is not state:
                # The state changed without consuming the line: retry it
                continue
            if len(line) == 0:
                return False
            print("Unexpected line:", line)
            print("State was:", _State)
            return False

    def process_file(filename):
        with open(filename, "r") as infile:
            for line in infile:
                process_line(line.rstrip())

    process_file("ploop.fa")
    pprint.pprint(Extract)
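One detail worth keeping in mind for the motif check on the assembled sequence: re.match only matches at the very start of a string, while re.search scans the whole string. A minimal sketch of the difference, using a made-up sequence that is not taken from the question's data:

```python
import re

MOTIF = re.compile(r"[AG].{4}GK[ST]")

# Made-up sequence for illustration; the motif begins at index 5.
sequence = "MKTLLGSPGSGKTTLAR"

at_start = MOTIF.match(sequence)    # anchored at position 0
anywhere = MOTIF.search(sequence)   # scans the whole string

print(at_start)            # None
print(anywhere.group(0))   # GSPGSGKT
print(anywhere.span())     # (5, 13)
```

For a motif that can sit anywhere inside a protein sequence, search is usually the call you want.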

User
Answer 2 · edited 2024-09-28 21:29:31

My first suggestion is to use a with statement when opening the file:

    with open("ploop.fa", "r") as file:
        FilterOnRegEx(file)
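The problem with your FilterOnRegEx method is this: if ploop in line. With string arguments, the in operator searches the string line for the exact text contained in ploop, so the regular expression is never interpreted as a pattern.
Instead, you need to compile the pattern text into a re object and then call its search method to do the matching.
That should help you move forward.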
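The compile-then-search step described above can be sketched as follows. This is a minimal illustration rather than the answer's original snippet: the helper name print_matching_lines and the sample line are assumptions made for the example.

```python
import re

ploop = "[AG].{4}GK[ST]"
pattern = re.compile(ploop)

def print_matching_lines(lines):
    """Print every line in which the compiled pattern matches somewhere.

    Hypothetical helper, named for this example only.
    """
    for line in lines:
        if pattern.search(line) is not None:
            print(line.rstrip("\n"))

# Made-up sample line, not taken from the question's data
sample = "MKTLLGSPGSGKTTLAR\n"

# Substring test: looks for the literal characters "[AG].{4}GK[ST]"
literal_hit = ploop in sample

# Regular expression test: interprets ploop as a pattern
regex_hit = pattern.search(sample) is not None

print(literal_hit, regex_hit)
print_matching_lines([sample])
```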
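Next, I suggest learning about generators. Printing the matching lines is fine, but it does not help you do anything further with them. I would probably change print to yield, so that I can process the data further, for example by extracting the parts I need and reformatting them for output.
As a simple demonstration:

    import re

    def FilterOnRegEx(file):
        ploop = "[AG].{4}GK[ST]"
        pattern = re.compile(ploop)
        for line in file:
            match = pattern.search(line)
            if match is not None:
                yield line

    with open("ploop.fa", "r") as file:
        for line in FilterOnRegEx(file):
            print(line, end="")  # each line already ends with a newline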
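As one possible way to take the generator idea a step further, the sketch below yields the matched text and its position instead of the whole line, so the caller can reformat it freely. The function name find_motifs and the sample lines are hypothetical and only serve the illustration.

```python
import re

PLOOP = re.compile(r"[AG].{4}GK[ST]")

def find_motifs(lines):
    """Yield (line_number, start_offset, matched_text) for every match."""
    for line_number, line in enumerate(lines, start=1):
        for match in PLOOP.finditer(line):
            yield line_number, match.start(), match.group(0)

# Made-up input lines, not taken from the question's data
sample_lines = ["MKTLLGSPGSGKTTLAR\n", "MSDNNNQQ\n", "AAAAAGKSLL\n"]

hits = list(find_motifs(sample_lines))
for line_number, start, text in hits:
    print(f"line {line_number}, offset {start}: {text}")
```

Because find_motifs is a generator, it works the same way on an open file object as on a list of strings.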
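Addendum: I ran the code I posted above against the data sample you posted, and it printed some lines and not others. In other words, the regular expression does match some lines and does not match others. So far, so good. However, the data you need is not all on a single line of input! That means filtering individual lines against the pattern is not enough. (Unless, of course, I am not seeing the correct line breaks in the question.) With the data laid out the way it appears in the question, you need to implement a more robust parser that keeps state, so that it knows when a record starts, when it ends, and what any given line in the middle of a record is.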
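To illustrate why per-line filtering falls short, here is a minimal sketch of a stateful reader. It assumes a simplified layout in which a line starting with ">" opens a record and all following non-empty lines are sequence data; the real file also contains description and range lines, which would need extra states. The name iter_records and the sample data are made up for this example.

```python
import re

PLOOP = re.compile(r"[AG].{4}GK[ST]")

def iter_records(lines):
    """Yield (header, sequence) pairs; a line starting with '>' opens a record."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                # A new header closes the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        # Flush the last record at end of input
        yield header, "".join(chunks)

# Made-up data: in seq1 the motif is split across two lines
sample = [
    ">seq1\n",
    "MKTLLGSPGS\n",
    "GKTTLAR\n",
    ">seq2\n",
    "MSDNNNQQ\n",
]

matching = [h for h, seq in iter_records(sample) if PLOOP.search(seq)]
print(matching)
```

In this sample no single line matches the pattern, yet seq1 is found once its sequence lines are joined, which is exactly the case a line-by-line filter misses.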
