How to add information found with regular expressions to a CSV file

import re
import csv

with open('doubt2.txt', 'r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR = r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL = r'^Deadline\s+(\d+ \w+ \d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

matchesHOR = patternHOR.finditer(f_contents)
matchesOD = patternOD.finditer(f_contents)
matchesDL = patternDL.finditer(f_contents)

with open("result.csv", "w", newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    for match in matchesHOR:
        csvfile.writerow([match.group(1), match.group(2).replace('\n', ' '), '', ''])
    for match in matchesOD:
        csvfile.writerow(['', '', match.group(1), ''])
    for match in matchesDL:
        csvfile.writerow(['', '', '', match.group(1)])

2 Answers

User

#1 · edited 2024-09-28 21:33:38

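You need to rearrange things so that all the items for a row are written in a single call. The approach here is to use match_hor to find the start of each title, search for match_od from the end of that match, and then search for match_dl from the end of match_od.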

import re
import csv
    
with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR = r'\n(TITLE-\S+-\d{2}-\d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening date\s+(\d{1,2} \w+ \d{4})\s*?'
regexDL =r'^Deadline\s+(\d+ \w+ \d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

with open("result.csv", "w",newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    
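    for match_hor in patternHOR.finditer(f_contents):
        # Topic ID and title; a title spanning several lines is joined with spaces
        code, title = match_hor.group(1), match_hor.group(2).replace('\n', ' ')
        offset = match_hor.end()

        # First opening date that appears after this title
        match_od = patternOD.search(f_contents[offset:])
        offset += match_od.end()
        opening = match_od.group(1)

        # First deadline that appears after that opening date
        match_dl = patternDL.search(f_contents[offset:])
        offset += match_dl.end()
        deadline = match_dl.group(1)

        # Write all four fields as one row
        csvfile.writerow([code, title.strip(), opening, deadline])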

This gives you a result.csv containing the following:

Topic ID,Title,Opening date,Deadline
TITLE-SDFSD-DFDS-SFDS-01-01,This is the title 1 that  is split into two lines with a blank line in the middle,15 Apr 2021,26 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-02,This is the title2 in one single line,15 March 2021,17 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-03,This is the title3 that is too long and takes two lines,15 May 2021,26 Sep 2021
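
To try this approach without the original doubt2.txt, here is a minimal sketch that runs on an in-memory string. The sample text is an assumption, made up to resemble the layout the patterns imply; it is not the asker's file. `extract_rows` is a hypothetical helper name. Instead of slicing the string, it passes a start position to `pattern.search`, and it skips a title when no opening date or deadline is found after it.

```python
import csv
import io
import re

# Hypothetical sample text; the real doubt2.txt is not shown in the question.
SAMPLE = """
TITLE-AAA-BBB-01-01: First title that
wraps onto a second line
Conditions
Opening date 15 Apr 2021
Deadline 26 Aug 2021

TITLE-AAA-BBB-01-02: Second title
Conditions
Opening date 15 March 2021
Deadline 17 Aug 2021
"""

# [:;] is used here; a "|" inside a character class is a literal pipe.
PATTERN_HOR = re.compile(r'\n(TITLE-\S+-\d{2}-\d{2})[:;](.*?)^Conditions',
                         re.MULTILINE | re.DOTALL)
PATTERN_OD = re.compile(r'^Opening date\s+(\d{1,2} \w+ \d{4})', re.MULTILINE)
PATTERN_DL = re.compile(r'^Deadline\s+(\d+ \w+ \d+)', re.MULTILINE)


def extract_rows(text):
    """Return one [topic_id, title, opening, deadline] list per title found."""
    rows = []
    for match_hor in PATTERN_HOR.finditer(text):
        # Search for the opening date starting where the title match ended.
        match_od = PATTERN_OD.search(text, match_hor.end())
        if match_od is None:
            continue
        # Search for the deadline starting where the opening date ended.
        match_dl = PATTERN_DL.search(text, match_od.end())
        if match_dl is None:
            continue
        # Collapse newlines and repeated spaces inside the title.
        title = ' '.join(match_hor.group(2).split())
        rows.append([match_hor.group(1), title,
                     match_od.group(1), match_dl.group(1)])
    return rows


rows = extract_rows(SAMPLE)

# Write to an in-memory buffer so the sketch leaves no file behind.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
writer.writerows(rows)
print(buffer.getvalue())
```

Collapsing whitespace with `split()` and `join()` also removes the double spaces that `replace('\n', ' ')` leaves behind when a title contains a blank line.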

User

#2 · edited 2024-09-28 21:33:38

I suggest using positive lookahead, lookbehind, and named groups, as in the following code:

>>> regexHOR = r'(?P<TopicID>TITLE-\S+-\d{2}-\d{2})[:;]\s*(?P<Title>[\w\s]+(?=Conditions))'
>>>
>>> regexOD = r'(?P<OpeningDate>(?<=Opening date )\d{1,2} \w+ \d{4})'
>>>
>>> regexDL = r'(?P<DeadLine>(?<=Deadline )\d+ \w+ \d+)'
>>>
>>> regex_pattern = re.compile('.*?'.join([regexHOR, regexOD, regexDL]), re.MULTILINE | re.DOTALL)
>>>
>>> for match in regex_pattern.finditer(f_contents):
...     csvfile.writerow([match.group('TopicID'), match.group('Title'),
...                       match.group('OpeningDate'), match.group('DeadLine')])

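Every call to csvfile.writerow writes a new row. That is why the original code, which calls it once per match in three separate loops, never puts the items belonging to one record on the same row.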
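
The difference can be seen in a small sketch that writes to in-memory buffers. The field values are made up for illustration and stand in for the regex matches of a single record.

```python
import csv
import io

# Made-up values standing in for one record's regex matches.
topic_id, title, opening, deadline = ('TITLE-X-01-01', 'Some title',
                                      '15 Apr 2021', '26 Aug 2021')

# Three separate writerow calls produce three rows with empty cells.
scattered = io.StringIO()
writer = csv.writer(scattered)
writer.writerow([topic_id, title, '', ''])
writer.writerow(['', '', opening, ''])
writer.writerow(['', '', '', deadline])

# A single writerow call produces one complete row.
combined = io.StringIO()
csv.writer(combined).writerow([topic_id, title, opening, deadline])

scattered_lines = scattered.getvalue().splitlines()
combined_lines = combined.getvalue().splitlines()
print(scattered_lines)
print(combined_lines)
```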
