How can I extract multiple citances delimited by tags from text using regular expressions?

Posted 2024-06-24 12:37:43

Python中文网 / Q&A channel

I have a manually entered input file consisting of citances, each formatted as follows:

< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.< /S>

Here is my current approach, using Python's re module:

citance = citance[citance.find(">")+1:citance.rfind("<")]
fd.write(citance+"\n")

I am trying to extract everything between the first closing angle bracket (">") and the last opening angle bracket ("<"). However, this approach fails when a line contains multiple citances, because the intermediate tags are also captured in the output:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.
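The failure can be reproduced with the sample line above (a minimal sketch; the citance text is abbreviated here):

```python
# Slicing from the first ">" to the last "<" keeps the middle tags.
citance = ('< S sid ="2" ssid = "2">It differs from previous work.< /S>'
           '< S sid ="3" ssid = "3">Previous work uses a secondary classifier.< /S>')
extracted = citance[citance.find(">") + 1:citance.rfind("<")]
print(extracted)  # the intermediate "< /S>< S ...>" tags remain in the output
```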

My desired output:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier. Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.

How can I achieve this correctly?


3 Answers

I think this is what you are looking for:

import re

string = ">here is some text<>here is some more text<"
matches = re.findall(">(.*?)<", string)
for match in matches:
    print(match)

It seems the problem you ran into was matching too much. A match for "here is some more text<" could run from the first ">" in the string to the last "<", since those are ">" and "<", ignoring all the characters in between. The non-greedy ".*?" idiom makes each match as short as possible, which yields the most matches.
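Applied to the citance format from the question, the same non-greedy pattern recovers both sentences. One wrinkle (a minimal sketch with abbreviated text): wherever a closing tag is immediately followed by an opening tag, ">" and "<" are adjacent, so re.findall also returns empty strings that need filtering:

```python
import re

citance = ('< S sid ="2" ssid = "2">It differs from previous work.< /S>'
           '< S sid ="3" ssid = "3">Previous work uses a secondary classifier.< /S>')

# Filter out the empty matches produced between back-to-back tags.
sentences = [m for m in re.findall(">(.*?)<", citance) if m]
print(" ".join(sentences))
```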

Instead of the re module, take a look at the bs4 (BeautifulSoup) library.

It is an XML/HTML parser, so you can get everything between matching tags.

For your case, it would look something like this:

from bs4 import BeautifulSoup

# Tags normalized here (no stray spaces inside the angle brackets),
# since a parser will not recognize "< S" as a tag.
xml_text = '<S sid="2" ssid="2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.</S><S sid="3" ssid="3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.</S>'

text_soup = BeautifulSoup(xml_text, 'lxml')

# lxml lowercases tag names, so search for 's' rather than 'S'
output = text_soup.find_all('s', attrs={'sid': '2'})

output is a list of matching tags, and output[0].get_text() will contain the text:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.

Additionally, if you just want to strip the HTML tags:

import re

xml_text = '< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.< /S>'

re.sub('<.*?>', '', xml_text)

That is what I would do.
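Putting the bs4 approach together, here is a sketch that extracts every citance and joins them. It uses Python's built-in 'html.parser' backend (so no lxml dependency) and again assumes the tags carry no stray spaces:

```python
from bs4 import BeautifulSoup

xml_text = ('<S sid="2" ssid="2">It differs from previous work.</S>'
            '<S sid="3" ssid="3">Previous work uses a secondary classifier.</S>')

soup = BeautifulSoup(xml_text, 'html.parser')
# HTML parsers lowercase tag names, so search for 's' rather than 'S'
sentences = [tag.get_text() for tag in soup.find_all('s')]
print(" ".join(sentences))
```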

I would use Python's re module, like this:

re.findall(r'\">(.*?)<', text_to_parse)

This will return a list of citances, whether there is one or many; if you want a single unified text, join them with " ".join(...).
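A sketch of that on the question's data (abbreviated): the \" at the start of the pattern anchors each match to the closing quote of the last attribute, so the empty gaps between adjacent tags are never matched:

```python
import re

text_to_parse = ('< S sid ="2" ssid = "2">It differs from previous work.< /S>'
                 '< S sid ="3" ssid = "3">Previous work uses a secondary classifier.< /S>')

# Each match starts right after an opening tag's final '">' and ends
# at the next "<", i.e. exactly one citance per match.
citances = re.findall(r'\">(.*?)<', text_to_parse)
print(" ".join(citances))
```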
