How can I extract multiple citances delimited by tags from text using regular expressions?

Posted 2024-06-24 12:37:43

Python中文网 / Q&A channel

I have a manually entered input file consisting of citances, each formatted as follows:

< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.< /S>

Here is my current approach, using Python's re module:

citance = citance[citance.find(">")+1:citance.rfind("<")]
fd.write(citance+"\n")

I am trying to extract everything between the first closing angle bracket (">") and the last opening angle bracket ("<"). However, this approach fails when a line contains multiple citances, because the intermediate tags are also captured in the output:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.
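The failure can be reproduced with the sample line above (a minimal sketch; the citance text is abbreviated here):

```python
# Slicing from the first ">" to the last "<" keeps the middle tags.
citance = ('< S sid ="2" ssid = "2">It differs from previous work.< /S>'
           '< S sid ="3" ssid = "3">Previous work uses a secondary classifier.< /S>')
extracted = citance[citance.find(">") + 1:citance.rfind("<")]
print(extracted)  # the intermediate "< /S>< S ...>" tags remain in the output
```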

My desired output:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier. Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.

How can I achieve this correctly?


3 Answers

I think this is what you are looking for:

import re

string = ">here is some text<>here is some more text<"
matches = re.findall(">(.*?)<", string)
for match in matches:
    print(match)

It seems the problem you ran into was matching too much. A match for "here is some more text<" could run from the first ">" in the string to the last "<", since those are ">" and "<", ignoring all the characters in between. The non-greedy ".*?" idiom makes each match as short as possible, which yields the most matches.
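Applied to the citance format from the question, the same non-greedy pattern recovers both sentences. One wrinkle (a minimal sketch with abbreviated text): wherever a closing tag is immediately followed by an opening tag, ">" and "<" are adjacent, so re.findall also returns empty strings that need filtering:

```python
import re

citance = ('< S sid ="2" ssid = "2">It differs from previous work.< /S>'
           '< S sid ="3" ssid = "3">Previous work uses a secondary classifier.< /S>')

# Filter out the empty matches produced between back-to-back tags.
sentences = [m for m in re.findall(">(.*?)<", citance) if m]
print(" ".join(sentences))
```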

Instead of the re module, take a look at the bs4 (BeautifulSoup) library.

It is an XML/HTML parser, so you can get everything between matching tags.

For your case, it would look something like this:

from bs4 import BeautifulSoup

# Tags normalized here (no stray spaces inside the angle brackets),
# since a parser will not recognize "< S" as a tag.
xml_text = '<S sid="2" ssid="2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.</S><S sid="3" ssid="3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.</S>'

text_soup = BeautifulSoup(xml_text, 'lxml')

# lxml lowercases tag names, so search for 's' rather than 'S'
output = text_soup.find_all('s', attrs={'sid': '2'})

output is a list of matching tags, and output[0].get_text() will contain the text:

It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.

Additionally, if you just want to strip the HTML tags:

import re

xml_text = '< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.< /S>'

re.sub('<.*?>', '', xml_text)

That is what I would do.
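Putting the bs4 approach together, here is a sketch that extracts every citance and joins them. It uses Python's built-in 'html.parser' backend (so no lxml dependency) and again assumes the tags carry no stray spaces:

```python
from bs4 import BeautifulSoup

xml_text = ('<S sid="2" ssid="2">It differs from previous work.</S>'
            '<S sid="3" ssid="3">Previous work uses a secondary classifier.</S>')

soup = BeautifulSoup(xml_text, 'html.parser')
# HTML parsers lowercase tag names, so search for 's' rather than 'S'
sentences = [tag.get_text() for tag in soup.find_all('s')]
print(" ".join(sentences))
```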

I would use Python's re module, like this:

re.findall(r'\">(.*?)<', text_to_parse)

This will return a list of citances, whether there is one or many; if you want a single unified text, join them with " ".join(...).
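A sketch of that on the question's data (abbreviated): the \" at the start of the pattern anchors each match to the closing quote of the last attribute, so the empty gaps between adjacent tags are never matched:

```python
import re

text_to_parse = ('< S sid ="2" ssid = "2">It differs from previous work.< /S>'
                 '< S sid ="3" ssid = "3">Previous work uses a secondary classifier.< /S>')

# Each match starts right after an opening tag's final '">' and ends
# at the next "<", i.e. exactly one citance per match.
citances = re.findall(r'\">(.*?)<', text_to_parse)
print(" ".join(citances))
```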
