I have a manually created input file consisting of citances, each in the following format:
< S sid ="2" ssid = "2">It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.< /S>
Here is my current approach in Python:
citance = citance[citance.find(">")+1:citance.rfind("<")]
fd.write(citance+"\n")
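I am trying to extract everything from the first closing angle bracket (">") to the last opening angle bracket ("<"). However, this fails when there are multiple citances, because the intermediate tags are also extracted into the output: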
It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier.< /S>< S sid ="3" ssid = "3">Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.
My desired output:
It differs from previous machine learning-based NERs in that it uses information from the whole document to classify each word, with just one classifier. Previous work that involves the gathering of information from the whole document often uses a secondary classifier, which corrects the mistakes of a primary sentence- based classifier.
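How can I implement this correctly?
I think this is what you are looking for.
It looks like your problem is that the match captures too much. A match for ">more text here<" can stretch from the first character of the string to the last, because those are ">" and "<", ignoring the ones in between. The non-greedy ".*?" idiom makes each match as short as possible, so you get the maximum number of hits.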
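To make the greedy versus non-greedy difference concrete, here is a small sketch. The sample string is a shortened, made-up citance in the same format as the question, and the variable names are only illustrative. Note that the non-greedy pattern also reports the empty gap between "< /S>" and the next opening tag, so blank hits are filtered out.

```python
import re

# Hypothetical shortened sample in the same format as the citances above.
sample = (
    '< S sid ="2" ssid = "2">First sentence.< /S>'
    '< S sid ="3" ssid = "3">Second sentence.< /S>'
)

# Greedy: ".*" runs on to the LAST "<", so the middle tags are swallowed.
greedy = re.findall(r">(.*)<", sample)

# Non-greedy: ".*?" stops at the NEXT "<", one hit per gap between ">" and "<".
lazy = re.findall(r">(.*?)<", sample)

# The gap between "< /S>" and the following "< S ...>" is empty; drop blank hits.
sentences = [hit for hit in lazy if hit.strip()]

print(greedy)               # a single hit that still contains the middle tags
print(" ".join(sentences))  # the sentences joined by a space
```

The greedy list has one element that still includes the intermediate tags, which is the same failure as the find/rfind approach in the question.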
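Instead of using the re module, take a look at the bs4 library.
It is an XML/HTML parser, so you can get everything between the tags.
In your case, it would look something like this:
The output will contain the text: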
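A minimal sketch of the parser-based idea, under a few assumptions: it uses the standard library's html.parser instead of bs4 so that it runs without extra installs, and the names TextCollector and citance_text are hypothetical. Because html.parser treats a "<" followed by a space as plain text rather than the start of a tag, the sketch also assumes it is acceptable to normalise "< S" and "< /S" to "<S" and "</S" first.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-blank text nodes.
        if data.strip():
            self.chunks.append(data.strip())


def citance_text(raw):
    # Assumption: removing the space after "<" is safe for this input format.
    normalized = re.sub(r"<\s+", "<", raw)
    parser = TextCollector()
    parser.feed(normalized)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '< S sid ="2" ssid = "2">First sentence.< /S>'
    '< S sid ="3" ssid = "3">Second sentence.< /S>'
)
print(citance_text(sample))
```

With bs4 installed, the same normalised string could be passed to a parser object there instead; the collecting step is what matters.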
Also, if you only want to strip the HTML tags:
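One possible way to do that with the re module is sketched below. The function name strip_tags and the sample string are made up for illustration; the pattern assumes that no ">" character ever appears inside a tag or its attribute values.

```python
import re


def strip_tags(raw):
    # Replace anything that looks like <...> with a space so sentences stay apart.
    without_tags = re.sub(r"<[^>]*>", " ", raw)
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join(without_tags.split())


sample = (
    '< S sid ="2" ssid = "2">First sentence.< /S>'
    '< S sid ="3" ssid = "3">Second sentence.< /S>'
)
print(strip_tags(sample))
```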
That should do it.
I would use Python's regex module, re. Doing it this way returns a list of one or more citances; if you want a single unified text, you can join them with " ".join(....).
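A sketch of how that might look, assuming the tags always take the form "< S ...>" and "< /S>" as in the question. The pattern and the names CITANCE_RE and extract_citances are illustrative.

```python
import re

# Opening "< S ...>" tag, lazily captured body, closing "< /S>" tag.
# Optional whitespace is allowed wherever the sample data has it.
CITANCE_RE = re.compile(r"<\s*S\b[^>]*>(.*?)<\s*/\s*S\s*>", re.DOTALL)


def extract_citances(raw):
    # One list entry per < S ...> ... < /S> pair.
    return [text.strip() for text in CITANCE_RE.findall(raw)]


sample = (
    '< S sid ="2" ssid = "2">First sentence.< /S>'
    '< S sid ="3" ssid = "3">Second sentence.< /S>'
)
citances = extract_citances(sample)
print(citances)
print(" ".join(citances))
```

Anchoring the pattern on the tag name avoids the empty hits that a bare ">(.*?)<" produces between adjacent tags.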