Using Python to extract strings that start with a symbol from text and combine them with the other strings

Posted 2024-10-03 11:26:44


I have a file in which a large number of URLs are mixed with plain text. Example:

'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'

I would like to get:

'Reference Informal ACADEMIC type school ACADEMIC type'

I tried:

substr1 = re.findall(r"#(\w+)", text1)

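This does part of the job, but I don't know how to extract the pieces I want and combine them with the other words in the text. Basically, I need to get rid of the URLs and the "#" symbols. Can anyone help me?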
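For reference, a quick check of what that attempt returns, assuming `text1` holds the sample string from the question. The pattern captures only the single word after each "#", so the plain words between the URLs are lost:

```python
import re

# Sample string from the question, split across lines for readability
text1 = ('http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference '
         'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type '
         'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type')

# \w+ stops at the first non-word character, so only the word
# immediately after each '#' is captured
fragments = re.findall(r"#(\w+)", text1)
print(fragments)  # ['Reference', 'Informal', 'school']
```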


2 Answers

Turn the problem around and remove the URLs instead:

re.sub(r'\bhttps?://[^# ]+#?', '', text1)

Demo:

>>> import re
>>> text1 = 'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'
>>> re.sub(r'\bhttps?://[^# ]+#?', '', text1)
'Reference Informal ACADEMIC type school ACADEMIC type'

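The expression finds anything that starts with http:// or https://, then removes every following character that is not a hash or a space, plus an optional trailing hash. Whatever comes after the hash (the fragment name) is left in place.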
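A minimal sketch of the same substitution wrapped in a reusable function. The names `URL_PREFIX` and `strip_urls` and the `example.org` sample are illustrative assumptions, not part of the original answer. The whitespace cleanup is an extra step added here, because a URL without a "#" fragment leaves a doubled space behind when it is removed:

```python
import re

# Same pattern as in the answer, compiled once for reuse (name is hypothetical)
URL_PREFIX = re.compile(r'\bhttps?://[^# ]+#?')

def strip_urls(text):
    """Remove each URL up to and including its '#', keeping the fragment name."""
    cleaned = URL_PREFIX.sub('', text)
    # Collapse whitespace runs left behind by URLs that had no fragment
    return ' '.join(cleaned.split())

# Made-up sample: one URL with a fragment, one without
sample = 'http://example.org/onto.rdf#Reference plain http://example.org/page text'
result = strip_urls(sample)
print(result)  # Reference plain text
```

Note that `' '.join(cleaned.split())` also normalizes any other whitespace in the input, which may or may not be wanted.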

Using re.findall:

>>> import re
>>> s = 'http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Reference http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#Informal ACADEMIC type http://oaei.ontologymatching.org/2011/benchmarks/101/onto.rdf#school ACADEMIC type'
>>> ''.join(re.findall(r'#(.*?)(?=https?:|$)', s))
'Reference Informal ACADEMIC type school ACADEMIC type'

Explanation: http://regex101.com/r/dV5uR2
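A short sketch of what this pattern captures before the pieces are joined, using a made-up `example.org` string rather than the data from the question. Each match starts after a "#" and runs lazily up to the next "http:" or "https:" or the end of the string, so any text that appears before the first "#" is not captured:

```python
import re

# Made-up sample with plain text before the first URL
s = 'intro http://example.org/a#One two http://example.org/b#Three'

# Lazy capture after each '#', stopping just before the next URL or at the end
pieces = re.findall(r'#(.*?)(?=https?:|$)', s)
print(pieces)  # ['One two ', 'Three']

joined = ''.join(pieces)
print(joined)  # One two Three
```

For input like the question's, where every stretch of plain text follows a "#", this gives the same result as the re.sub approach. When plain text can precede the first URL, as with "intro" here, the substitution approach keeps it and this one does not.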
