Extracting text from HTML with regular expressions in Python

Posted 2024-09-27 21:24:32


Part of the original text is shown below; it is stored in a txt file. The HTML source is similar but incomplete.

<span style="cursor:pointer" onmousedown="HI466('1056').click()">Steffen Eddine (PhD) (SEED)</span></span></div><script>HI466("100256").checked=T</script><div id=“k62” style="left:95px;top:15px;width:32;height:25;"><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div><script>HI466("239021").checked=T</script><div id=“k62” style="left:65px;top:15px;width:32;height:25;"><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div><script>HI466("230866").checked=T</script><div id=“k62” style="left:25px;top:35px;width:32;height:25;"><span id="425" name="511"><span style="cursor:pointer" onmousedown="HI466('2421').click()">Raksha Khaldoun (RAKH)</span></span></div><script>HI466("242511").checked=T</script><div id=“k62” style="left:95px;top:35px;width:32;height:25;"><span id="176" name="146"><span style="cursor:pointer" onmousedown="HI466('2176').click()">Yash Chevalier (YACH)</span>

What I want is to get the names out of it, such as "Steffen Eddine (PhD) (SEED)".

Clearly, they all start with the same opening span tag, which I tried searching for:

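import re

# Read the saved page fragment into one string.
with open("original_text.txt", "r") as myfile:
    data = myfile.read()

# Single quotes outside, so the double quotes inside the pattern need no escaping.
aa = re.search('<span style="cursor:pointer" onmousedown="', data)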

How can I pick them out? (I also tried BeautifulSoup, without success.)


A user submitted the following, and I find it very close to what I need.

However, it only returns five copies of <span style="cursor:pointer" onmousedown=". What else do I need to do?

for m in re.finditer('<span style="cursor:pointer" onmousedown="', data, re.IGNORECASE | re.MULTILINE):
    print(m.group(0))

Tags: div, re, id, style, top, script, left, cursor
3 answers

The same thing using BeautifulSoup:

from bs4 import BeautifulSoup  # BeautifulSoup 4 (the old "BeautifulSoup" module is Python 2 only)
data = '''<span style="cursor:pointer" onmousedown="HI466('1056').click()">Steffen Eddine (PhD) (SEED)</span></span></div><script>HI466("100256").checked=T</script><div id=“k62” style="left:95px;top:15px;width:32;height:25;"><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div><script>HI466("239021").checked=T</script><div id=“k62” style="left:65px;top:15px;width:32;height:25;"><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div><script>HI466("230866").checked=T</script><div id=“k62” style="left:25px;top:35px;width:32;height:25;"><span id="425" name="511"><span style="cursor:pointer" onmousedown="HI466('2421').click()">Raksha Khaldoun (RAKH)</span></span></div><script>HI466("242511").checked=T</script><div id=“k62” style="left:95px;top:35px;width:32;height:25;"><span id="176" name="146"><span style="cursor:pointer" onmousedown="HI466('2176').click()">Yash Chevalier (YACH)</span>'''
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
# Keep every <span> whose content is a single string.
print([s.string for s in soup.find_all('span') if s.string])

Never use regex to parse HTML or XML files; just use a dedicated module such as lxml or BeautifulSoup:

>>> from lxml.html import fromstring
>>> s="""<span style="cursor:pointer" onmousedown="HI466('1056').click()">Steffen Eddine (PhD) (SEED)</span></span></div><script>HI466("100256").checked=T</script><div id=“k62” style="left:95px;top:15px;width:32;height:25;"><span id="321" name="021"><span style="cursor:pointer" onmousedown="HI466('2321').click()">Petra Schmidt (PESC)</span></span></div><script>HI466("239021").checked=T</script><div id=“k62” style="left:65px;top:15px;width:32;height:25;"><span id="306" name="366"><span style="cursor:pointer" onmousedown="HI466('2366').click()">Peter Kumar (PEKU)</span></span></div><script>HI466("230866").checked=T</script><div id=“k62” style="left:25px;top:35px;width:32;height:25;"><span id="425" name="511"><span style="cursor:pointer" onmousedown="HI466('2421').click()">Raksha Khaldoun (RAKH)</span></span></div><script>HI466("242511").checked=T</script><div id=“k62” style="left:95px;top:35px;width:32;height:25;"><span id="176" name="146"><span style="cursor:pointer" onmousedown="HI466('2176').click()">Yash Chevalier (YACH)</span>"""
>>> st=fromstring(s)
>>> [c.text for c in st.getchildren() if c.text]
['Steffen Eddine (PhD) (SEED)', 'HI466("100256").checked=T', 'HI466("239021").checked=T', 'HI466("230866").checked=T', 'HI466("242511").checked=T']

Here you can use lxml to extract the text and then post-process the output as needed to get the result you want!
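As a variant of the same parse-first idea that needs no third-party package, the sketch below swaps lxml for the standard library's html.parser.HTMLParser. It assumes every wanted name sits directly inside a span that carries an onmousedown attribute, as in the sample; NameCollector and sample are illustrative names, not part of the original answer, and the sample is shortened to two entries.

```python
from html.parser import HTMLParser

# Shortened sample in the same shape as the question's data (assumption: two entries are enough to illustrate).
sample = (
    '<span style="cursor:pointer" onmousedown="HI466(\'1056\').click()">'
    'Steffen Eddine (PhD) (SEED)</span></span></div>'
    '<script>HI466("100256").checked=T</script>'
    '<div style="left:95px;top:15px;width:32;height:25;">'
    '<span id="321" name="021">'
    '<span style="cursor:pointer" onmousedown="HI466(\'2321\').click()">'
    'Petra Schmidt (PESC)</span></span></div>'
)


class NameCollector(HTMLParser):
    """Collect text found inside <span> tags that have an onmousedown attribute."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name_span = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and "onmousedown" in dict(attrs):
            self._in_name_span = True

    def handle_endtag(self, tag):
        # Any closing </span> ends the current name, including stray ones.
        if tag == "span":
            self._in_name_span = False

    def handle_data(self, data):
        # Script bodies also arrive here, but the flag is off at that point.
        if self._in_name_span and data.strip():
            self.names.append(data.strip())


parser = NameCollector()
parser.feed(sample)
parser.close()
print(parser.names)
```

Because the script text is seen while no clickable span is open, it is skipped without any extra filtering step.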

See the demo here: https://regex101.com/r/gE8rD2/1

import re

# Capture the text that follows a closing '">' up to the next '<'.
p = re.compile(r'">([^<]+)')
test_str = "your string"

print(p.findall(test_str))
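The attempt in the question prints only the tag prefix because m.group(0) is the whole match, and the pattern stops at onmousedown=". A capturing group placed after the opening tag returns the name instead. The following is a minimal sketch of that idea, assuming each name directly follows an opening tag with exactly this attribute order; sample, NAME_RE and extract_names are illustrative names, and the sample is shortened to two entries.

```python
import re

# Shortened sample in the same shape as the question's data (assumption: two entries only).
sample = (
    '<span style="cursor:pointer" onmousedown="HI466(\'1056\').click()">'
    'Steffen Eddine (PhD) (SEED)</span></span></div>'
    '<script>HI466("100256").checked=T</script>'
    '<div style="left:95px;top:15px;width:32;height:25;">'
    '<span id="321" name="021">'
    '<span style="cursor:pointer" onmousedown="HI466(\'2321\').click()">'
    'Petra Schmidt (PESC)</span></span></div>'
)

# Match the opening tag, skip the rest of its attributes up to ">",
# then capture everything before the next "<".
NAME_RE = re.compile(
    r'<span style="cursor:pointer" onmousedown="[^>]*>([^<]+)',
    re.IGNORECASE,
)


def extract_names(html):
    """Return the text that follows each clickable span's opening tag."""
    # With exactly one group in the pattern, findall returns the group contents.
    return [name.strip() for name in NAME_RE.findall(html)]


print(extract_names(sample))
```

Anchoring on the full opening tag is stricter than the bare '">' prefix, so text after other quoted attributes is not picked up; the cost is that any change in attribute order or spacing in the real page would stop it matching.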
