Extracting the text between HTML tags

Posted 2024-09-29 21:28:52


I have this string:

In December 2011, Norway's largest online sex shop hemmelig.com was <a href="http://www.dazzlepod.com/hemmelig/?page=93" target="_blank" rel="noopener">hacked by a collective calling themselves &quot;Team Appunity&quot;</a>. The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.

(Don't ask.)

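The string contains an HREF link to the site itself. What I need to do is keep the text between the <a href=""></a> tags while dropping the tags themselves, so the final result should look like this: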

In December 2011, Norway's largest online sex shop hemmelig.com was hacked by a collective calling themselves &quot;Team Appunity&quot;. The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.

So far, all I have managed is to match the whole tag with a regular expression and replace it with nothing:

import re

def get_unlinked_description(descrip):
    html_tag_regex = re.compile(r"<.+>", re.I)
    return html_tag_regex.sub("", descrip)

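However, as you might expect, the greedy pattern matches everything from the first < to the last >, so the output loses the entire link, text included: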

In December 2011, Norway's largest online sex shop hemmelig.com was . The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes

How can I keep the text between the tags and remove only the tags, without deleting the whole link?


1 answer
User
#1 · Posted 2024-09-29 21:28:52

You are probably looking for Beautiful Soup.

As for your implementation, the code to use is:

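from bs4 import BeautifulSoup

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

# get_text() returns all text in the document with the tags removed
soup.get_text()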

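Here html_doc is the string or document to parse, and 'html.parser' names the parser Beautiful Soup should use (in this case the HTML parser from Python's standard library).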
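If installing Beautiful Soup is not an option, roughly the same tag stripping can be sketched with the standard library's html.parser module alone. This is only a minimal sketch: the TextExtractor class and the strip_tags helper are hypothetical names made up for illustration, and the sample is a shortened version of the string from the question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities such as &quot;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves are skipped
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Shortened sample based on the string in the question
sample = (
    'hemmelig.com was <a href="http://www.dazzlepod.com/hemmelig/?page=93" '
    'target="_blank" rel="noopener">hacked by a collective calling themselves '
    '&quot;Team Appunity&quot;</a>. The attack exposed over 28,000 usernames.'
)
print(strip_tags(sample))
```

Like Beautiful Soup, this decodes &quot; into a literal quote character rather than leaving the entity in place.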

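With the string from the question, the Beautiful Soup call should return the text with the tags removed (the parser also decodes &quot; into a literal quote character): In December 2011, Norway's largest online sex shop hemmelig.com was hacked by a collective calling themselves "Team Appunity". The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.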
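For a single, simple fragment like this one, the regex attempt from the question can also be repaired by making the pattern stop at the first closing bracket. The following is a sketch under that assumption, not a general HTML parser: it will misbehave on markup with a > inside attribute values, comments, or script blocks. The decode_entities parameter is a hypothetical addition for illustration.

```python
import html
import re

# [^>]+ cannot run past the first closing bracket, unlike the greedy .+
TAG_RE = re.compile(r"<[^>]+>")


def get_unlinked_description(descrip, decode_entities=False):
    """Remove tags but keep the text between them.

    decode_entities is a hypothetical option: when True, entities such as
    &quot; are converted to their characters with html.unescape.
    """
    text = TAG_RE.sub("", descrip)
    return html.unescape(text) if decode_entities else text


# Shortened sample based on the string in the question
sample = (
    'hemmelig.com was <a href="http://www.dazzlepod.com/hemmelig/?page=93" '
    'target="_blank" rel="noopener">hacked by a collective calling themselves '
    '&quot;Team Appunity&quot;</a>. The attack exposed over 28,000 usernames.'
)
print(get_unlinked_description(sample))
print(get_unlinked_description(sample, decode_entities=True))
```

With the default decode_entities=False the &quot; entities are left untouched, which matches the desired output shown in the question.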
