I have this string:
In December 2011, Norway's largest online sex shop hemmelig.com was <a href="http://www.dazzlepod.com/hemmelig/?page=93" target="_blank" rel="noopener">hacked by a collective calling themselves "Team Appunity"</a>. The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.
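(Don't ask.)
The string contains an HREF link pointing to the site itself. What I need is to keep the text between the <a href=""></a> tags and drop the tags themselves. So the final result should look like this: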
In December 2011, Norway's largest online sex shop hemmelig.com was hacked by a collective calling themselves "Team Appunity". The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.
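So far, all I have managed is to match the whole tag with a regular expression and replace it with nothing:
import re

def get_unlinked_description(descrip):
    # Match everything from a "<" up to a ">" and replace it with nothing
    html_tag_regex = re.compile(r"<.+>", re.I)
    return html_tag_regex.sub("", descrip)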
But, as you would expect, the output drops the entire link, including its text:
In December 2011, Norway's largest online sex shop hemmelig.com was . The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes
How can I keep the text between the tags while removing the tags themselves, instead of deleting the whole link?
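For reference, the behaviour described above comes from the greedy `.+`, which runs from the first "<" to the last ">" on the line and so swallows the link text together with both tags. A minimal sketch of a narrower pattern follows; the shortened `sample` string and the `TAG_RE` name are illustrative assumptions, and the pattern assumes no attribute value contains a ">" character.

```python
import re

# Assumption: tags never contain ">" inside an attribute value.
# [^>]+ stops at the first ">", so each tag is matched on its own.
TAG_RE = re.compile(r"<[^>]+>")

def get_unlinked_description(descrip):
    """Remove tags but keep the text between them."""
    return TAG_RE.sub("", descrip)

# Shortened, illustrative version of the string from the question
sample = (
    'hemmelig.com was <a href="http://www.dazzlepod.com/hemmelig/?page=93" '
    'target="_blank">hacked by a collective</a>. The attack exposed more.'
)

# Greedy pattern from the question: removes the link text too
greedy_result = re.sub(r"<.+>", "", sample)

# Narrow pattern: only the tags themselves are removed
narrow_result = get_unlinked_description(sample)

print(greedy_result)
print(narrow_result)
```

A regular expression like this is only reasonable for small, predictable snippets; for arbitrary HTML a real parser is the safer choice.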
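You may be looking for Beautiful Soup.
As for your implementation: parse the string with Beautiful Soup, where html_doc is the string or document to parse and 'html.parser' is the name of the parser to use. The result should be: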
In December 2011, Norway's largest online sex shop hemmelig.com was hacked by a collective calling themselves "Team Appunity". The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.
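The answer's own snippet did not survive on this page, so the following is only a sketch of what such a call commonly looks like, not the answerer's exact code. It assumes the third-party `beautifulsoup4` package is installed; `html_doc` and `'html.parser'` come from the answer, while the use of `get_text()` and the shortened sample string are assumptions made for illustration.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Shortened version of the string from the question (assumption for brevity)
html_doc = (
    "In December 2011, Norway's largest online sex shop hemmelig.com was "
    '<a href="http://www.dazzlepod.com/hemmelig/?page=93" target="_blank" '
    'rel="noopener">hacked by a collective calling themselves '
    '"Team Appunity"</a>. The attack exposed over 28,000 usernames.'
)

def get_unlinked_description(descrip):
    """Parse the fragment and return its text with all tags removed."""
    # 'html.parser' selects Python's built-in HTML parser as the backend
    soup = BeautifulSoup(descrip, "html.parser")
    # get_text() concatenates every text node, dropping the markup
    return soup.get_text()

result = get_unlinked_description(html_doc)
print(result)
```

Because the parser understands tag boundaries, this approach does not depend on how the attributes inside the link are written.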