Decoding an HTML-encoded string in Python

Posted 2024-09-30 18:19:29


I have the following string...

"Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."

I want to turn it into this string...

Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.

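This is fairly standard HTML encoding, and for the life of me I can't figure out how to convert it in Python.

I found this: GitHub

It comes very close to working, but instead of the apostrophe it outputs some garbled, non-Unicode character.

Here is a sample of the output from the GitHub script...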

Scam, hoax, or the real deal, heâs gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
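Output like "heâs" is what one would typically see when the UTF-8 bytes of the curly apostrophe (U+2019) are interpreted as Latin-1. The sketch below reproduces that effect in Python 3. It assumes this is what happened in the GitHub script, which is not shown here, so treat it as an illustration of the symptom rather than a diagnosis of that script.

```python
# The correctly decoded text contains U+2019 (right single quotation mark).
text = "he\u2019s"

# Encoded as UTF-8, the apostrophe becomes three bytes: E2 80 99.
raw = text.encode("utf-8")

# Reading those bytes as Latin-1 yields "â" followed by two invisible
# control characters, which is why the output looks like "heâs".
garbled = raw.decode("latin-1")
print(repr(garbled))

# If the garbling happened exactly this way, reversing the two steps repairs it.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

If this is the cause, the entity decoding itself worked and only the output encoding needs fixing.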


1 answer
User
Answer 1 · Posted 2024-09-30 18:19:29

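What you are trying to do is called "HTML entity decoding", and it is covered in many past Stack Overflow questions.

Here is a code snippet that uses the Beautiful Soup HTML parsing library to decode your example: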

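#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Written for Python 2 with BeautifulSoup 3 (the old `BeautifulSoup` package).
from BeautifulSoup import BeautifulSoup

# Input with the apostrophe written as an HTML character reference.
string = "Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process."
# convertEntities tells BeautifulSoup 3 to turn HTML entities into Unicode characters.
s = BeautifulSoup(string, convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0]
print s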

The output is:

Scam, hoax, or the real deal, he’s gonna work his way to the bottom of the sordid tale, and hopefully end up with an arcade game in the process.
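The snippet above uses Python 2 syntax and the older BeautifulSoup 3 package. On Python 3.4 or later, the standard library's `html.unescape` can do the same decoding without a third-party parser. Here is a minimal sketch; the helper name `decode_entities` is hypothetical, and the exact entity in the input (`&#8217;` versus `&rsquo;`) is an assumption, so both forms are shown.

```python
import html


def decode_entities(text: str) -> str:
    """Replace HTML character references in text with the characters they name."""
    return html.unescape(text)


# Assumed inputs: the same apostrophe written as a numeric and as a named reference.
numeric = "Scam, hoax, or the real deal, he&#8217;s gonna work his way to the bottom of the sordid tale."
named = "Scam, hoax, or the real deal, he&rsquo;s gonna work his way to the bottom of the sordid tale."

# Both calls produce the same text, with U+2019 in place of the reference.
print(decode_entities(numeric))
print(decode_entities(named))
```

This only decodes character references. If the input also contains tags that should be stripped, an HTML parser such as Beautiful Soup is still the better fit.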
