Replacing HTML codes in Python

Posted 2024-10-04 07:36:21

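I'm using regular expressions to parse a website's source code and display news headlines in a Tkinter window. I've been told that parsing HTML with regex isn't the best idea, but unfortunately there's no time to change that now.

I can't seem to replace the HTML codes for special characters such as the apostrophe (').

Currently I have the following: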

union_url = 'http://www.news.com.au/sport/rugby'

def union():
    union_string = urlopen(union_url).read()
    union_string.replace("&#8217;", "'")
    union_headline = re.findall('(?:sport/rugby/.*) >(.*)<', union_string)
    union_headline_label= Label(union_window, text = union_headline[0], font=('Times',20,'bold'),  bg = 'White', width = 85, height = 3, wraplength = 500)

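This doesn't remove the HTML character codes. For example, the headline is still printed with the raw code (such as `&#8217;`) where the apostrophe should be.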

I've tried to find an answer, but haven't had any luck. Any help is much appreciated.

1 Answer

Answer #1 · Posted 2024-10-04 07:36:21

You can use the "callable" feature of re.sub() to unescape (or remove) anything that is escaped:

>>> import re
>>> def htmlUnescape(m):
...     # &#NNNN; is a decimal reference, so parse it in base 10 (not 16)
...     return unichr(int(m.group(1)))
...
>>> re.sub(r'&#(\d+);', htmlUnescape, "This is something &#8217; with an HTML-escaped character in it.")
u'This is something \u2019 with an HTML-escaped character in it.'
>>>
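
If you are on Python 3, where `unichr` no longer exists, the same callable idea might look roughly like the sketch below. Treat it as an illustration rather than a drop-in fix: `unescape_numeric` and the sample `headline` text are made-up names and data, not anything from the site in the question. It also shows two related points: `str.replace` returns a new string, so its result has to be assigned (the question's code discards it), and the standard library's `html.unescape` already handles both numeric and named references.

```python
import html
import re


def unescape_numeric(text):
    """Replace decimal (&#8217;) and hex (&#x2019;) character references.

    Hypothetical helper for illustration; named entities such as &amp;
    are deliberately left untouched.
    """
    def _replace(match):
        digits = match.group(1)
        if digits[0] in "xX":
            # Hexadecimal form: &#x2019;
            return chr(int(digits[1:], 16))
        # Decimal form: &#8217;
        return chr(int(digits))

    return re.sub(r"&#(\d+|[xX][0-9a-fA-F]+);", _replace, text)


# Hypothetical sample text, standing in for a scraped headline.
headline = "Wallabies&#8217; coach names squad &amp; captain"

# str.replace does not modify the string in place; keep the return value.
fixed = headline.replace("&#8217;", "'")

print(fixed)                       # only the one hard-coded code is replaced
print(unescape_numeric(headline))  # numeric references only
print(html.unescape(headline))     # numeric and named references
```

For most cases `html.unescape` alone is probably enough. The regex version is mainly useful if you want to control which references get converted.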
