Scraping HTML with Python regex

<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td> </tr> <tr valign="top"> <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td> </tr> <tr valign="top"> <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td> </tr> <tr valign="top"> <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td> </tr> <tr valign="top"> <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td> </tr> <tr valign="top"> <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td> </tr> <tr valign="top"> <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160200', 'Spirituosen zur Verarbeitung in der Confiserie')">Spirituosen zur Verarbeitung in der Confiserie</a></td> </tr> <tr valign="top"> <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td> </tr> <tr valign="top"> <td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td> </tr> <tr valign="top"> <td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td> </tr> </tbody></table> </td> </tr>

2 answers

User

#1 · edited 2024-09-26 04:49:42

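Try this:

import re
f = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")

Called as f.findall(text), it returns one (id, name) tuple per sendForm call, for example ('160500', 'Fr%C3%BCchte in Alkohol'). Matching each quoted argument with [^']* keeps a comma inside a name, such as 'Spirituosen, allgemein', from splitting the match.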
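A minimal sketch of running such a pattern over the question's markup. The two rows are copied from the question and shortened; the variable names and the quote-aware `[^']*` form of the pattern are assumptions made for illustration, not necessarily what the original answer printed.

```python
import re

# Two anchors taken from the question's HTML, trimmed to the relevant part.
html = (
    "<a href=\"javascript:sendForm('showSubGroups', '160500', "
    "'Fr%C3%BCchte in Alkohol')\">Früchte in Alkohol</a> "
    "<a href=\"javascript:sendForm('showSubGroups', '160100', "
    "'Spirituosen, allgemein')\">Spirituosen, allgemein</a>"
)

# Skip the first quoted argument, capture the second (id) and third (name).
pattern = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")

# findall returns one tuple per match when the pattern has several groups.
pairs = pattern.findall(html)
print(pairs)
# [('160500', 'Fr%C3%BCchte in Alkohol'), ('160100', 'Spirituosen, allgemein')]
```

Because each argument is matched up to its closing quote, the comma in 'Spirituosen, allgemein' stays inside the captured name.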

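As for decoding %C3%BC (the 'ü'), it is UTF-8 that has been percent-encoded: each byte is written as '%' followed by two hex digits. Replace each '%' with '\x' and the bytes decode as UTF-8 (Python 2 session):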

In [39]: '\xC3\xBC'.decode('utf-8')
Out[39]: u'\xfc'

U+00FC is the Unicode code point for ü.
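In Python 3 the same decoding can be sketched with the standard library's `urllib.parse.unquote`, which reverses percent-encoding and treats the resulting bytes as UTF-8 by default. The variable names here are illustrative only.

```python
from urllib.parse import unquote

# The third sendForm argument exactly as it appears in the href.
encoded = "Fr%C3%BCchte in Alkohol"

# unquote turns %C3%BC back into the bytes C3 BC and decodes them as UTF-8.
decoded = unquote(encoded)
print(decoded)  # Früchte in Alkohol

# The third character is U+00FC, the code point for ü.
print(hex(ord(decoded[2])))  # 0xfc
```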

User

#2 · edited 2024-09-26 04:49:42

Beautiful Soup is a good library for parsing HTML.

Once you have extracted the href values from the HTML, applying a regex to them should be easy.
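A rough sketch of that two-step approach: parse the HTML to collect the href values, then apply a regex to each one. To stay runnable without third-party packages, it swaps in the standard library's `html.parser` for Beautiful Soup; with Beautiful Soup, `soup.find_all('a')` would play the role of the collector. The `HrefCollector` class and the pattern are hypothetical names written for this example.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Two cells copied from the question's HTML.
html = """
<td><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
<td><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
"""

collector = HrefCollector()
collector.feed(html)

# Capture the id and the name from each sendForm(...) call.
pattern = re.compile(r"sendForm\('[^']*', '([^']*)', '([^']*)'\)")

groups = []
for href in collector.hrefs:
    match = pattern.search(href)
    if match:
        groups.append(match.groups())

print(groups)
# [('160400', 'Rumtopf'), ('160300', 'Spirituosen (Bio)')]
```

Running the regex on the isolated href rather than on the whole page means the pattern never has to cope with surrounding tags or link text.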
