Answers to the question: Scraping HTML with Python regex



I have a problem with a regex in Python. I have some HTML pages that contain information I need. When the pages were saved, the encoding charset was some ISO variant, and all the typical German letters were stored percent-encoded, for example "Fr%C3%BCchte" for Früchte and so on. The HTML is very badly structured, so the only reasonable approach seems to be regex.

This is the regex I have in Python: <pre><code>re.compile('<a\s+href="javascript.*?$\'(\w+).*?\s.(\d+.+\d+).*?(.*)\'$\">') </code></pre>

Unfortunately it does not give me exactly what I want, because the encoded words are only partially extracted. For example, the result is: ^{pr2}$

Maybe I am just tired, but I cannot see where the mistake is. Here is the HTML: <pre><code><td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')">Früchte in Alkohol</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr>
<tr valign="top">
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160400', 'Rumtopf')">Rumtopf</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr>
<tr valign="top">
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160300', 'Spirituosen (Bio)')">Spirituosen (Bio)</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr>
<tr valign="top">
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160200', 'Spirituosen zur Verarbeitung in der Confiserie')">Spirituosen zur Verarbeitung in der Confiserie</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr>
<tr valign="top">
<td colspan="3" width="100%"><a href="javascript:sendForm('showSubGroups', '160100', 'Spirituosen, allgemein')">Spirituosen, allgemein</a></td>
</tr>
<tr valign="top">
<td colspan="3"><img src="NoName_Time_200843_93448%20-Dateien/pix.gif" height="5" width="1"></td>
</tr>
</tbody></table>
</td>
</tr> </code></pre>
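One likely cause of the partial extraction is the greedy `.+` inside `(\d+.+\d+)`: percent escapes such as `%C3` contain digits, so that group can run past the ID and into the label. A minimal sketch of one possible fix follows, assuming every link has the `sendForm('...', '...', '...')` shape shown in the HTML above and that the labels are percent-encoded UTF-8. The names `SEND_FORM_RE` and `extract_links` are made up for this illustration.

```python
import re
from urllib.parse import unquote

# Assumption: each link looks like
#   <a href="javascript:sendForm('action', 'id', 'label')">
# [^']* stops at the next single quote, so no group can spill into the next one.
SEND_FORM_RE = re.compile(
    r"""<a\s+href="javascript:sendForm\(\s*
        '([^']*)'\s*,\s*    # first argument: action name
        '([^']*)'\s*,\s*    # second argument: group id
        '([^']*)'\s*        # third argument: percent-encoded label
        \)">
    """,
    re.VERBOSE,
)


def extract_links(html):
    """Return (action, group_id, label) tuples with the label percent-decoded."""
    return [
        (action, group_id, unquote(label))  # unquote decodes as UTF-8 by default
        for action, group_id, label in SEND_FORM_RE.findall(html)
    ]


sample = (
    '<td colspan="3" width="100%"><a href="javascript:sendForm('
    "'showSubGroups', '160500', 'Fr%C3%BCchte in Alkohol')\">"
    "Früchte in Alkohol</a></td>"
)

for action, group_id, label in extract_links(sample):
    print(action, group_id, label)
```

Negated character classes are used here instead of lazy `.*?` because they make the boundaries of each quoted argument explicit. If the labels could ever contain an escaped single quote, this pattern would need adjusting.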


1 answer

Anonymous, 1 day ago

