How do I extract the URL from this HTML markup?

<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' class='txtsmall noTextDecoration'> See all 136 customer reviews </a>

3 Answers

User

Answer 1 · edited 2024-07-05 11:20:53

You don't need to match the unnecessary parts such as id=... and href=...; try the following:

regex = r"http://[^'\s]+"  # stop before the closing quote so it is not part of the match

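User

Answer 2 · edited 2024-07-05 11:20:53

First, why doesn't your regular expression work? In the HTML the attributes are enclosed in single quotes, whereas your regular expression uses double quotes. You only need to care about the href attribute. Try href=['"](.+?)['"] as the regex; it works even better with the ignore-case switch turned on.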
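As a rough illustration of that suggestion, here is a minimal Python sketch that applies the pattern to the markup from the question. The variable names (html, HREF_RE, url) are made up for this example, and the snippet assumes the input contains a single anchor tag.

```python
import re

# The markup from the question, split across lines only for readability.
html = (
    "<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/"
    "product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary"
    "?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' "
    "class='txtsmall noTextDecoration'> See all 136 customer reviews </a>"
)

# Accept either quote style around the value; IGNORECASE also matches HREF=.
# (HREF_RE is a hypothetical name chosen for this sketch.)
HREF_RE = re.compile(r"""href=['"](.+?)['"]""", re.IGNORECASE)

match = HREF_RE.search(html)
url = match.group(1) if match else None  # None when no href is present
print(url)
```

The non-greedy (.+?) stops at the first closing quote, so the class attribute that follows is not swallowed into the match.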

But again, parsing HTML with a regex is a very bad idea. Please read through this

User

Answer 3 · edited 2024-07-05 11:20:53

You could try:

import re
# html holds the markup from the question (renamed from input, which shadows a built-in)
(_, url), = re.findall(r'href=([\'"]*)(\S+)\1', html)
print(url)

Personally, though, I would rather use an HTML parsing library such as BeautifulSoup for a task like this.
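To give a feel for the parser-based approach, here is a minimal sketch. It deliberately swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything; the class name HrefCollector and the variable names are assumptions made for this example, not part of any library.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# The markup from the question, split across lines only for readability.
html = (
    "<a id='revSAR' href='http://www.amazon.com/Altec-Lansing-inMotion-Mobile-Speaker/"
    "product-reviews/B000EDKP8U/ref=cm_cr_dp_see_all_summary"
    "?ie=UTF8&showViewpoints=1&sortBy=byRankDescending' "
    "class='txtsmall noTextDecoration'> See all 136 customer reviews </a>"
)

collector = HrefCollector()
collector.feed(html)
collector.close()
print(collector.hrefs)
```

Unlike the regex answers, this does not depend on which quote style the attribute uses or on the order of the attributes inside the tag.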
