Python BeautifulSoup: extracting specific URLs

<a href="http://www.iwashere.com/washere.html">next</a> ... <a href="http://www.heelo.com/hello.html">next</a> ... <a href="http://www.iwashere.com/wasnot.html">next</a> ...

2 answers

User

#1 · edited 2024-06-28 15:40:31

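You can match on several criteria at once, including a regular expression applied to an attribute value:

import re
# The leading ^ anchors the pattern so only hrefs that start with this prefix match.
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))

which matches (for example):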

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

That is, any <a> tag with an href attribute whose value starts with the string http://www.iwashere.com/.

You can loop over the results and pick out just the href attribute:

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
... 
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
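
The snippets above assume a soup object already exists. As a minimal, self-contained sketch, the following builds one from the markup in the question (with the "..." separators left out) and collects the matching href values into a list. The names html, pattern, and urls are illustrative, and the built-in "html.parser" backend is an assumption; any parser supported by BeautifulSoup should behave the same here.

```python
import re
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<a href="http://www.heelo.com/hello.html">next</a>
<a href="http://www.iwashere.com/wasnot.html">next</a>
"""

soup = BeautifulSoup(html, "html.parser")

# The ^ anchor restricts matches to hrefs that start with the prefix.
pattern = re.compile(r"^http://www\.iwashere\.com/")

# Collect only the href values of the matching <a> tags.
urls = [a["href"] for a in soup.find_all("a", href=pattern)]
print(urls)
```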

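To match all relative paths instead, use a negative lookahead assertion that tests whether the value starts with a scheme (e.g. http: or mailto:) or with a double slash (//hostname/path). Any value that starts with neither is treated as a relative path: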

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
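
To see what that pattern accepts and rejects, it can be tried on plain strings with the re module alone. This is only a sketch: the sample href values below are made up for illustration, and RELATIVE is a hypothetical name for the compiled pattern. Note that a root-relative path such as /from/root.html also passes, since it has neither a scheme nor a leading double slash.

```python
import re

# Same pattern as above: reject values that start with "scheme:" or "//".
RELATIVE = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Illustrative href values (not from the original question).
samples = [
    "page.html",
    "../up/one.html",
    "/from/root.html",
    "http://www.iwashere.com/washere.html",
    "mailto:someone@example.com",
    "//hostname/path",
]

# Keep only the values the pattern treats as relative.
relative = [s for s in samples if RELATIVE.search(s)]
print(relative)
```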

User

#2 · edited 2024-06-28 15:40:31

If you are using BeautifulSoup 4.0.0 or later, you can use a CSS attribute selector instead:

soup.select('a[href^="http://www.iwashere.com/"]')
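
As a rough end-to-end sketch of the selector approach, the example below runs it against the markup from the question. The [href^="..."] form is the standard CSS "attribute value starts with" selector; the variable names and the choice of "html.parser" are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question.
html = """
<a href="http://www.iwashere.com/washere.html">next</a>
<a href="http://www.heelo.com/hello.html">next</a>
<a href="http://www.iwashere.com/wasnot.html">next</a>
"""

soup = BeautifulSoup(html, "html.parser")

# ^= selects <a> tags whose href attribute starts with the given prefix.
links = soup.select('a[href^="http://www.iwashere.com/"]')
hrefs = [a["href"] for a in links]
print(hrefs)
```

Compared with the regular-expression version, no escaping of the dots is needed, because the selector compares the prefix literally.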
