How can I find all matching strings with bs4?

1 answer

User

#1 · Posted 2024-09-30 01:20:36

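Personally, I think this is one of the rare cases where applying a regular expression to the complete document, without an HTML parser, is both the simplest approach and a perfectly good one. And since you are really only looking for URLs, and the regular expression does not match any HTML tags, the points made in this thread do not apply in this case: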

In [1]: data = """
   ...: <meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
   ...: <img style="width:100%" id="box_img1" alt="box1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">`
   ...: <img style="width:100%" id="box_img2" alt="box2" src="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518">
   ...: """

In [2]: import re

In [3]: pattern = re.compile(r"https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+\?[0-9]+")

In [4]: pattern.findall(data)
Out[4]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
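
One caveat with running a regex over raw markup: a greedy `.+` will merge two URLs that sit on the same line into a single match. The sketch below shows this with a made-up one-line sample (the `html` string and the `a_A.jpg` / `b_A.png` file names are hypothetical, not from the question) and one possible tighter pattern. Treat it as an illustration, not as the pattern the question requires.

```python
import re

# Hypothetical sample: two matching URLs inside ONE tag on ONE line.
html = (
    '<img src="https://smtgvs.weathernews.jp/s/topics/img/201807/a_A.jpg?1" '
    'data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/b_A.png?2">'
)

# Greedy `.+` runs to the last "?digits" on the line.
loose = re.compile(r"https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+\?[0-9]+")

# Tighter variant: the file-name part may not contain quotes, whitespace or "?".
strict = re.compile(r'https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/[^"\s?]+\?[0-9]+')

loose_matches = loose.findall(html)
strict_matches = strict.findall(html)

print(len(loose_matches))   # one merged match
print(strict_matches)       # two separate URLs
```

With the data from the question each URL is on its own line, so the original pattern is enough there.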

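If you are interested in how to apply a regular expression pattern to multiple attributes in BeautifulSoup, it could be something along these lines (not pretty, I know):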

[code block missing from the source]

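Here we basically iterate over all attributes of all elements and check each one for a pattern match. Then, once we have all the matching tags, we iterate over the results and take the value of the matching attribute. What I really don't like is that the regular expression check is applied twice: once when finding the tags, and again when picking the desired attribute out of each matching tag.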
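
The two steps just described could be sketched roughly as follows. This is a reconstruction under assumptions, not the answer's original code: the helper name `matching_values` is made up, the sample `data` is shortened, and the `isinstance` check is added because BeautifulSoup returns multi-valued attributes such as `class` as lists, which `pattern.search` cannot take.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the sample markup from the question.
data = """
<meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
<img id="box_img1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">
"""

pattern = re.compile(r"https://smtgvs\.weathernews\.jp/s/topics/img/[0-9]+/.+\?[0-9]+")


def matching_values(tag):
    """Return the string attribute values of `tag` that match `pattern` (hypothetical helper)."""
    # `class` and similar attributes are lists in bs4, so only plain strings are tested.
    return [v for v in tag.attrs.values() if isinstance(v, str) and pattern.search(v)]


soup = BeautifulSoup(data, "html.parser")

# Step 1: find every tag with at least one matching attribute value.
tags = soup.find_all(lambda tag: bool(matching_values(tag)))

# Step 2: take the first matching value from each of those tags
# (this is the second regex pass mentioned above).
urls = [matching_values(tag)[0] for tag in tags]
print(urls)
```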
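
lxml and its XPath support let you work with attributes directly, but lxml supports only XPath 1.0, which has no regular expression support. You can do something like this: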

In [10]: from lxml.html import fromstring

In [11]: root = fromstring(data)

In [12]: root.xpath('.//@*[contains(., "smtgvs.weathernews.jp") and contains(., "?")]')
Out[12]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']

This is not 100% equivalent to what you had and may produce false positives, but you can take it further and add more substring checks if needed.
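
As one possible way to tighten the XPath, the standard XPath 1.0 function `starts-with()` can anchor the URL prefix in addition to the `contains()` check. The sketch below assumes a shortened copy of the question's markup; the exact set of checks is a guess at what "more substring checks" might look like.

```python
from lxml.html import fromstring

# Shortened version of the sample markup from the question.
data = """
<meta name="twitter:image" content="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869">
<img id="box_img1" src="https://smtgvs.weathernews.jp/s/topics/img/dummy.png" class="lazy" data-original="https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797">
"""

root = fromstring(data)

# Select attribute nodes whose value starts with the image URL prefix
# and also contains a "?" (which excludes the dummy.png placeholder).
values = root.xpath(
    './/@*[starts-with(., "https://smtgvs.weathernews.jp/s/topics/img/")'
    ' and contains(., "?")]'
)

urls = [str(v) for v in values]
print(urls)
```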

Alternatively, you can grab all attributes of all elements and filter them with the regular expression you already have:

In [14]: [attr for attr in root.xpath("//@*") if pattern.search(attr)]
Out[14]: 
['https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_sns_img_A.jpg?1532940869',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img1_A.png?1503665797',
 'https://smtgvs.weathernews.jp/s/topics/img/201807/201807300285_box_img2_A.jpg?1503378518']
