Trouble extracting img src from multiple URLs using beautifulsoup4 on Python 3


I'm trying to build a scraper that iterates over a list of product-page URLs, parses each page, and extracts the img src URLs from the photo reel. These sit in "li" elements under a "ul" element with the unique class "bxslider". I was simply using <code>soup.findAll('img'['src'])</code>, but the site has plenty of other img src values that I don't need. I also need to exclude any "li" tag with the class "bx-clone". I'm using Selenium, BeautifulSoup, and pandas.

The HTML I need to scrape:
<pre><code><ul class="bxslider" style="width: 1315%; position: relative; left: -410px;">
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/06/a7/06a700dd-8350-4932-88e9-c941e73e0def.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8c/92/8c92207d-c422-4d94-894c-911a5330e227.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/e0/22/e0224832-75f5-432a-a223-177ff7ffd03c.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8a/e1/8ae1e8d4-76a1-4161-9b17-b7a97e1779fc.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/fc/d5/fcd5a35b-8fb5-463e-9a47-804850f17825.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/98/2e/982ea3c5-ce28-49c8-bef5-b0f85bd99807.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/23/e1/23e153df-75af-4f1b-a4dd-3e0fb1e5a28f.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/94/02/940268f9-04ed-4650-bd9f-01b113b5059b.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
</ul>
</code></pre>

The code I wrote to scrape the img srcs and append them to the imgs list:
<pre><code>from testurls import tdurls as urllist

# make an empty list to scrape into
imgs = []
print('debug mode')
for url in urllist:
    driver.get(url)
    html = driver.page_source
    print('Url loop set up, moving deeper...')
    soup = bs(html, 'html.parser')
    bxslider = soup.find('ul', {'class':'bxslider'})
    for li in bxslider:
        print('Printing bxslider...')
        print(bxslider)
        try:
            bxs = bxslider.findChildren('li')
            print('Printing li children...')
            print(li)
            for li in bxs:
                h = li.findAll('img'['src'])
                imgs.append(h)
                print('Found children...')
                print(h)
        except:
            bxslider.findChildren("li", { "class" : "bx-clone" })
print('Alright... how did we do?')
print(imgs)
imgdf = pd.DataFrame({'imgs':imgs})
ndf = imgdf.append(urllist, ignore_index=True)
print(ndf)
ndf.to_csv('C:/Users/niall/.spyder-py3/didthisworklol.csv', index=False, encoding='utf-8')
</code></pre>

I'm totally lost, and fairly new to Python and to all the modules I'm using. I need the image links in a single cell, alongside the corresponding product-page URL, so that a row looks like this, with only one comma as a delimiter: <code>productpagelink, image link | image link | image link</code>

I included the pandas part at the end because, although it looks like my imgs list is being appended to correctly, I don't want to leave myself more room for error, and I figure there may be an obvious tweak. If I've left out anything you need in order to help, let me know and I'll edit. Thanks all!

Edit: I can't share the URLs, since they sit behind a password-protected site; Selenium loads and iterates through each URL just fine, though.
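A likely reason the loop collects nothing useful: in Python, <code>'img'['src']</code> tries to index the string <code>'img'</code> with another string, which raises <code>TypeError</code> before <code>findAll</code> is ever called, and the bare <code>except:</code> then hides the error. The minimal sketch below reproduces that and shows the usual way to read the attribute. The sample markup and the <code>example.invalid</code> URL are made up for illustration and are not from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one product page's slider markup.
sample = '<ul class="bxslider"><li><img src="//example.invalid/a.jpg"></li></ul>'
soup = BeautifulSoup(sample, "html.parser")

error = None
try:
    # 'img'['src'] is evaluated first and fails: str indices must be integers.
    soup.find_all('img'['src'])
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# Select the tags first, then read the attribute from each tag.
srcs = [img["src"] for img in soup.find_all("img", src=True)]
print(srcs)
```

Catching a specific exception (or none at all while debugging) makes this kind of mistake visible right away.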


1 Answer

Anonymous, 1 day ago

Probably an unpopular approach, but there's always the <code>re</code> module. It's a bit more work, but a lot more fun too.
<pre><code>import re

html = """
<ul class="bxslider" style="width: 1315%; position: relative; left: -410px;">
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/06/a7/06a700dd-8350-4932-88e9-c941e73e0def.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8c/92/8c92207d-c422-4d94-894c-911a5330e227.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/e0/22/e0224832-75f5-432a-a223-177ff7ffd03c.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8a/e1/8ae1e8d4-76a1-4161-9b17-b7a97e1779fc.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/fc/d5/fcd5a35b-8fb5-463e-9a47-804850f17825.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/98/2e/982ea3c5-ce28-49c8-bef5-b0f85bd99807.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/23/e1/23e153df-75af-4f1b-a4dd-3e0fb1e5a28f.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/94/02/940268f9-04ed-4650-bd9f-01b113b5059b.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
</ul>
"""

# Retrieve only list elements based on given <ul> class
list_section_pattern = r'(?:<ul class="bxslider" .*?>)(?P<target>.*?)(?:</ul>)'
p = re.compile(list_section_pattern, flags=re.DOTALL | re.MULTILINE)
list_section = p.search(html).group("target")

# Match pattern to get all URLs; This is pretty straightforward.
href_pattern = r'<img src="(.*?)">'
p = re.compile(href_pattern)

# This should be a list of parsed URLs
urls = p.findall(list_section)


def get_root_url(url_path):
    """Split by forward-slash; Keep everything except image filename."""
    return "/".join(url_path.split("/")[:-1])


# Create a dictionary of url roots and image url lists.
url_dict = {}
for url in urls:
    root = get_root_url(url)
    if root not in url_dict:
        url_dict[root] = [url]
    else:
        url_dict[root].append(url)

# Output string for csv file
csv_string = ""
for k, v in url_dict.items():
    # .join() elements with vertical bar.
    tmp = " | ".join(v)
    csv_string += f"{k}, {tmp}\n"  # Add a newline character

with open(r"C:\Users\niall\.spyder-py3\didthisworklol.csv", "w", encoding="utf-8") as csvf:
    csvf.write(csv_string)
</code></pre>

Result (viewed in Excel): <a href="https://i.stack.imgur.com/M8wKF.jpg" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/M8wKF.jpg" alt="enter image description here"/></a>

Edit: here are some BeautifulSoup operations:
<pre><code>from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# This should return all unordered list sections with the class 'bxslider'
for chunk in soup.find_all("ul", class_="bxslider"):
    # List comprehension to get all the urls from the img tag.
    urls = [img.attrs["src"] for img in chunk.find_all("img")]
</code></pre>

You can then write the results to a file using the same approach as above.
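Neither snippet in the answer skips the "bx-clone" items or pairs the images with their product-page URL, both of which the question asks for. Here is one possible sketch of that, using BeautifulSoup and the standard <code>csv</code> module in place of pandas. The helper names (<code>slider_image_urls</code>, <code>build_rows</code>, <code>write_rows</code>), the sample markup, and the <code>example.invalid</code> URLs are all hypothetical. In the asker's loop, each <code>html</code> value would presumably come from <code>driver.page_source</code>.

```python
import csv
import io

from bs4 import BeautifulSoup


def slider_image_urls(html):
    """Return img src values inside ul.bxslider, skipping li.bx-clone items."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for ul in soup.find_all("ul", class_="bxslider"):
        for li in ul.find_all("li"):
            # html.parser exposes "class" as a list of class names.
            if "bx-clone" in li.get("class", []):
                continue
            for img in li.find_all("img", src=True):
                urls.append(img["src"])
    return urls


def build_rows(pages):
    """Turn (page_url, html) pairs into [page_url, 'img | img | ...'] rows."""
    return [[url, " | ".join(slider_image_urls(html))] for url, html in pages]


def write_rows(rows, fileobj):
    """Write the rows as CSV: one comma between the page URL and the image cell."""
    csv.writer(fileobj).writerows(rows)


# Hypothetical page: two real slides, two clones, one unrelated image.
sample_html = """
<ul class="bxslider">
  <li class="bx-clone"><img src="//cdn.example.invalid/b.jpg"></li>
  <li><img src="//cdn.example.invalid/a.jpg"></li>
  <li><img src="//cdn.example.invalid/b.jpg"></li>
  <li class="bx-clone"><img src="//cdn.example.invalid/a.jpg"></li>
</ul>
<img src="//cdn.example.invalid/unrelated.jpg">
"""

pages = [("https://shop.example.invalid/product/1", sample_html)]
rows = build_rows(pages)

# An in-memory buffer keeps the sketch self-contained; a real run would
# pass a file opened with open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

Checking the class list on each "li" avoids relying on CSS selector support, and <code>csv.writer</code> quotes a cell automatically if an image URL ever contains a comma.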
