Problem extracting img src from multiple URLs with beautifulsoup4 on Python 3

Posted 2024-10-02 22:28:56

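I'm trying to build a scraper that goes through a list of product page URLs, parses the data, and extracts the img src URLs from a photo reel. The images sit in "li" elements under a "ul" element with the unique class "bxslider". I was simply using soup.findAll('img'['src']), but the site has many other img srcs that I don't need. I also need to exclude any "li" tag with the class "bx-clone". I'm using Selenium, BeautifulSoup, and pandas.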

The HTML I need to scrape:

<ul class="bxslider" style="width: 1315%; position: relative; left: -410px;"><li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/06/a7/06a700dd-8350-4932-88e9-c941e73e0def.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8c/92/8c92207d-c422-4d94-894c-911a5330e227.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/e0/22/e0224832-75f5-432a-a223-177ff7ffd03c.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8a/e1/8ae1e8d4-76a1-4161-9b17-b7a97e1779fc.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/fc/d5/fcd5a35b-8fb5-463e-9a47-804850f17825.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/98/2e/982ea3c5-ce28-49c8-bef5-b0f85bd99807.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/23/e1/23e153df-75af-4f1b-a4dd-3e0fb1e5a28f.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/94/02/940268f9-04ed-4650-bd9f-01b113b5059b.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li></ul>

The code I wrote to scrape the img srcs and append them to the imgs list:

from testurls import tdurls as urllist

# make an empty list to scrape into
imgs = []  
print('debug mode')

for url in urllist:
    driver.get(url)
    html = driver.page_source
    print('Url loop set up, moving deeper...')
    soup = bs(html, 'html.parser')
    bxslider = soup.find('ul', {'class':'bxslider'})
    for li in bxslider:
        print('Printing bxslider...')
        print(bxslider)
        try:
            bxs = bxslider.findChildren('li')
            print('Printing li children...')
            print(li)
            for li in bxs:
                h = li.findAll('img'['src'])
                imgs.append(h)
                print('Found children...')
                print(h)
                
                        
        
        except:        
            bxslider.findChildren("li", { "class" : "bx-clone" })




print('Alright... how did we do?')
print(imgs)                            
imgdf = pd.DataFrame({'imgs':imgs})
ndf = imgdf.append(urllist, ignore_index=True)
print(ndf)
ndf.to_csv('C:/Users/niall/.spyder-py3/didthisworklol.csv', index=False, encoding='utf-8')

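I'm completely lost, and fairly new to Python and to all the modules I'm using. I need these image links in a single cell next to the corresponding product page URL, so that each row looks like this, with only one comma as a delimiter: productpagelink, image link | image link | image link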

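I included the pandas part at the end because, although my imgs list appears to be populated correctly, I didn't want to leave myself more room for error, and I figured there might be an obvious tweak. If I've left out anything you need in order to help, let me know and I'll edit. Thanks, everyone!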

Edit: I can't share the URLs because they sit behind a password-protected site; Selenium loads and iterates through each URL just fine, though.
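A note on the `soup.findAll('img'['src'])` call in the question: the expression `'img'['src']` is evaluated before `findAll` is ever called, and indexing a string with another string raises a `TypeError`. Because the call sits inside a `try` with a bare `except:`, that error would be swallowed silently. A minimal sketch that reproduces just the indexing problem (the variable names are illustrative, not from the original code):

```python
tag_name = "img"

# Indexing a str with another str is not allowed, so this fails before
# any BeautifulSoup method would get a chance to run.
try:
    tag_name["src"]
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Replacing the bare `except:` with a specific exception type, or removing it while debugging, would let this kind of error surface instead of being hidden.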


2 answers

Another approach:

from simplified_scrapy import SimplifiedDoc, utils
html = """
<ul class="bxslider" style="width: 1315%; position: relative; left: -410px;"><li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/06/a7/06a700dd-8350-4932-88e9-c941e73e0def.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8c/92/8c92207d-c422-4d94-894c-911a5330e227.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/e0/22/e0224832-75f5-432a-a223-177ff7ffd03c.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8a/e1/8ae1e8d4-76a1-4161-9b17-b7a97e1779fc.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/fc/d5/fcd5a35b-8fb5-463e-9a47-804850f17825.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/98/2e/982ea3c5-ce28-49c8-bef5-b0f85bd99807.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/23/e1/23e153df-75af-4f1b-a4dd-3e0fb1e5a28f.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/94/02/940268f9-04ed-4650-bd9f-01b113b5059b.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li></ul>
"""
doc = SimplifiedDoc(html)
images = doc.select('ul.bxslider').selects('img').src
rows = [[src] for src in images] # Change [] to [[]]
utils.save2csv('didthisworklol.csv',rows,newline='') # Save data to file

Result:

//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg
//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783
//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg
......

More examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

Probably an unpopular approach, but there's always the re module. It's a bit more work, but a lot more fun, too.

import re

html = """
<ul class="bxslider" style="width: 1315%; position: relative; left: -410px;"><li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/55/bb/55bb9511-676b-4585-8cf2-99af9ba8baca.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/06/a7/06a700dd-8350-4932-88e9-c941e73e0def.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8c/92/8c92207d-c422-4d94-894c-911a5330e227.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/e0/22/e0224832-75f5-432a-a223-177ff7ffd03c.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/8a/e1/8ae1e8d4-76a1-4161-9b17-b7a97e1779fc.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/fc/d5/fcd5a35b-8fb5-463e-9a47-804850f17825.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/98/2e/982ea3c5-ce28-49c8-bef5-b0f85bd99807.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/23/e1/23e153df-75af-4f1b-a4dd-3e0fb1e5a28f.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/94/02/940268f9-04ed-4650-bd9f-01b113b5059b.jpg"></li>
       <li style="float: left; list-style: outside none none; position: relative; width: 410px;"><img src="//d1w0x2adoh4nzy.cloudfront.net/b5/48/b548ce05-1ee1-486b-9f33-ea61625d25ba.jpg"></li>
<li style="float: left; list-style: outside none none; position: relative; width: 410px;" class="bx-clone"><img src="//d1w0x2adoh4nzy.cloudfront.net/50/1f/501f8112-f6a7-4710-bd48-3acb0976e8f3.jpg?timestamp=1600972726783"></li></ul>
"""

# Retrieve only list elements based on given <ul> class
list_section_pattern = r'(?:<ul class="bxslider" .*?>)(?P<target>.*?)(?:</ul>)'
p = re.compile(list_section_pattern, flags = re.DOTALL | re.MULTILINE)
list_section = p.search(html).group("target")



# Match pattern to get all URLs; This is pretty straightforward.
href_pattern = r'<img src="(.*?)">'
p = re.compile(href_pattern)

# This should be a list of parsed URLs
urls = p.findall(list_section)


def get_root_url(url_path):
    """Split by forward-slash; Keep everything except image filename."""
    return "/".join(url_path.split(r"/")[:-1])


# Create a dictionary of url roots and image url lists.
url_dict = {}
for url in urls:
    root = get_root_url(url)
    if not root in url_dict:
        url_dict[root] = [url]
    else:
        url_dict[root].append(url)

# Output string for csv file
csv_string = ""
for k, v in url_dict.items():
    # .join() elements with vertical bar.
    tmp = " | ".join(v)
    csv_string += f"{k}, {tmp}\n" # Add a newline character

with open(r"C:\Users\niall\.spyder-py3\didthisworklol.csv", "w", encoding="utf-8") as csvf:
    csvf.write(csv_string)

Result (viewed in Excel):

[Image: screenshot of the resulting CSV opened in Excel]
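Note that `get_root_url` above groups images by the directory part of each image URL, whereas the question asks for one row per product page URL. A minimal sketch of that row layout using the standard `csv` module, assuming the image srcs have already been collected per page; the `images_by_page` data, the `example.invalid` URLs, and the `rows_for_csv` helper are all made up for illustration:

```python
import csv
import io

# Hypothetical data: product page URL -> image srcs scraped from that page.
images_by_page = {
    "https://example.invalid/product/1": [
        "//example.invalid/a.jpg",
        "//example.invalid/b.jpg",
    ],
    "https://example.invalid/product/2": ["//example.invalid/c.jpg"],
}


def rows_for_csv(images_by_page):
    """Build one [page_url, 'src | src | ...'] row per product page."""
    return [
        [page_url, " | ".join(srcs)]
        for page_url, srcs in images_by_page.items()
    ]


rows = rows_for_csv(images_by_page)

# Write to an in-memory buffer here; a real file handle works the same way.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer could be swapped for `open(path, "w", newline="", encoding="utf-8")`. Letting `csv.writer` build each line also means a field that happens to contain a comma gets quoted rather than splitting the row.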

Edit: here is a BeautifulSoup version:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# This should return all unordered list sections with the class 'bxslider'
for chunk in soup.find_all("ul", class_="bxslider"):

    # List comprehension to get all the urls from the img tag.
    urls = [img.attrs["src"] for img in chunk.find_all("img")]

You can then write the result to a file using the same approach as above.
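The BeautifulSoup snippet above collects every img under the slider, including those inside the "bx-clone" items that the question wants to skip. A minimal sketch of one way to filter them out, assuming bs4 4.7 or later (where `select` handles `:not()` through Soup Sieve); the sample HTML and the `slider_image_srcs` helper are made up for illustration:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the slider markup in the question.
sample_html = """
<ul class="bxslider">
  <li class="bx-clone"><img src="//example.invalid/clone-last.jpg"></li>
  <li><img src="//example.invalid/a.jpg"></li>
  <li><img src="//example.invalid/b.jpg"></li>
  <li class="bx-clone"><img src="//example.invalid/clone-first.jpg"></li>
</ul>
<img src="//example.invalid/unrelated.jpg">
"""


def slider_image_srcs(html):
    """Return img src values inside ul.bxslider, skipping li.bx-clone items."""
    soup = BeautifulSoup(html, "html.parser")
    # Direct li children of the slider that do not carry the bx-clone class,
    # then any img with a src attribute inside them.
    imgs = soup.select("ul.bxslider > li:not(.bx-clone) img[src]")
    return [img["src"] for img in imgs]


srcs = slider_image_srcs(sample_html)
print(srcs)
```

In the question's loop, the same helper could be called on `driver.page_source` for each URL and the result stored alongside that URL, which keeps each product page paired with its own images.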
