Automate the Boring Stuff: image site downloader

I am working on a project from the book "Automate the Boring Stuff". The task is as follows: Image Site Downloader. Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images. You could write a program that works with any photo site that has a search feature.

Here is my code:
<pre><code>import os
import re

import bs4
import requests

# The outerHTML file, which I got by right-clicking the &lt;html&gt; tag
# in 'page source' and copying it
flickrFile = open('flickrHtml.html', encoding="utf8")

# Parse the HTML document
flickrSoup = bs4.BeautifulSoup(flickrFile, 'html.parser')

# categoryElem holds the elements that link to each image page
categoryElem = flickrSoup.select("a[class='overlay']")
# len(categoryElem) == 849
os.makedirs('FlickrImages', exist_ok=True)

hrefRegex = re.compile(r'href.*/"')
jpgRegex = re.compile(r'//live.*\.jpg')

for elem in categoryElem:
    # Regex search for the href
    mo = hrefRegex.search(str(elem))
    if mo is None:
        continue
    imageUrl = mo.group().replace('"', '').replace('href=', '')
    imageUrlFlickr = "https://www.flickr.com" + imageUrl

    # Download the photo page behind the image URL
    res = requests.get(imageUrlFlickr)
    imageSoup = bs4.BeautifulSoup(res.text, 'html.parser')
    picElem = imageSoup.select('div[class="view photo-well-media-scrappy-view requiredToShowOnServer"] img')

    # Regex search for the jpg file in the picElem HTML
    mo = jpgRegex.search(str(picElem))
    if mo is None:
        print('There was a problem: no jpg URL found on %s' % imageUrlFlickr)
        continue
    imageUrlRegex = mo.group()

    res1 = requests.get('https:' + imageUrlRegex)
    try:
        res1.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        continue

    # Save the jpg to my folder
    with open(os.path.join('FlickrImages', os.path.basename(imageUrlRegex)), 'wb') as imageFile:
        for chunk in res1.iter_content(100000):
            imageFile.write(chunk)
</code></pre>
After looking at <a href="https://stackoverflow.com/questions/6364138/how-to-get-fully-computed-html-instead-of-source-html">this question</a>, I figured that in order to download all 4 million results for the picture "Sea", I should copy the whole outerHTML (as described in the answer to that question). If I had not looked at that question and had not copied the full HTML source (which my code loads with <code>flickrFile=open('flickrHtml.html',encoding="utf8")</code>), I would have ended up with <code>categoryElem</code> having a length of 24, and so would have downloaded only 24 pictures instead of 849. <blockquote> There are 4 million pictures, how do I download all of them, without having to download the HTML source to a separate file? </blockquote> I was thinking of the following plan to achieve this: <ol> <li>Get the URL of the first image of the search --&gt; download the image --&gt; get the URL of the next image --&gt; download the image ... and so on until there is nothing left to download</li> </ol> I did not go with this approach because I do not know how to get the link to the first image. I tried to get its URL, but when I inspect the element of the first picture (or any other picture) in the "photo stream", it gives me a link to a specific user's "photo stream", not to the general "sea search photo stream". <a href="https://www.flickr.com/search/?text=sea&view_all=1" rel="nofollow noreferrer">Here is the link for the photo stream Search</a> It would be great if someone could help me with that as well. <a href="https://josealermaiii.github.io/python-tutorials/_modules/AutomateTheBoringStuff/Ch11/Projects/P2_imageDownloader.html#main" rel="nofollow noreferrer">Here is some code</a> from someone who completed the same task, but he only downloaded the first 24 pictures, which are the ones present in the original, unrendered HTML.
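The "get a URL, download it, move on until nothing is left" plan can be sketched as a generator that walks the result pages and stops at the first page that yields nothing new. This is only a rough sketch under assumptions: `iter_image_urls` and the `fetch_page` callable are hypothetical names, and it assumes the site serves its search results page by page, which is not verified here for Flickr.

```python
from typing import Callable, Iterable, Iterator


def iter_image_urls(
    fetch_page: Callable[[int], Iterable[str]],
    max_pages: int = 5000,
) -> Iterator[str]:
    """Yield image URLs page by page until a page brings nothing new.

    fetch_page is a hypothetical callable: given a 1-based page number,
    it returns the image URLs found on that page of search results.
    """
    seen = set()
    for page in range(1, max_pages + 1):
        found_new = False
        for url in fetch_page(page):
            if url in seen:
                # Skip URLs already yielded from an earlier page.
                continue
            seen.add(url)
            found_new = True
            yield url
        if not found_new:
            # An empty or fully repeated page is treated as the end.
            break
```

The page-fetching logic is passed in as an argument so that the stopping rule can be tried out with a fake `fetch_page` before any real requests are made.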

1 answer

Anonymous, 1 day ago

If you want to use <code>requests + BeautifulSoup</code>, try the following (it passes the <code>page</code> parameter in the URL):
<pre><code>import os
import re
import threading

import requests
from bs4 import BeautifulSoup


def download_image(url):
    with open(os.path.basename(url), "wb") as f:
        f.write(requests.get(url).content)
    print(url, "downloaded successfully")


original_url = "https://www.flickr.com/search/?text=sea&view_all=1&page={}"
pages = range(1, 5000)  # not sure how many pages there are

for page in pages:
    concat_url = original_url.format(page)
    print("Now it is page", page)
    soup = BeautifulSoup(requests.get(concat_url).content, "lxml")
    soup_list = soup.select(".photo-list-photo-view")
    for element in soup_list:
        img_url = 'https:' + re.search(r'url\((.*)\)', element.get("style")).group(1)
        # The URL looks like: https://live.staticflickr.com/xxx/xxxxx_m.jpg
        # To get a clearer (and larger) picture, remove the trailing "_m" from the file name.
        # To avoid blocking on I/O, start a thread per download and pass it the image URL.
        threading.Thread(target=download_image, args=(img_url,)).start()
</code></pre>
<hr/>
If you use selenium, it may be simpler. Example code:
<pre class="lang-py prettyprint-override"><code>import os
import re
import threading

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By


def download_image(url):
    with open(os.path.basename(url), "wb") as f:
        f.write(requests.get(url).content)


driver = webdriver.Chrome()
original_url = "https://www.flickr.com/search/?text=sea&view_all=1&page={}"
pages = range(1, 5000)  # not sure how many pages there are

for page in pages:
    concat_url = original_url.format(page)
    print("Now it is page", page)
    driver.get(concat_url)
    for element in driver.find_elements(By.CSS_SELECTOR, ".photo-list-photo-view"):
        img_url = 'https:' + re.search(r'url\(\"(.*)\"\)', element.get_attribute("style")).group(1)
        # The URL looks like: https://live.staticflickr.com/xxx/xxxxx_m.jpg
        # To get a clearer (and larger) picture, remove the trailing "_m" from the file name.
        # To avoid blocking on I/O, start a thread per download and pass it the image URL.
        threading.Thread(target=download_image, args=(img_url,)).start()
</code></pre>
It downloaded the images successfully on my PC: <a href="https://i.stack.imgur.com/YWUMR.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/YWUMR.png" alt="enter image description here"/></a>
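Two details from the code comments above can be pulled out into small helpers: reading the image URL out of a `style` attribute, and dropping the `_m` suffix to ask for the larger picture. The sketch below also swaps the one-thread-per-image approach for a bounded `ThreadPoolExecutor`, and uses `urllib` instead of `requests` so that it only needs the standard library. All function names here (`url_from_style`, `larger_variant`, `save_image`, `download_all`) are hypothetical, and the `_m` handling rests only on the URL shape shown in the comments, so treat it as an assumption rather than documented Flickr behaviour.

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional
from urllib.request import urlopen

# Matches url(...) with or without quotes around the address.
_STYLE_URL = re.compile(r'url\(\s*["\']?(.*?)["\']?\s*\)')


def url_from_style(style: Optional[str]) -> Optional[str]:
    """Return the image URL from a CSS style string, or None if absent."""
    match = _STYLE_URL.search(style or "")
    if match is None:
        return None
    url = match.group(1)
    # Protocol-relative URLs ("//host/path") get an explicit scheme.
    return "https:" + url if url.startswith("//") else url


def larger_variant(url: str) -> str:
    """Drop a trailing "_m" before ".jpg" (assumed thumbnail marker)."""
    if url.endswith("_m.jpg"):
        return url[: -len("_m.jpg")] + ".jpg"
    return url


def save_image(url: str, folder: str) -> str:
    """Download one image into folder and return the saved path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, os.path.basename(url))
    with urlopen(url, timeout=30) as response, open(path, "wb") as f:
        f.write(response.read())
    return path


def download_all(
    urls: Iterable[str],
    folder: str = "FlickrImages",
    workers: int = 8,
    save: Callable[[str, str], str] = save_image,
) -> List[str]:
    """Download URLs with at most `workers` threads running at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: save(u, folder), urls))
```

A bounded pool keeps the number of simultaneous connections fixed, whereas starting one thread per image over thousands of pages can open far more connections than intended. The `save` parameter exists only so that the pool logic can be exercised without touching the network.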
