Automate the Boring Stuff: Image Site Downloader



I am working on a project from the book Automate the Boring Stuff. The task is as follows: Image Site Downloader. Write a program that goes to a photo-sharing site like Flickr or Imgur, searches for a category of photos, and then downloads all the resulting images. You could write a program that works with any photo site that has a search feature.

Here is my code:

<pre><code>import os
import re

import bs4
import requests

# The outerHTML file, which I got by right-clicking the &lt;html&gt; tag
# in 'page source' and copying it
with open('flickrHtml.html', encoding='utf8') as flickrFile:
    # Parse the HTML document
    flickrSoup = bs4.BeautifulSoup(flickrFile, 'html.parser')

# categoryElem holds the elements that link to each photo's page
categoryElem = flickrSoup.select("a[class='overlay']")  # len(categoryElem) == 849
os.makedirs('FlickrImages', exist_ok=True)

# Matches the protocol-relative .jpg URL inside the picElem HTML
jpgRegex = re.compile(r'//live.*\.jpg')

for elem in categoryElem:
    # Read the relative link from the href attribute
    href = elem.get('href')
    if not href:
        continue
    imageUrlFlickr = 'https://www.flickr.com' + href

    # Download the photo page and parse it
    res = requests.get(imageUrlFlickr)
    imageSoup = bs4.BeautifulSoup(res.text, 'html.parser')
    picElem = imageSoup.select('div[class="view photo-well-media-scrappy-view requiredToShowOnServer"] img')

    # Search the picElem HTML for the jpg file
    mo = jpgRegex.search(str(picElem))
    if mo is None:
        print('There was a problem: no image URL found on %s' % imageUrlFlickr)
        continue
    imageUrlRegex = mo.group()

    res1 = requests.get('https:' + imageUrlRegex)
    try:
        res1.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        continue

    # Save the jpg to my folder
    with open(os.path.join('FlickrImages', os.path.basename(imageUrlRegex)), 'wb') as imageFile:
        for chunk in res1.iter_content(100000):
            imageFile.write(chunk)
</code></pre>

After looking at <a href="https://stackoverflow.com/questions/6364138/how-to-get-fully-computed-html-instead-of-source-html">this question</a>, I figured that in order to download all 4 million results for pictures of "Sea", I should copy the whole outerHTML (as described in the answer to that question). If I had not looked at that question and had not copied the full HTML source (which my code reads from <code>flickrHtml.html</code>), <code>categoryElem</code> would have ended up with only 24 elements, so I would have downloaded 24 pictures instead of 849.

<blockquote> There are 4 million pictures, how do I download all of them, without having to download the HTML source to a separate file? </blockquote>

I was thinking of having my program do the following:

<ol> <li>Get the URL of the first picture of the search --&gt; download the picture --&gt; get the URL of the next picture --&gt; download the picture ... and so on until there is nothing left to download</li> </ol>

I did not go with this approach because I don't know how to get the link to the first picture. I tried to get its URL, but when I inspect the element of the first picture (or any other picture) from the "Photostream", it gives me a link to a specific user's "Photostream", not to the general "Sea search Photostream".

<a href="https://www.flickr.com/search/?text=sea&view_all=1" rel="nofollow noreferrer">Here is the link for the photo stream Search</a>

It would be great if someone could help me with that as well.

<a href="https://josealermaiii.github.io/python-tutorials/_modules/AutomateTheBoringStuff/Ch11/Projects/P2_imageDownloader.html#main" rel="nofollow noreferrer">Here is some code</a> from someone who completed the same task, but it only downloads the first 24 pictures, which are the ones present in the original, unrendered HTML.
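One possible way to avoid saving the rendered HTML at all is to skip the search page and walk a paginated search API page by page, which is close to the "get a URL, download, get the next one" plan above. The sketch below is only an illustration under stated assumptions: it assumes the Flickr REST method <code>flickr.photos.search</code> with a JSON response shaped like <code>{"photos": {"pages": ..., "photo": [...]}}</code>, that each record carries <code>id</code>, <code>server</code> and <code>secret</code>, and that static image URLs follow the <code>live.staticflickr.com</code> pattern. The names <code>photo_url</code>, <code>iter_photo_urls</code> and <code>make_flickr_fetcher</code> are hypothetical helpers, and an API key of your own is required. As far as I know the API also caps how many results a single query returns, so all 4 million pictures may not be reachable through one search.

```python
from typing import Callable, Iterator, Optional


def photo_url(photo: dict, size_suffix: str = "b") -> str:
    """Build a static image URL from one search record (field names are assumptions)."""
    return "https://live.staticflickr.com/{server}/{id}_{secret}_{size}.jpg".format(
        server=photo["server"], id=photo["id"], secret=photo["secret"], size=size_suffix
    )


def iter_photo_urls(
    fetch_page: Callable[[int], dict], max_pages: Optional[int] = None
) -> Iterator[str]:
    """Yield image URLs page by page until the reported last page (or max_pages)."""
    page = 1
    while True:
        data = fetch_page(page)["photos"]
        for photo in data["photo"]:
            yield photo_url(photo)
        # Stop at the last page the service reports, or at the caller's limit
        if page >= int(data["pages"]) or (max_pages is not None and page >= max_pages):
            break
        page += 1


def make_flickr_fetcher(api_key: str, text: str, per_page: int = 100) -> Callable[[int], dict]:
    """Return a fetch_page(page) function backed by the assumed REST endpoint."""
    import requests  # imported lazily so the helpers above work without it

    def fetch_page(page: int) -> dict:
        res = requests.get(
            "https://www.flickr.com/services/rest/",
            params={
                "method": "flickr.photos.search",
                "api_key": api_key,
                "text": text,
                "per_page": per_page,
                "page": page,
                "format": "json",
                "nojsoncallback": 1,
            },
            timeout=30,
        )
        res.raise_for_status()
        return res.json()

    return fetch_page
```

A call such as <code>iter_photo_urls(make_flickr_fetcher("YOUR_API_KEY", "sea"), max_pages=5)</code> would then produce URLs that can be passed to the same <code>requests.get</code> / <code>iter_content</code> download loop used in the question. Passing the fetch function in as a parameter keeps the paging logic separate from the network call, so it can be tried out with canned data first.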
