Extracting images from HTML using Newspaper

from newspaper import Article
import requests

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html = requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.set_html(raw_html)
article.top_image

2 Answers

User

Answer 1 · edited 2024-10-01 15:47:19

The newspaper Python module supports proxies, but this feature is not listed in the module's documentation.

Newspaper with a proxy

from newspaper import Article
from newspaper.configuration import Configuration

# add your corporate proxy information and test the connection
PROXIES = {
           'http': "http://ip_address:port_number",
           'https': "https://ip_address:port_number"
          }

config = Configuration()
config.proxies = PROXIES

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
articles = Article(url, config=config)
articles.download()
articles.parse()
print(articles.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg

Requests with a proxy, combined with Newspaper

import requests
from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
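# placeholder proxy settings; replace with your own values
proxy = {'http': "http://ip_address:port_number", 'https': "https://ip_address:port_number"}
raw_html = requests.get(url, verify=False, proxies=proxy)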
article = Article('')
article.download(raw_html.content)
article.parse()
print(article.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
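
For context on where a top-image URL like the one above can come from: many pages advertise their lead image through an Open Graph `og:image` meta tag, and as far as I know newspaper consults that tag when choosing `top_image`. The sketch below is not newspaper's implementation; it is a standard-library illustration of reading that tag from raw HTML. The names `OpenGraphImageParser`, `find_og_image`, and `sample_html` are made up for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class OpenGraphImageParser(HTMLParser):
    """Records the content of the first <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.image_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only until a URL has been found.
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def find_og_image(html: str) -> Optional[str]:
    """Return the og:image URL from an HTML string, or None if absent."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.image_url


# Hypothetical page used only for illustration.
sample_html = """
<html><head>
<meta property="og:title" content="Example article">
<meta property="og:image" content="https://example.com/lead.jpg">
</head><body><img src="/inline.png"></body></html>
"""
print(find_og_image(sample_html))
```

If a page has no such tag, this sketch returns None, whereas a full extractor would fall back to other heuristics.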

User
Answer 2 · edited 2024-10-01 15:47:19

First, make sure you are using python3 and that you have already run pip3 install newspaper3k.
Then, if the first version produces an SSL warning like the one below:
/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:981: InsecureRequestWarning: Unverified HTTPS request is being made to host 'fox13now.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  warnings.warn(
you can suppress it by adding:
import urllib3
urllib3.disable_warnings()
This should work:
from newspaper import Article
import urllib3

urllib3.disable_warnings()

url = "https://www.fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/"
article = Article(url)
article.download()
print(article.html)
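Run it with python3 <yourfile>.py
Setting the HTML on the article yourself won't do you much good, because the other fields won't be populated that way. Let me know whether this solves the problem or whether any other errors come up.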
