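Spoofing IP address when web scraping (Python)

3 Answers

User

#1 · edited 2024-06-24 13:38:31

To get around IP rate limits and hide your real IP, you need to use proxies. Many services offer them. Consider using one, since managing proxies yourself is a real headache and ends up costing more. I suggest https://botproxy.net, among others. They provide rotating proxies through a single endpoint. Here is how to make a request through this service: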

#!/usr/bin/env python
import urllib.request
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler(
        {'http': 'http://user-key:key-password@x.botproxy.net:8080',
         'https': 'http://user-key:key-password@x.botproxy.net:8080'}))
print(opener.open('https://httpbin.org/ip').read())

Or using the requests library:

(code snippet missing in the source)
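A minimal sketch of what a requests-based version could look like, reusing the placeholder credentials and endpoint from the urllib example above. This is an assumption about the shape of the call, not the answer's original snippet; the helper names `build_proxies` and `fetch_ip` are hypothetical.

```python
# Placeholder credentials and endpoint copied from the urllib example above.
PROXY_URL = 'http://user-key:key-password@x.botproxy.net:8080'


def build_proxies(proxy_url):
    """Return the mapping that requests expects for its `proxies` argument."""
    # The same HTTP proxy endpoint is used for both plain and TLS traffic.
    return {'http': proxy_url, 'https': proxy_url}


def fetch_ip(proxy_url=PROXY_URL, timeout=10):
    """Ask httpbin which IP address the request appeared to come from."""
    # Imported lazily: only this function needs the third-party requests package.
    import requests

    response = requests.get('https://httpbin.org/ip',
                            proxies=build_proxies(proxy_url),
                            timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `print(fetch_ip())` with real credentials should print the proxy's address rather than your own.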

They also have proxies in different countries.

User

#2 · edited 2024-06-24 13:38:31

I ran into the same problem a while ago. Here is the code snippet I use to scrape anonymously.

from urllib.request import Request, urlopen
from fake_useragent import UserAgent
import random
from bs4 import BeautifulSoup
from IPython.display import clear_output  # public import path for clear_output

# Here I provide some proxies for not getting caught while scraping
ua = UserAgent() # From here we generate a random user agent
proxies = [] # Will contain proxies [ip, port]

# Main function
def main():
  # Retrieve latest proxies
  proxies_req = Request('https://www.sslproxies.org/')
  proxies_req.add_header('User-Agent', ua.random)
  proxies_doc = urlopen(proxies_req).read().decode('utf8')

  soup = BeautifulSoup(proxies_doc, 'html.parser')
  proxies_table = soup.find(id='proxylisttable')

  # Save proxies in the array
  for row in proxies_table.tbody.find_all('tr'):
    proxies.append({
      'ip':   row.find_all('td')[0].string,
      'port': row.find_all('td')[1].string
    })

  # Choose a random proxy
  proxy_index = random_proxy()
  proxy = proxies[proxy_index]

  for n in range(1, 20):
    req = Request('http://icanhazip.com')
    req.set_proxy(proxy['ip'] + ':' + proxy['port'], 'http')

    # Every 10 requests, generate a new proxy
    if n % 10 == 0:
      proxy_index = random_proxy()
      proxy = proxies[proxy_index]

    # Make the call
    try:
      my_ip = urlopen(req).read().decode('utf8')
      print('#' + str(n) + ': ' + my_ip)
      clear_output(wait = True)
    except Exception: # If error, delete this proxy and find another one
      del proxies[proxy_index]
      print('Proxy ' + proxy['ip'] + ':' + proxy['port'] + ' deleted.')
      proxy_index = random_proxy()
      proxy = proxies[proxy_index]

# Retrieve a random index proxy (we need the index to delete it if not working)
def random_proxy():
  return random.randint(0, len(proxies) - 1)

if __name__ == '__main__':
  main()

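The code above builds a list of working proxies. And this part:

(code snippet missing in the source)

creates different headers that pretend to be a browser. Last but not least, just pass these into your request():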

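# Make a GET request with a random User-Agent through a random proxy
from requests import get

user_agent = random.choice(user_agent_list)  # user_agent_list: the header list described above
headers = {'User-Agent': user_agent, 'Accept-Language': 'en-US, en;q=0.5'}
proxy = random.choice(proxies)  # an {'ip': ..., 'port': ...} entry from the proxy list
proxy_url = 'http://' + proxy['ip'] + ':' + proxy['port']
# requests expects a scheme -> proxy URL mapping, not the raw ip/port dict
response = get("your url", headers=headers, proxies={'http': proxy_url, 'https': proxy_url})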
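The `user_agent_list` this answer relies on is not preserved in this copy, so here is a rough sketch of how such a list and the proxy conversion might look. The three User-Agent strings are illustrative examples, and `random_headers` and `proxy_to_requests` are hypothetical helper names, not part of the original answer.

```python
import random

# Hypothetical, shortened list; the answer's own list is not available here.
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
]


def random_headers(user_agents=USER_AGENT_LIST):
    """Build request headers with a randomly chosen User-Agent."""
    return {
        'User-Agent': random.choice(user_agents),
        'Accept-Language': 'en-US, en;q=0.5',
    }


def proxy_to_requests(proxy):
    """Turn an {'ip': ..., 'port': ...} entry into a requests-style proxies dict."""
    url = 'http://{ip}:{port}'.format(**proxy)
    return {'http': url, 'https': url}
```

With these in place, a call could read `get(url, headers=random_headers(), proxies=proxy_to_requests(random.choice(proxies)))`.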

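Hope that solves your problem.

Otherwise, take a look here: https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

Cheers

User

#3 · edited 2024-06-24 13:38:31

This may help with anonymous browsing. You can use some free proxy sites to get proxies and update proxy = {}.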

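import requests
from bs4 import BeautifulSoup

url = ''  # the page you want to fetch
# Fill in a proxy address after each "http://", e.g. one taken from a free proxy list
proxy = {"http": "http://", "https": "http://"}
session = requests.Session()
response = session.get(url, headers={'User-Agent': 'Mozilla/5.0'}, proxies=proxy)
# BeautifulSoup needs the response body, not the Response object
content = BeautifulSoup(response.text, 'html.parser')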
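Free proxies are often dead, so filling in `proxy` usually means trying several candidates. The following is one possible sketch of that step, assuming the candidates are plain `ip:port` strings; `first_working_proxy` and `check_with_requests` are hypothetical names, and the httpbin URL is just a convenient test target.

```python
def first_working_proxy(candidates, check):
    """Return the first proxies dict for which check(proxies) is true, else None."""
    for address in candidates:
        proxies = {'http': 'http://' + address, 'https': 'http://' + address}
        try:
            if check(proxies):
                return proxies
        except Exception:
            # Connection errors and timeouts just mean: try the next candidate.
            continue
    return None


def check_with_requests(proxies, url='https://httpbin.org/ip', timeout=5):
    """Report whether a simple GET through the given proxies succeeds."""
    # Imported lazily: only this function needs the third-party requests package.
    import requests

    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'},
                            proxies=proxies, timeout=timeout)
    return response.ok
```

The result of `first_working_proxy(addresses, check_with_requests)` can be passed directly as `proxies=` to `session.get`, after checking that it is not `None`.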
