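I'm writing this program for my A Level Computer Science coursework, and I'm trying to get a spider to scrape every user found in a given user's follow list.
The start of the script looks like this: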
import requests
# import database as db
from bs4 import BeautifulSoup

debug = True

def getStartNode():  # Get the Twitter profile of the starting node
    global startNodeFollowing  # Declare the nodes vars as global for use in external functions
    global startNodeFollowers
    global startNodeLink
    if not debug:  # If debugging == False, allow the user to enter any starting node Twitter profile
        startNodeLink = input("Enter a link to the starting users Twitter profile\n[URL]: ")[:-1]  # Get profile link, remove the last char from input (space char, needed to enter link in terminal)
    else:  # If debugging == True, have predetermined starting node to save time during development
        startNodeLink = ("https://twitter.com/ckjellberg03")
    startNodeFollowers = (startNodeLink + "/followers")  # Create a new var using the starting node's Twitter profile, append for followers and following URL pages
    startNodeFollowing = (startNodeLink + "/following")
The spider is here:
def spider():  # Web Crawler
    getStartNode()
    print("\nUsing:", startNodeLink)
    urlFollowers = startNodeFollowers
    sourceCode = requests.get(urlFollowers)
    plainText = sourceCode.text  # Source code of the URL (urlFollowers) in plain text format
    soup = BeautifulSoup(plainText, 'lxml')  # BeautifulSoup object to search through plainText for specific items/classes etc
    for link in soup.findAll('a', {'class': 'css-4rbku5 css-18t94o4 css-1dbjc4n r-1loqt21 r-1wbh5a2 r-dnmrzs r-1ny4l3l'}):  # 'a' is a link in HTML (anchor), class is the Twitter class for a profile
        href = link.get('href')  # The attribute name must be a string; a bare href would raise NameError
        print(href)  # Display everything found (development purposes)
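From the page source, I'm fairly sure the class identifier for the link to a user's Twitter profile on the /followers page is "css-4rbku5 css-18t94o4 css-1dbjc4n r-1loqt21 r-1wbh5a2 r-dnmrzs r-1ny4l3l", but the print statement doesn't display anything.
Any suggestions to point me in the right direction?
Thanks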
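One way to test the assumption about the class identifier is to check whether any matching anchors exist in the HTML that requests actually receives, since markup seen in the browser's inspector may have been added by JavaScript after the page loaded. The sketch below uses only the standard library's html.parser; the helper name find_profile_links and both sample HTML strings are hypothetical and are not taken from Twitter's real pages.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags whose class attribute contains every wanted token."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes.split())
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute can be missing or valueless, so fall back to "".
        tokens = set((attr_map.get("class") or "").split())
        if self.wanted <= tokens:
            self.hrefs.append(attr_map.get("href"))


def find_profile_links(html, wanted_classes):
    """Hypothetical helper: return hrefs of anchors carrying all wanted classes."""
    parser = AnchorCollector(wanted_classes)
    parser.feed(html)
    return parser.hrefs


# Hypothetical samples: markup as a browser might show it after scripts run,
# versus a bare response that contains no profile anchors at all.
rendered = '<div><a class="css-4rbku5 r-1loqt21" href="/someuser">Some User</a></div>'
unrendered = "<html><body><noscript>JavaScript is not available.</noscript></body></html>"

print(find_profile_links(rendered, "css-4rbku5 r-1loqt21"))    # ['/someuser']
print(find_profile_links(unrendered, "css-4rbku5 r-1loqt21"))  # []
```

If the same check on sourceCode.text returns an empty list, the loop in spider() never runs, which would explain why nothing is printed.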
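Scraping Twitter is very hard (trust me, I have tried every approach). You can use the Twitter API, but it has limits (you cannot simply get followers' names and numbers). If you want to scrape some information with the Twitter API, you can use the following code: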
Here is how to do it without the API. Some of the difficulty comes from using the right browser in the User-Agent.
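The answer's original code is not reproduced here. As a minimal sketch of the User-Agent point only, the snippet below builds a request that carries a browser-like User-Agent header using the standard library's urllib.request. The User-Agent string and the helper name build_request are assumptions chosen for illustration, and nothing here guarantees that Twitter will return the follower list to such a request.

```python
import urllib.request

# Hypothetical browser-like User-Agent string (an assumption, not a required value).
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Hypothetical helper: return a Request object that sends the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("https://twitter.com/ckjellberg03/followers")

# urllib stores header names capitalized as "User-agent", so look it up that way.
print(request.get_header("User-agent"))

# Sending it would be urllib.request.urlopen(request); that call is left out
# here so the sketch runs without network access.
```

With the requests library used in the question, the equivalent is to pass the same dictionary as requests.get(url, headers={"User-Agent": ...}).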