Trying to loop through a web page to scrape every football player's name, but only getting the first one?

Posted 2024-09-30 05:15:42

I am trying to scrape the names of all players on the Alabama football roster, which can be found here: https://rolltide.com/roster.aspx?roster=226&path=football

I can get the first player's name, but the script stops after him and does not return any other player's name.

Here is my code:


DesiredRoster = (URLEntry.get())

driver = webdriver.Firefox()

driver.get(DesiredRoster)

#Player Name

Name = driver.find_element_by_class_name('sidearm-roster-player-name')
PlayerName = Name.find_element_by_tag_name('a').text
print(PlayerName)

How can I loop through this page to get all of the names?


numbers = driver.find_elements_by_class_name('sidearm-roster-player-jersey-number')
print(numbers.text)

AttributeError: 'list' object has no attribute 'text'

Oddly, if I change elements to element, it prints the first player's number.


3 Answers

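In my case, at least a User-Agent header was required, after which I could use requests. You can then collect the parent nodes with a CSS class selector, loop over those parents, and extract the required information into a DataFrame, again using CSS selectors, which are faster and shorter. As already mentioned, the key here is to use select to gather all of the parent nodes. This carries less overhead than Selenium.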


Py:

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

r = requests.get('https://rolltide.com/roster.aspx?roster=226&path=football', headers = {'User-Agent':'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
results = {}

for num, p in enumerate(soup.select('.sidearm-roster-player')):
    results[num] = {'position': p.select_one('.sidearm-roster-player-position >span:first-child').text.strip()
           ,'height': p.select_one('.sidearm-roster-player-height').text
           ,'weight': p.select_one('.sidearm-roster-player-weight').text
           ,'number': p.select_one('.sidearm-roster-player-jersey-number').text
           ,'name': p.select_one('.sidearm-roster-player-name a').text
           ,'year': p.select_one('.sidearm-roster-player-academic-year').text
           ,'hometown': p.select_one('.sidearm-roster-player-hometown').text
           ,'highschool': p.select_one('.sidearm-roster-player-highschool').text
          }
df = pd.DataFrame(results.values(), columns = ['position','height','weight','number','name','year','hometown','highschool'])
print(df)
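
One caveat with the loop above: select_one returns None when a selector matches nothing, so a single player without, say, a high school entry would raise AttributeError on .text. A minimal sketch of a guard follows. The helper name safe_text, the stub classes, and the placeholder name are hypothetical and exist only so the example runs without the network; real code would pass each BeautifulSoup parent node p instead.

```python
from typing import Optional


def safe_text(parent, selector: str) -> Optional[str]:
    """Return the stripped text of the first match under `parent`, or None.

    `parent` is anything with a BeautifulSoup-style `select_one` method.
    """
    node = parent.select_one(selector)
    return node.text.strip() if node is not None else None


# --- Stand-ins for bs4 objects (assumptions, for demonstration only) ---
class StubNode:
    def __init__(self, text: str):
        self.text = text


class StubParent:
    def __init__(self, children: dict):
        self._children = children

    def select_one(self, selector: str):
        # Mimics bs4: returns None when nothing matches the selector.
        return self._children.get(selector)


# Placeholder data: one player with a name but no high school entry.
player = StubParent({'.sidearm-roster-player-name a': StubNode('  Jane Doe ')})
name = safe_text(player, '.sidearm-roster-player-name a')
school = safe_text(player, '.sidearm-roster-player-highschool')
print(name, school)
```

In the dictionary built inside the loop, each p.select_one(...).text could then become safe_text(p, ...), leaving missing fields as None in the DataFrame instead of aborting the scrape.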

R:

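purrr handles the loop over the parent nodes and writes the results to df. str_squish from stringr tidies the output of one child node inside the loop. httr supplies the headers.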

library(httr)
library(purrr)
library(rvest)
library(stringr)

headers = c('User-Agent' = 'Mozilla/5.0')
pg <- content(httr::GET(url = 'https://rolltide.com/roster.aspx?roster=226&path=football', httr::add_headers(.headers=headers)))

df <- map_df(pg%>%html_nodes('.sidearm-roster-player'), function(item) {

     data.frame(position = str_squish(item%>%html_node('.sidearm-roster-player-position >span:first-child')%>%html_text()),
                height = item%>%html_node('.sidearm-roster-player-height')%>%html_text(),
                weight = item%>%html_node('.sidearm-roster-player-weight')%>%html_text(),
                number = item%>%html_node('.sidearm-roster-player-jersey-number')%>%html_text(),
                name = item%>%html_node('.sidearm-roster-player-name a')%>%html_text(),
                year = item%>%html_node('.sidearm-roster-player-academic-year')%>%html_text(),
                hometown = item%>%html_node('.sidearm-roster-player-hometown')%>%html_text(),
                highschool = item%>%html_node('.sidearm-roster-player-highschool')%>%html_text(),
                stringsAsFactors=FALSE)
     })

View(df)

For anyone who wants to use R (rvest), here is code that collects the roster data into a data frame:

library(tidyverse)
library(magrittr)
library(rvest)

url <- "https://rolltide.com/roster.aspx?roster=226&path=football"
page <- url %>% read_html()

position <- list()
height <- list()
weight <- list()
number <- list()
name <- list()
yr <- list()
hometown <- list()
high.school <- list()

for (i in seq(1,250)) {
    position[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[1]/span[1]/text()')) %>% xml_text %>% str_trim
    height[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[1]/span[2]')) %>% xml_text
    weight[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[1]/span[3]/text()')) %>% xml_text
    number[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[2]/span/span')) %>% xml_text
    name[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[1]/div[2]/div[2]/p/a')) %>% xml_text
    yr[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[2]/div[1]/span[1]')) %>% xml_text
    hometown[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[2]/div[1]/span[2]/text()')) %>% xml_text
    high.school[[i]] <- page %>% html_nodes(xpath=paste0('//*[@id="main-content"]/article/div[4]/div/div[1]/div[2]/div[1]/section/ul/li[',i,']/div[1]/div[2]/div[1]/span[3]/text()')) %>% xml_text
}

position    %<>% tibble %>% unnest
height      %<>% tibble %>% unnest
weight      %<>% tibble %>% unnest
number      %<>% tibble %>% unnest
name        %<>% tibble %>% unnest
yr          %<>% tibble %>% unnest
hometown    %<>% tibble %>% unnest
high.school %<>% tibble %>% unnest

final <- bind_cols(position,height,weight,number,name,yr,hometown,high.school)
names(final) <- c("position","height","weight","number","name","yr","hometown","high.school")

The trick is to use XPath rather than CSS selectors, passing xpath= in the html_nodes() call.

This is admittedly a bit ugly, but it does not require Selenium or any other heavy setup.

Edit: see QHarr's answer above for more streamlined code.

You are using the driver method find_element_by_class_name, which returns only a single element. Switch to find_elements_by_class_name to get a list, then iterate over that list:

names = driver.find_elements_by_class_name('sidearm-roster-player-name')
for name in names:
    player_name = name.find_element_by_tag_name('a').text
    print(player_name)
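
The same pattern resolves the AttributeError from the question: find_elements_by_class_name returns a plain Python list, and .text lives on each element rather than on the list. Below is a small sketch, with a hypothetical helper collect_texts and a stand-in FakeElement class so that it runs without a browser. With Selenium the list would come from the driver instead; note that in Selenium 4 the find_elements_by_* helpers were removed in favor of driver.find_elements(By.CLASS_NAME, ...).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeElement:
    """Stand-in for a Selenium WebElement (assumption, for demonstration)."""
    text: str


def collect_texts(elements: Iterable) -> List[str]:
    """Read `.text` from each element; the list itself has no `.text`."""
    return [element.text.strip() for element in elements]


# With Selenium this list would be something like:
#   numbers = driver.find_elements_by_class_name('sidearm-roster-player-jersey-number')
numbers = [FakeElement('1'), FakeElement(' 2 '), FakeElement('10')]
jersey_numbers = collect_texts(numbers)
print(jersey_numbers)
```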
