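<p>In my case, at least a <code>User-Agent</code> header is required; with that in place, <code>requests</code> works. You can then gather the parent nodes with a CSS class selector, loop over them, and extract the required information into a dataframe, again using faster, shorter CSS selectors. As mentioned, the key in this case is to gather all the parent nodes with <code>select</code>. This has less overhead than Selenium.</p>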
<hr/>
<p><strong>Py:</strong></p>
<pre><code>from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

# A User-Agent header is needed for the request to succeed
r = requests.get('https://rolltide.com/roster.aspx?roster=226&path=football', headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.content, 'lxml')

results = {}
# Each .sidearm-roster-player element is the parent node for one player
for num, p in enumerate(soup.select('.sidearm-roster-player')):
    results[num] = {
        'position': p.select_one('.sidearm-roster-player-position >span:first-child').text.strip(),
        'height': p.select_one('.sidearm-roster-player-height').text,
        'weight': p.select_one('.sidearm-roster-player-weight').text,
        'number': p.select_one('.sidearm-roster-player-jersey-number').text,
        'name': p.select_one('.sidearm-roster-player-name a').text,
        'year': p.select_one('.sidearm-roster-player-academic-year').text,
        'hometown': p.select_one('.sidearm-roster-player-hometown').text,
        'highschool': p.select_one('.sidearm-roster-player-highschool').text,
    }

# One row per player, with the columns in a fixed order
df = pd.DataFrame(results.values(), columns=['position', 'height', 'weight', 'number', 'name', 'year', 'hometown', 'highschool'])
print(df)
</code></pre>
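<p>One caveat with the parent-node loop: <code>select_one</code> returns <code>None</code> when a child node is missing, so calling <code>.text</code> on it raises an error. The following is a minimal sketch of one way to guard against that, run against made-up markup rather than the live page. The <code>text_or_none</code> helper and the sample HTML are assumptions for illustration only; only the class names are taken from the code above.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the roster's class names; not copied from the site
html = """
<ul>
  <li class="sidearm-roster-player">
    <div class="sidearm-roster-player-name"><a href="#">Player One</a></div>
    <span class="sidearm-roster-player-height">6-2</span>
  </li>
  <li class="sidearm-roster-player">
    <div class="sidearm-roster-player-name"><a href="#">Player Two</a></div>
  </li>
</ul>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match under parent, or None if absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


# html.parser ships with Python, so no extra parser is needed for this sketch
soup = BeautifulSoup(html, 'html.parser')

# Gather the parent nodes once, then query each one for its children
rows = [
    {
        'name': text_or_none(p, '.sidearm-roster-player-name a'),
        'height': text_or_none(p, '.sidearm-roster-player-height'),
    }
    for p in soup.select('.sidearm-roster-player')
]
print(rows)
```

<p>A list of dicts like <code>rows</code> can be passed straight to <code>pd.DataFrame</code>, where missing values show up as <code>None</code> instead of stopping the scrape.</p>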
<hr/>
<p><strong>R:</strong></p>
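<p><code>purrr</code> handles the loop over the parent nodes and writes the results to the df. <code>str_squish</code> from <code>stringr</code> tidies the output of one of the child nodes inside the loop. <code>httr</code> supplies the headers.</p>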
<pre><code>library(httr)
library(purrr)
library(rvest)
library(stringr)

# A User-Agent header is needed for the request to succeed
headers = c('User-Agent' = 'Mozilla/5.0')
pg <- content(httr::GET(url = 'https://rolltide.com/roster.aspx?roster=226&path=football', httr::add_headers(.headers=headers)))

# Each .sidearm-roster-player node is the parent node for one player;
# map_df builds one row per parent and binds the rows together
df <- map_df(pg%>%html_nodes('.sidearm-roster-player'), function(item) {
  data.frame(
    # str_squish removes the surplus whitespace in the position text
    position = str_squish(item%>%html_node('.sidearm-roster-player-position >span:first-child')%>%html_text()),
    height = item%>%html_node('.sidearm-roster-player-height')%>%html_text(),
    weight = item%>%html_node('.sidearm-roster-player-weight')%>%html_text(),
    number = item%>%html_node('.sidearm-roster-player-jersey-number')%>%html_text(),
    name = item%>%html_node('.sidearm-roster-player-name a')%>%html_text(),
    year = item%>%html_node('.sidearm-roster-player-academic-year')%>%html_text(),
    hometown = item%>%html_node('.sidearm-roster-player-hometown')%>%html_text(),
    highschool = item%>%html_node('.sidearm-roster-player-highschool')%>%html_text(),
    stringsAsFactors=FALSE)
})
View(df)
</code></pre>