<p>The table is rendered with JavaScript, so <code>requests</code> cannot retrieve the table's HTML. Instead, use <code>selenium</code> to drive a real web browser to the site, then pass its <code>page_source</code> to <code>BeautifulSoup</code>.</p>
<p>I then iterate over the rows of the table, save each <code>insta_id</code> and <code>follower_count</code> to a list of dictionaries, and convert the result to a <code>pandas</code> <code>DataFrame</code>.</p>
<pre><code>from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = "https://starngage.com/app/global/influencer/ranking/india"
options = webdriver.ChromeOptions()
options.headless = False # page didn't fully load HTML in headless = True
driver = webdriver.Chrome(options=options)
driver.implicitly_wait(2)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()  # close the browser and end the WebDriver session
rows = soup.find('table').find_all('tr')
influencers = []
for row in rows[1:]:  # skip header row
    cols = row.find_all('td')
    # keep only the handle that follows the '@' in the third column
    insta_id = '@' + cols[2].text.split('@')[1]
    follower_count = cols[5].text
    influencers.append({'insta_id': insta_id, 'follower_count': follower_count})
df = pd.DataFrame(influencers)
print(df)
</code></pre>
<p>Output:</p>
<pre><code>            insta_id  follower_count
0    @priyankachopra           57.9M
1    @jacquelinef143           45.9M
2    @urvashirautela           31.7M
3       @kapilsharma           28.7M
4   @sachintendulkar           27.1M
..               ...             ...
95       @tonykakkar              4M
96     @reem_sameer8            3.9M
97  @mominamustehsan            3.9M
98           @bpraak            3.9M
99        @djbravo47            3.9M

[100 rows x 2 columns]
</code></pre>
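<p>The <code>follower_count</code> values above are strings such as <code>57.9M</code>, so they cannot be sorted or summed numerically as they are. If you need numbers, something along these lines might work. This is only a sketch: the helper name <code>parse_follower_count</code> is hypothetical, and the <code>K</code> and <code>B</code> suffixes are an assumption, since only <code>M</code> appears in the output shown.</p>

```python
from decimal import Decimal

# Assumed suffixes: only "M" appears in the scraped output; "K" and "B" are guesses.
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_follower_count(text: str) -> int:
    """Convert an abbreviated count such as '57.9M' to an integer."""
    cleaned = text.strip().replace(",", "").upper()
    if not cleaned:
        raise ValueError("empty follower count")
    suffix = cleaned[-1]
    if suffix in MULTIPLIERS:
        # Decimal avoids binary floating-point rounding on values like 57.9
        return int(Decimal(cleaned[:-1]) * MULTIPLIERS[suffix])
    return int(Decimal(cleaned))


print(parse_follower_count("57.9M"))  # 57900000
print(parse_follower_count("4M"))     # 4000000
```

<p>With the <code>df</code> built above, this could be applied as <code>df['followers'] = df['follower_count'].map(parse_follower_count)</code>, where <code>followers</code> is just an illustrative column name.</p>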