Scraping data from URLs within a web page using Beautiful Soup (Python)

Posted 2024-06-01 14:48:35


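I am trying to get data (Instagram ID and follower count) from the URLs within this web page: https://starngage.com/app/global/influencer/ranking/india

The element ID for the URL is: @priyankachopra

Similarly, I want to scrape the data from all the links in the same table.

Can someone tell me how to do this?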

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://starngage.com/app/global/influencer/ranking/india")

2 answers

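Some JavaScript is used to render the table, so requests cannot get the table HTML. Instead, use selenium to simulate a web browser visiting the site, then pass page_source to BeautifulSoup.

I then loop over the rows of the table, saving each insta_id and follower_count to a list of dictionaries, and finally convert everything to a pandas DataFrame.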

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://starngage.com/app/global/influencer/ranking/india"

options = webdriver.ChromeOptions()
options.headless = False # page didn't fully load HTML in headless = True

driver = webdriver.Chrome(options=options)
driver.implicitly_wait(2)
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit() # quit() ends the whole WebDriver session; close() only closes the current window

rows = soup.find('table').find_all('tr')

influencers = []
for row in rows[1:]: # skip header row
    cols = row.find_all('td')
    insta_id = '@' + cols[2].text.split('@')[1]
    follower_count = cols[5].text
    influencers.append({'insta_id': insta_id, 'follower_count': follower_count})

df = pd.DataFrame(influencers)

print(df)

                insta_id follower_count
    0    @priyankachopra          57.9M
    1    @jacquelinef143          45.9M
    2    @urvashirautela          31.7M
    3       @kapilsharma          28.7M
    4   @sachintendulkar          27.1M
    ..               ...            ...
    95       @tonykakkar             4M
    96     @reem_sameer8           3.9M
    97  @mominamustehsan           3.9M
    98           @bpraak           3.9M
    99        @djbravo47           3.9M
    
    [100 rows x 2 columns]
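
The follower_count column above holds text such as 57.9M, which sorts alphabetically rather than numerically. If numeric values are needed, a small helper can convert them. This is only a minimal sketch: parse_follower_count is a hypothetical name that does not appear in the answer, and it assumes the site uses no suffixes other than K, M and B.

```python
from decimal import Decimal

# Hypothetical helper (not part of the answer above).
# Assumption: counts look like "57.9M", "4M", "12.5K" or a plain number.
_SUFFIXES = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_follower_count(text: str) -> int:
    """Convert a display string such as '57.9M' to an integer."""
    cleaned = text.strip().upper().replace(",", "")
    if not cleaned:
        raise ValueError("empty follower count")
    if cleaned[-1] in _SUFFIXES:
        number, multiplier = cleaned[:-1], _SUFFIXES[cleaned[-1]]
    else:
        number, multiplier = cleaned, 1
    # Decimal avoids float rounding surprises when scaling values like 57.9
    return int(Decimal(number) * multiplier)


print(parse_follower_count("57.9M"))
print(parse_follower_count("4M"))
```

With the DataFrame from the answer, something like df['followers'] = df['follower_count'].map(parse_follower_count) would add a numeric column that can be sorted or summed.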

You can find the data directly in the HTML. Just use BeautifulSoup to extract what you need.

Here is the code:

import requests  # needed for requests.get() below
from bs4 import BeautifulSoup
from prettytable import PrettyTable

tb = PrettyTable(['Name', 'Insta_ID', 'Followers'])
url = 'https://starngage.com/app/global/influencer/ranking/india'
resp = requests.get(url)

soup = BeautifulSoup(resp.text, 'html.parser')

table = soup.find('table', class_='table-responsive-sm')
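rows = table.find_all('tr')

for row in rows[1:]:  # skip the header row
    # The third cell holds the display name followed by "@<insta_id>"
    temp = row.select_one("td:nth-of-type(3)").text
    # Split on the last '@' so a name that contains '@' does not break unpacking
    name, insta_id = temp.rsplit('@', 1)
    followers = row.select_one("td:nth-of-type(6)").text
    tb.add_row([name.strip(), insta_id.strip(), followers.strip()])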

print(tb)

Sample Output:

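+-------------------------------+---------------------------+-----------+
|              Name             |          Insta_ID         | Followers |
+-------------------------------+---------------------------+-----------+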
|     Priyanka Chopra Jonas     |       priyankachopra      |   57.9M   |
|      Jacqueline Fernandez     |       jacquelinef143      |   45.9M   |
|    URVASHI RAUTELA 🇮🇳Actor    |       urvashirautela      |   31.7M   |
|          Kapil Sharma         |        kapilsharma        |   28.7M   |
|        Sachin Tendulkar       |      sachintendulkar      |   27.1M   |
|          Amanda Cerny         |        amandacerny        |   25.5M   |
|             Mia K.            |         miakhalifa        |   21.9M   |
|         KARTIK AARYAN         |        kartikaaryan       |   19.7M   |
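
The question also asks about the links in the table. One way to check the row-parsing logic, including the link URL, is to run it offline against a small HTML snippet before pointing it at the live page. The sketch below is an illustration only: SAMPLE_HTML is made up to mimic the column layout both answers rely on (third cell with the name and @id, sixth cell with the follower count), the example.com URLs are placeholders rather than the site's real links, and parse_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page may be structured differently.
SAMPLE_HTML = """
<table class="table-responsive-sm">
  <tr><th>#</th><th></th><th>Name</th><th>Topics</th><th>Country</th><th>Followers</th></tr>
  <tr>
    <td>1</td><td></td>
    <td>Priyanka Chopra Jonas <a href="https://example.com/priyankachopra">@priyankachopra</a></td>
    <td></td><td>India</td><td>57.9M</td>
  </tr>
  <tr>
    <td>2</td><td></td>
    <td>Jacqueline Fernandez <a href="https://example.com/jacquelinef143">@jacquelinef143</a></td>
    <td></td><td>India</td><td>45.9M</td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row: name, insta_id, follower_count, url."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    results = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cols = row.find_all("td")
        if len(cols) < 6:
            continue  # not a data row in the assumed layout
        text = cols[2].get_text(" ", strip=True)
        name, sep, insta_id = text.rpartition("@")
        if not sep:
            continue  # no "@id" found in this cell
        link = cols[2].find("a")
        results.append({
            "name": name.strip(),
            "insta_id": "@" + insta_id.strip(),
            "follower_count": cols[5].get_text(strip=True),
            # href of the first link in the cell, or None if there is none
            "url": link.get("href") if link else None,
        })
    return results


for item in parse_rows(SAMPLE_HTML):
    print(item)
```

The same function could be given resp.text or driver.page_source from the answers above, provided the live table really follows this layout. Each collected url could then be fetched in turn to scrape the linked pages.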
