Python web scraping with Selenium, and loading the scraped data into a Pandas DataFrame

Published 2024-09-26 22:55:10


I am trying to scrape all the tracks, Roblox IDs, and ratings from https://robloxsong.com/ and load them into a pandas DataFrame. But when I run the code below, it gives me a single list in which all the tracks, IDs, and ratings are joined together with "\n". Also, I don't know how to go through all 50 pages and collect all of the data.

#Importing
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

webD = webdriver.Chrome(ChromeDriverManager().install())
webD.get('https://robloxsong.com/')

#Loading data from the songs tag
elements = webD.find_elements_by_class_name('songs')

#Declaring DataFrame
result = pd.DataFrame(columns = ['Track','Roblox_ID','Rating'])

#Extracting Text 
listOfElements = []
for i in elements:
    listOfElements.append(i.text)
    
print(listOfElements)

When I print the list of elements, this is the output:

>>> ["Track Roblox ID Rating\negg\n5128532009\nCopy\n30267\nCaillou Trap Remix\n212675193\nCopy\n26550\nZEROTWOOOOO\n3951847031\nCopy\n26045\nRUNNING IN THE OOFS! (EPIC)\n1051512943\nCopy\n25938\nSPOOKY SCARY SKELETONS (100,000+ sales)\n160442087\nCopy\n24106\nBanana Song (I'm A Banana)\n169360242\nCopy\n23065\nshrek anthem\n152828706\nCopy\n22810\nraining tacos\n142376088\nCopy\n19135\nGFMO - Hello (100k!!)\n214902446\nCopy\n19118\nWide Put in Walking Audio\n5356051569\nCopy\n13472\nRaining Tacos. (Original)\n142295308\nCopy\n13235\nNARWHALS\n130872377\nCopy\n12858\nOld Town Road\n2862170886\nCopy\n11888\nno\n130786686\nCopy\n11570\nCRAB RAVE OOF\n2590490779\nCopy\n11551\nKFC is illuminati confirmed ( ͡° ͜ʖ ͡° )\n205254380\nCopy\n10668\nNightcore - Titanium\n398159550\nCopy\n10667\nHelp Me Help You Logan Paul\n833322858\nCopy\n10631\nI Like Trains\n131072261\nCopy\n10271\nI'm Fine.\n513919776\nCopy\n9289\nAINT NOBODY GOT TIME FOR DAT\n130776739\nCopy\n9093\nRoxanne\n4277136473\nCopy\n8912\nFlamingo Intro\n6123746751\nCopy\n8836\nOld Town Road OOFED\n3180460921\nCopy\n8447\nWii Music\n1305251774\nCopy\n8364\nHow To Save A Life (Bass Boosted)\n727844285\nCopy\n8309\nDubstep Remix [26k+]\n130762736\nCopy\n8052\nEVERYBODY DO THE FLOP\n130778839\nCopy\n7962\nAnt, SeeDeng, Poke - PRESTONPLAYZ ROBLOX\n1096142805\nCopy\n7778\nYeah Yeah Yeahs - Heads Will Roll (JVH-C remix)\n290176752\nCopy\n7706\n♪ Nightcore - Light 'Em Up x Girl On Fire (S/V)\n587156015\nCopy\n7527\nDo the Harlem Shake!\n131154740\nCopy\n7314\nZero two but full song\n5060369688\nCopy\n7221\nInvinsible [NCS]\n6104227669\nCopy\n7011\nParty Music\n141820924\nCopy\n7009\n♫♫Ƴℴu'ѵҿ ßƏƏƝ ƮƦ☉ᏝᏝƎƊ♫♫\n142633540\nCopy\n6972\nRevenge (Minecraft Music)\n3807239428\nCopy\n6943\nOOF LASAGNA\n2866646141\nCopy\n6808\nAlbert Sings Despacito\n1398660411\nCopy\n6655\nDo A Barrel Roll!\n130791919\nCopy\n6647\nLadies And Gentlemen We Got Him\n2624663028\nCopy\n6642\nCreepy Music Box\n143382469\nCopy\n6516\nThe Roblox Song\n1784385682\nCopy\n6474\nZEROTWOOOOO with panda\n4459223174\nCopy\n6362\nsad violin\n135308045\nCopy\n6261\noofing in the 90's\n915288747\nCopy\n6092\nElevator Music\n130768299\nCopy\n5998\nFEED ME!\n130766856\nCopy\n5909\nTanqR Outro\n5812114304\nCopy\n5859\nMako - Beam (Proximity)\n165065112\nCopy\n5787"]

Two questions need answering:

  1. How should I put this into a DataFrame? (a sketch follows this list)
  2. How do I get the data from all 50 pages?
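
For the first question, the Selenium `.text` dump can be parsed directly: as the output above shows, after a header line each record occupies four consecutive lines (track name, Roblox ID, the literal word "Copy", and the rating). Here is a minimal sketch, assuming that four-line pattern holds for every row:

import pandas as pd

rows = []
for block in listOfElements:
    lines = block.split('\n')
    # Drop the "Track Roblox ID Rating" header line if present
    if lines and lines[0].startswith('Track'):
        lines = lines[1:]
    # Walk the lines four at a time: track, ID, "Copy", rating
    for i in range(0, len(lines) - 3, 4):
        track, roblox_id, _copy, rating = lines[i:i + 4]
        rows.append({'Track': track, 'Roblox_ID': roblox_id, 'Rating': rating})

result = pd.DataFrame(rows, columns=['Track', 'Roblox_ID', 'Rating'])
print(result.head())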

1 Answer

#1 · Posted by a user on 2024-09-26 22:55:10

Just use the requests library to fetch each page and let pandas parse the tables out of the page HTML. To get all the pages, each page has to be fetched and parsed separately. The following code parses all the pages into a single DataFrame:

import requests
import pandas as pd

def parse_tables(page_html):
    page_tables = pd.read_html(page_html)    # directly parse tables into pandas dataframe
    column_names = page_tables[0].columns    # save column names for later use
    page_tables[0].columns = range(len(column_names))
    # data is present in multiple <table> html tags but it looks like a single table, so combine data from all <table> tags
    df = pd.concat(page_tables)
    df.columns = column_names
    return df.reset_index(drop=True)

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
base_url = "https://robloxsong.com/"
num_pages = 50    # number of pages that you want to parse
ratings_tables = []

for page_num in range(1, num_pages + 1):
    page_url = base_url + "?page=" + str(page_num)
    print("Parsing page " + str(page_num))
    response = requests.get(page_url, headers=headers)   # fetch the html page
    if response.ok:
        page_html = response.content
        page_table = parse_tables(page_html)
        ratings_tables.append(page_table)
    else:
        print("Unable to fetch page:", response.content)

final_ratings_table = pd.concat(ratings_tables).reset_index(drop=True)
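
Once the loop finishes, the combined DataFrame can be checked and saved; for example (the CSV filename here is just an illustration):

# Quick sanity check of the combined table
print(final_ratings_table.shape)   # total rows across all fetched pages
print(final_ratings_table.head())

# Persist the result for later analysis (example filename)
final_ratings_table.to_csv("robloxsong_ratings.csv", index=False)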
