Why does the BeautifulSoup library always skip one specific <tr> element?

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.worldometers.info/coronavirus/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
all_rows = soup.find_all("tr")
startingIndex = None
for index, each in enumerate(all_rows, start=0):
    if "World" in each.text:
        # After the word "World" come the TR elements of the individual countries.
        startingIndex = index
        break
top10 = all_rows[startingIndex+1:startingIndex+11]  # here I select the top 10 countries that I need
for index, each in enumerate(top10, start=1):
    droebiti_list = each.text.split("\n")
    print(f"{index}){droebiti_list[1]} - {droebiti_list[6]}")  # and print info about recovered people

2 answers

User

Answer 1 · edited 2024-09-30 08:30:49

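The order of the countries in the page source (the content variable) differs from the order shown in the rendered table; the order may be changed by a JavaScript script or for some other reason.

So you can collect all the data and re-sort it yourself by total cases.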

import requests
from bs4 import BeautifulSoup

# Get the page source and parse it
r = requests.get("https://www.worldometers.info/coronavirus/")
contents = r.content
soup = BeautifulSoup(contents, "html.parser")
table = soup.find("tbody") 
countries = table.find_all("tr")
startingIndex = None

# Here we will store the top ten countries' values (placeholders until real rows replace them)
total=[0]*10
names=[""]*10
recovered=[""]*10

# Compare each "new" country with the current top ten
for index,each in enumerate(countries[8:]):
    droebiti_list = each.text.split("\n")
    for j in range(10):
        if int(droebiti_list[2].replace(',','')) > total[j]:

            # Shift the lower-ranked entries down one slot to make room at position j
            for jj in reversed(range(j+1,10)):
                recovered[jj]=recovered[jj-1]
                names[jj]=names[jj-1]
                total[jj]=total[jj-1]

            recovered[j]=droebiti_list[6]
            names[j]=droebiti_list[1]
            total[j]=int(droebiti_list[2].replace(',',''))
            break

    print(f"{index}){droebiti_list[1]} - {droebiti_list[2]}") 

# Print the results    
for k in range(10):
    print(names[k],'\t\t\t',recovered[k])

Interesting output:

USA         36,254
Spain       64,727
Italy       35,435
France      27,718
Germany     64,300
UK          N/A
China       77,663
Iran        45,983
Turkey      3,957
Belgium     6,707
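
As a more compact alternative to the manual insertion loop above, the re-sorting step could be sketched with the standard library. This is only a sketch under assumptions: each row is taken to be already reduced to a (name, total, recovered) tuple of strings, the helper names to_int and top_by_total are hypothetical, and the sample rows are made up rather than taken from the page.

```python
import heapq

def to_int(text):
    """Convert a count such as '1,234' to int; treat blanks and 'N/A' as 0."""
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned.isdigit() else 0

def top_by_total(rows, n=10):
    """Return the n rows with the largest total, largest first.

    Each row is a (name, total, recovered) tuple of strings.
    """
    return heapq.nlargest(n, rows, key=lambda row: to_int(row[1]))

# Made-up sample rows, deliberately out of order.
sample = [
    ("Country B", "2,500", "300"),
    ("Country A", "10,000", "N/A"),
    ("Country C", "750", "40"),
]

for name, total, recovered in top_by_total(sample, n=2):
    print(f"{name}\t{total}\t{recovered}")
```

Sorting on the parsed total means the result no longer depends on the order in which the rows appear in the page source.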

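User

Answer 2 · edited 2024-09-30 08:30:49

I can't verify that this code runs correctly ("I'm doing this in the wrong environment"), but for cleaning up the data it should work: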

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.worldometers.info/coronavirus/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
all_rows = soup.find_all("tr")

for row in all_rows:  # As you said, this goes through all 'tr' elements
    ScrapedResult = []
    cells = row.find_all("td")  # Search for 'td' elements inside this row only, not the whole page
    for cell in cells:  # Go through the 'td' elements and keep only their text
        ScrapedResult.append(cell.getText())
    print(ScrapedResult)

You just need to adapt ScrapedResult to your needs.
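
One possible way to adapt ScrapedResult, sketched under assumptions: pair each cell text with a column name so later code can look values up by name instead of by position. The helper clean_row, the column names and the sample cell texts are all hypothetical and only illustrate the idea; the real table's columns would need to be checked against the page.

```python
def clean_row(cells, columns):
    """Pair cell texts with column names, stripping surrounding whitespace.

    Cells beyond the given column names are dropped, and rows with no
    cells (for example header rows made only of <th>) yield None.
    """
    if not cells:
        return None
    return {name: cell.strip() for name, cell in zip(columns, cells)}

# Hypothetical column names and cell texts, for illustration only.
columns = ["country", "total_cases", "new_cases"]
scraped = ["\nCountry A\n", " 10,000", "+120 ", "ignored"]

print(clean_row(scraped, columns))
```

Looking values up by name avoids the fixed list indices used in the question, which shift whenever a cell is empty or the layout changes.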
