Why does BeautifulSoup always skip one specific <tr> element?

Posted 2024-09-30 08:30:49


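I am trying to scrape per-country coronavirus case data from Worldometers. For some reason I cannot target the specific tr tags by class (the classes are missing from those tags in the Python console, although they are present in Chrome DevTools). So I select all tr elements and then filter them. Everything works, except that for some strange reason China is missing from the top ten countries. China's HTML tag is no different from the others, yet I still cannot get it to show up. Any ideas?
'''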

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.worldometers.info/coronavirus/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
all_rows = soup.find_all("tr") 
startingIndex = None

for index, each in enumerate(all_rows, start=0):
    if "World" in each.text:  # The rows after the "World" row are the tr elements of individual countries.
        startingIndex = index
        break

top10 = all_rows[startingIndex+1:startingIndex+11]  # Select the top 10 countries that I need.

for index, each in enumerate(top10, start=1):
    droebiti_list = each.text.split("\n")
    print(f"{index}){droebiti_list[1]} - {droebiti_list[6]}")  # Print the number of recovered people

'''


2 Answers

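The order of the countries in the page source (the content variable) is different from the order shown in the table. The order is probably changed afterwards by a JavaScript script or something similar.

So you can collect all the data and re-sort it by total cases yourself.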

import requests
from bs4 import BeautifulSoup

# Download the page and parse it
r = requests.get("https://www.worldometers.info/coronavirus/")
contents = r.content
soup = BeautifulSoup(contents, "html.parser")
table = soup.find("tbody")
countries = table.find_all("tr")

# Running top ten, kept sorted by total cases in descending order
total = [0] * 10
names = [""] * 10
recovered = [""] * 10

# Compare each "new" country with the current top ten.
# The first 8 rows are skipped (an offset chosen for the page layout at the time).
for index, each in enumerate(countries[8:]):
    droebiti_list = each.text.split("\n")
    try:
        name = droebiti_list[1]
        cases = int(droebiti_list[2].replace(',', ''))
        recovered_count = droebiti_list[6]
    except (IndexError, ValueError):
        continue  # Not a country row with a numeric total

    for j in range(10):
        if cases > total[j]:
            # Shift the lower-ranked entries down by one position
            for jj in reversed(range(j + 1, 10)):
                recovered[jj] = recovered[jj - 1]
                names[jj] = names[jj - 1]
                total[jj] = total[jj - 1]

            recovered[j] = recovered_count
            names[j] = name
            total[j] = cases
            break

    print(f"{index}){droebiti_list[1]} - {droebiti_list[2]}")

# Print the results
for k in range(10):
    print(names[k], '\t\t\t', recovered[k])

Interesting output:

USA        36,254
Spain      64,727
Italy      35,435
France     27,718
Germany    64,300
UK         N/A
China      77,663
Iran       45,983
Turkey     3,957
Belgium    6,707
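
The re-sorting step itself can also be written with Python's built-in sorted. This is only a minimal sketch: parse_count and top_by_total are hypothetical helper names, the sample rows are made up rather than scraped, and treating blank or "N/A" cells as 0 is an assumption.

```python
def parse_count(text):
    """Convert a cell such as '1,234' to an int; blank or 'N/A' counts as 0."""
    digits = text.replace(",", "").replace("+", "").strip()
    return int(digits) if digits.isdigit() else 0


def top_by_total(rows, n=10):
    """Return the n rows with the largest total cases, largest first.

    Each row is a (name, total_cases, recovered) tuple of strings.
    """
    return sorted(rows, key=lambda row: parse_count(row[1]), reverse=True)[:n]


# Made-up sample rows in arbitrary order, standing in for scraped values
sample_rows = [
    ("Country A", "1,200", "300"),
    ("Country B", "85,000", "N/A"),
    ("Country C", "9,750", "4,100"),
]

for name, total, recovered in top_by_total(sample_rows, n=2):
    print(name, "\t", recovered)
```

Because the rows are sorted by value, the order in which they appear in the page source no longer matters.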

I can't guarantee that this code works (I'm in the wrong environment to test it), but for cleaning up the data it should do the job:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.worldometers.info/coronavirus/")
content = r.content
soup = BeautifulSoup(content, "html.parser")
all_rows = soup.find_all("tr")

for row in all_rows:  # As you said, this goes through all 'tr' elements
    ScrapedResult = []
    cells = row.find_all("td")  # Within each tr element, search for its own 'td' elements
    for cell in cells:  # Go through the td elements and collect their text
        ScrapedResult.append(cell.getText())
    print(ScrapedResult)

You just need to adapt ScrapedResult to your needs.
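
One possible way to adapt it, as a sketch only: collect each row's cell texts into a list of lists and build a lookup keyed by country name, so that a country such as China can be found regardless of row order. The function name recovered_by_country, the scraped_rows argument, and the column positions 0 (name) and 5 (recovered) are assumptions about the table layout, and the sample data is made up.

```python
def recovered_by_country(scraped_rows, name_col=0, recovered_col=5):
    """Map country name -> recovered figure from lists of cell texts."""
    result = {}
    for cells in scraped_rows:
        # Rows without enough <td> cells (for example header rows made of <th>)
        # produce short or empty lists, so skip them.
        if len(cells) <= max(name_col, recovered_col):
            continue
        result[cells[name_col].strip()] = cells[recovered_col].strip()
    return result


# Made-up rows shaped like the ScrapedResult lists, one list per tr element
sample = [
    [],
    ["Country A", "1,200", "+10", "30", "+1", "300"],
    [" Country B ", "85,000", "", "900", "", "N/A"],
]

lookup = recovered_by_country(sample)
print(lookup.get("Country B"))
```

If the real table has a different column order, only name_col and recovered_col need to change.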
