Wikipedia scrape is missing data

import requests
from bs4 import BeautifulSoup
import pandas as pd

scrapeLink = 'https://en.wikipedia.org/wiki/Megacity'
page = requests.get(scrapeLink)

soup = BeautifulSoup(page.content, 'html.parser')
megaTable = soup.find_all('table')[1]

rowValList = []
for i in range(len(megaTable.find_all('td'))):
    rowVal = megaTable.find_all('td')[i].get_text()
    rowValList.append(rowVal)

cityList = []
for i in range(0, len(rowValList), 6):
    cityList.append(rowValList[i])

countryList = []
for i in range(1, len(rowValList), 6):
    countryList.append(rowValList[i])

contList = []
for i in range(2, len(rowValList), 6):
    contList.append(rowValList[i])

popList = []
for i in range(3, len(rowValList), 6):
    popList.append(rowValList[i])

megaDf = pd.DataFrame()
megaDf['City'] = cityList
megaDf['Country'] = countryList
megaDf['Continent'] = contList
megaDf['Population'] = popList
megaDf

1 answer

User

#1 · Posted 2024-05-19 21:14:09

The city names are missing because they are not inside td tags; they are inside th tags:

<th scope="row"><a href="/wiki/Bangalore" title="Bangalore">Bangalore</a></th>

The first td you are reading is actually the image column. To get the city name, select the th tag instead.
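To make the difference concrete, here is a minimal sketch on a hand-written HTML fragment. The fragment is an assumption that only mimics the layout described above (a `th` holding the city, followed by an image `td` and the other columns); it is not copied from the live page, and `placeholder.jpg` is a made-up file name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the row layout described above:
# the city name sits in <th scope="row">, the other columns in <td>.
html = """
<table>
  <tr>
    <th scope="row"><a href="/wiki/Bangalore" title="Bangalore">Bangalore</a></th>
    <td><img src="placeholder.jpg"></td>
    <td>India</td>
    <td>Asia</td>
  </tr>
</table>
"""

row = BeautifulSoup(html, "html.parser").find("tr")

# Collecting only <td> cells never sees the city name.
td_texts = [td.get_text(strip=True) for td in row.find_all("td")]

# The city name has to be read from the <th> cell.
city = row.find("th").get_text(strip=True)

print(td_texts)  # ['', 'India', 'Asia']
print(city)      # Bangalore
```

The empty string at index 0 is the image cell, which is why a td-only loop ends up with a blank first column.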

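You can also simplify the scraper by first getting the rows of the table and then, for each row, selecting only the tags you need, i.e. th and td: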

import requests
from bs4 import BeautifulSoup

scrapeLink = "https://en.wikipedia.org/wiki/Megacity"
page = requests.get(scrapeLink)


soup = BeautifulSoup(page.content, "html.parser")

megaTable = soup.find_all("table")[1]

cities = []
# [2:] skips the first 2 `tr` elements, which contain the headers
for row in megaTable.find_all("tr")[2:]:
    city = row.th.get_text().strip()
    tds = row.find_all("td")
    country = tds[1].get_text().strip()
    continent = tds[2].get_text().strip()
    population = tds[3].get_text().strip()
    cities.append({
        "city": city,
        "country": country,
        "continent": continent,
        "population": population,
    })

print(cities)
[
    {
        "city": "Bangalore",
        "country": "India",
        "continent": "Asia",
        "population": "12,200,00"
    },
    # and so on
]
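If the table contains rows without a th or with fewer td cells than expected, the loop above would raise an error. A more defensive variant might look like the sketch below. The helper name `parse_rows` and the inline HTML are assumptions for illustration; the cell values are taken from the sample output above, and the column positions follow the answer's indexing (tds[0] is the image).

```python
from bs4 import BeautifulSoup

def parse_rows(table):
    """Collect one dict per data row, skipping rows that lack the expected cells."""
    records = []
    for row in table.find_all("tr"):
        header = row.find("th", scope="row")  # data rows carry the city here
        tds = row.find_all("td")
        if header is None or len(tds) < 4:
            continue  # header row or incomplete row
        records.append({
            "city": header.get_text(strip=True),
            "country": tds[1].get_text(strip=True),
            "continent": tds[2].get_text(strip=True),
            "population": tds[3].get_text(strip=True),
        })
    return records

# Hypothetical table: one header row, one complete row, one incomplete row.
html = """
<table>
  <tr><th>City</th><th>Image</th><th>Country</th><th>Continent</th><th>Population</th></tr>
  <tr><th scope="row">Bangalore</th><td></td><td>India</td><td>Asia</td><td>12,200,00</td></tr>
  <tr><th scope="row">Incomplete</th><td></td></tr>
</table>
"""

records = parse_rows(BeautifulSoup(html, "html.parser").find("table"))
print(records)
```

Filtering on the row's own cells avoids hard-coding how many header rows to skip, at the cost of silently dropping rows that do not match the expected shape.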

You can then convert the list to a DataFrame:

import pandas as pd

df = pd.DataFrame(cities)
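The scraped population values are still strings with thousands separators, and Wikipedia cells sometimes carry footnote markers such as "[12]". A rough sketch for turning them into integers follows; `parse_population` is a hypothetical helper, not part of the original answer, and the sample strings are made up apart from "12,200,00", which is the value shown in the sample output.

```python
import re

# Hypothetical helper, not part of the original answer.
_FOOTNOTE = re.compile(r"\[[^\]]*\]")            # e.g. "[12]" or "[a]"
_NUMBER = re.compile(r"\d{1,3}(,\d{3})*|\d+")    # "1,234,567" or "1234567"

def parse_population(text):
    """Return the population as an int, or None if the text is not a clean number."""
    cleaned = _FOOTNOTE.sub("", text).strip()
    if not _NUMBER.fullmatch(cleaned):
        return None  # malformed grouping, empty cell, or non-numeric text
    return int(cleaned.replace(",", ""))

samples = ["1,234,567[12]", "1234567", "12,200,00", ""]
parsed = [parse_population(s) for s in samples]
print(parsed)  # [1234567, 1234567, None, None]
```

With the DataFrame from the answer, the same helper could be applied with `df["population"].map(parse_population)`. Returning None for malformed values keeps questionable cells visible instead of quietly converting them to a wrong number.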
