Missing data when scraping a Wikipedia table

Posted 2024-05-19 21:14:09


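I am trying to extract the table from https://en.wikipedia.org/wiki/Megacity as my first attempt at web scraping (in full transparency, I took this code from a blog I read). I get the items, but the city is missing and every field ends with \n. Question: why does every field end with \n, and why is my first field (City) empty? Part of the code and the output are listed below.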

import requests
scrapeLink = 'https://en.wikipedia.org/wiki/Megacity'
page = requests.get(scrapeLink)

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

megaTable = soup.find_all('table')[1]


rowValList = []    
for i in range(len(megaTable.find_all('td'))):
    rowVal = megaTable.find_all('td')[i].get_text()
    rowValList.append(rowVal)

cityList = []
for i in range(0, len(rowValList), 6):
    cityList.append(rowValList[i])

countryList = []
for i in range(1, len(rowValList), 6):
    countryList.append(rowValList[i])

contList = []
for i in range(2, len(rowValList), 6):
    contList.append(rowValList[i])

popList = []
for i in range(3, len(rowValList), 6):
    popList.append(rowValList[i])

import pandas as pd

megaDf = pd.DataFrame()
megaDf['City'] = cityList
megaDf['Country'] = countryList
megaDf['Continent'] = contList
megaDf['Population'] = popList
megaDf

Output


1 answer
User
#1 · Posted 2024-05-19 21:14:09

The reason is that the city is not inside a td tag but inside a th tag:

<th scope="row"><a href="/wiki/Bangalore" title="Bangalore">Bangalore</a></th>

The first td you are reading is actually the image column. You can select the city name by getting the th tag instead.
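A small offline check makes the difference visible. The snippet below parses a hand-written row (the HTML is a simplified, hypothetical stand-in, not copied from the live page) and compares what find_all("td") and find("th") return. It also shows where the trailing \n comes from: it is part of the raw cell text, and strip() removes it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified row modelled on the structure described above.
SAMPLE_ROW = """
<table>
<tr>
<th scope="row"><a href="/wiki/Bangalore" title="Bangalore">Bangalore</a>
</th>
<td>(image)
</td>
<td>India
</td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tr")

# Only the td cells: the city name is not among them.
td_texts = [td.get_text() for td in row.find_all("td")]
# The city lives in the th cell of the same row.
th_text = row.find("th").get_text()

print(td_texts)         # raw td texts, each still ending in a newline
print(repr(th_text))    # raw th text, also ending in a newline
print(th_text.strip())  # strip() removes the surrounding whitespace
```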

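In addition, you can simplify the scraper by first getting the rows of the table and then selecting the necessary tags for each row, namely th and td. Calling strip() on each cell's text also removes the trailing \n.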

import requests
from bs4 import BeautifulSoup

scrapeLink = "https://en.wikipedia.org/wiki/Megacity"
page = requests.get(scrapeLink)


soup = BeautifulSoup(page.content, "html.parser")

megaTable = soup.find_all("table")[1]

cities = []
# [2:] skips the first two `tr` rows, which contain the table headers
for row in megaTable.find_all("tr")[2:]:
    city = row.th.get_text().strip()
    tds = row.find_all("td")
    country = tds[1].get_text().strip()
    continent = tds[2].get_text().strip()
    population = tds[3].get_text().strip()
    cities.append({
        "city": city,
        "country": country,
        "continent": continent,
        "population": population,
    })

print(cities)
[
    {
        "city": "Bangalore",
        "country": "India",
        "continent": "Asia",
        "population": "12,200,00"
    },
    # and so on
]
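
The fixed [2:] slice assumes the table always starts with exactly two header rows. A slightly more defensive variant, sketched here on an inline table so that it runs without network access, skips any row that has no th or too few td cells. The function name parse_rows, the sample HTML, and the placeholder values are assumptions for illustration; the column positions (image, country, continent, population) are taken from the code above and are not verified against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical table with one header row and one data row of placeholder values.
SAMPLE_TABLE = """
<table>
<tr><th>City</th><th>Image</th><th>Country</th><th>Continent</th><th>Population</th></tr>
<tr><th scope="row">Example City
</th><td>(image)
</td><td>Exampleland
</td><td>Asia
</td><td>1,234,567
</td></tr>
</table>
"""

def parse_rows(table):
    """Return one dict per data row, skipping header-only or short rows."""
    records = []
    for row in table.find_all("tr"):
        header = row.find("th")
        cells = row.find_all("td")
        if header is None or len(cells) < 4:
            continue  # header row (no td cells) or malformed row
        records.append({
            "city": header.get_text(strip=True),
            "country": cells[1].get_text(strip=True),
            "continent": cells[2].get_text(strip=True),
            "population": cells[3].get_text(strip=True),
        })
    return records

table = BeautifulSoup(SAMPLE_TABLE, "html.parser").find("table")
rows = parse_rows(table)
print(rows)
```

Filtering on the row contents rather than on a fixed index keeps the loop working if the page gains or loses a header row, at the cost of silently dropping rows that do not match the expected shape.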

You can then convert the list to a DataFrame:

import pandas as pd
df = pd.DataFrame(cities)
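
The scraped population values are still strings. If numbers are needed, one possible cleanup step is sketched below, assuming the cells contain comma-grouped digits that may be followed by a footnote marker such as [3]. parse_population is a hypothetical helper written for this example; it is not part of pandas or BeautifulSoup.

```python
import re

def parse_population(text):
    """Convert a string like '1,234,567[3]' to an int; return None if it has no digits."""
    without_notes = re.sub(r"\[[^\]]*\]", "", text)  # drop footnote markers such as [3]
    digits = re.sub(r"[^\d]", "", without_notes)      # drop commas, spaces, newlines
    return int(digits) if digits else None

print(parse_population("1,234,567[3]\n"))
```

The helper could then be applied to the population column of the DataFrame with Series.map.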
