Web scraping with Python (BeautifulSoup): multiple pages and subpages

Posted 2024-06-02 16:06:19


I create the soup as follows:

import pandas as pd 
import requests
from bs4 import BeautifulSoup
import os
import string

for i in string.ascii_uppercase[:27]:
    url = "https://myanimelist.net/anime.php?letter={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

I am trying to build a DataFrame from this site, https://myanimelist.net. As a first step I want the anime title, eps, and type.

Second, from each anime's detail page (for example https://myanimelist.net/anime/2928/hack__GU_Returner), I want to collect the scores assigned by users.

Could you help me collect all of this information?

If my request is unclear, let me know.


1 answer
User
#1 · Posted 2024-06-02 16:06:19

You can do this directly in pandas with the read_html() function:

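import pandas as pd
import string

df = pd.DataFrame()

# Only the first letter is fetched while testing; drop the [:1] slice to cover A-Z.
for i in string.ascii_uppercase[:1]:
    url = "https://myanimelist.net/anime.php?letter={}".format(i)
    print(url)
    # read_html returns a list with one DataFrame per table found on the page
    tables = pd.read_html(url, header=0)

    if df.empty:
        df = tables[2]
    else:
        df = pd.concat([df, tables[2]])

print(df)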

This returns a list of all the tables found at the given URL. You only need the one at index 2 (tables[2]). That gives you a DataFrame to start from.
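The loop above calls pd.concat once per letter and keeps each page's original index, so the combined frame ends up with repeated index labels. A minimal sketch of an alternative, assuming every per-letter table has the same columns: collect the frames in a list and concatenate once with ignore_index=True. The helper name combine_tables and the two small frames are made up for illustration; they stand in for the tables[2] results and are not real data from the site.

```python
import pandas as pd


def combine_tables(frames):
    """Concatenate per-letter tables into one DataFrame with a fresh 0..n-1 index."""
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Hypothetical stand-ins for the tables[2] result of two letter pages.
page_a = pd.DataFrame({"Title": ["A1", "A2"], "Type": ["TV", "OVA"]})
page_b = pd.DataFrame({"Title": ["B1"], "Type": ["Movie"]})

combined = combine_tables([page_a, page_b])
print(combined)
```

In the real loop, each tables[2] would be appended to a list and combine_tables called once after the loop finishes.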

To do this with BeautifulSoup, you could use the following approach:

from bs4 import BeautifulSoup
import pandas as pd
import string
import requests

columns = ['Title', 'Type', 'Eps.', 'Score']
df = pd.DataFrame()

for i in string.ascii_uppercase:
    url = "https://myanimelist.net/anime.php?letter={}".format(i)

    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find_all('table')[2]

    # Skip the header row, then read one title per table row
    for tr in table.find_all('tr')[1:]:
        row = [td.get_text(strip=True) for td in tr.find_all('td')[1:5]]
        url_sub = tr.find('a')['href']    # link to the title's detail page
        print(url_sub)

        r_sub = requests.get(url_sub)
        soup_sub = BeautifulSoup(r_sub.text, 'html.parser')

        all_scores = []     # each title has multiple lists of scores

        # Select all of the user assigned score tables
        for div in soup_sub.select('div.spaceit.textReadability.word-break.pt8.mt8'):
            scores = []     # scores for one block

            for tr_sub in div.div.table.find_all('tr'):
                scores.append([td_sub.text for td_sub in tr_sub.find_all('td')])
            all_scores.append(scores)

        print(all_scores)    # These probably need adding to the row. Not all have scores.

        df_row = pd.DataFrame([row], columns=columns)

        if df.empty:
            df = df_row
        else:
            df = pd.concat([df, df_row])

print(df)

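For each title, a list of every score block found is collected in all_scores, although it is not clear how these should be added to the main DataFrame.

For example, the scores could look like this: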

https://myanimelist.net/anime/320/A_Kite
[[[u'Overall', u'8'], [u'Story', u'8'], [u'Animation', u'7'], [u'Sound', u'7'], [u'Character', u'7'], [u'Enjoyment', u'8']], [[u'Overall', u'8'], [u'Story', u'8'], [u'Animation', u'10'], [u'Sound', u'0'], [u'Character', u'7'], [u'Enjoyment', u'10']], [[u'Overall', u'7'], [u'Story', u'7'], [u'Animation', u'8'], [u'Sound', u'6'], [u'Character', u'7'], [u'Enjoyment', u'8']], [[u'Overall', u'2'], [u'Story', u'2'], [u'Animation', u'2'], [u'Sound', u'2'], [u'Character', u'2'], [u'Enjoyment', u'2']]]
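One possible way to get the nested all_scores list into a shape that fits a DataFrame is to flatten each score block into a dict, one per review. The sketch below is only an assumption about what is wanted, not something the answer specifies: the helper names scores_to_records and average_score are hypothetical, and the sample input is the first two blocks of the A_Kite output shown above.

```python
def scores_to_records(url, all_scores):
    """Flatten nested [[label, value], ...] blocks into one dict per review."""
    records = []
    for review_no, block in enumerate(all_scores, start=1):
        record = {"url": url, "review": review_no}
        for pair in block:
            if len(pair) != 2:
                continue  # skip rows that are not a label/value pair
            label, value = pair[0].strip(), pair[1].strip()
            # Non-numeric cells are stored as None rather than guessed at
            record[label] = int(value) if value.isdigit() else None
        records.append(record)
    return records


def average_score(records, label):
    """Mean of one score category across reviews, or None if there are no values."""
    values = [r[label] for r in records if r.get(label) is not None]
    return sum(values) / len(values) if values else None


# First two score blocks from the A_Kite example output.
sample = [
    [['Overall', '8'], ['Story', '8'], ['Animation', '7'],
     ['Sound', '7'], ['Character', '7'], ['Enjoyment', '8']],
    [['Overall', '8'], ['Story', '8'], ['Animation', '10'],
     ['Sound', '0'], ['Character', '7'], ['Enjoyment', '10']],
]

records = scores_to_records("https://myanimelist.net/anime/320/A_Kite", sample)
print(records[0])
print(average_score(records, "Sound"))
```

Passing records to pd.DataFrame(records) would give one row per review. If a url column were also stored in the main DataFrame, the two frames could be joined on it, or a per-title average could be added to row before df_row is built.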
