How do I reindex malformed columns retrieved from HTML?

Posted 2024-09-30 01:21:13

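I am retrieving some content from a website that has several tables with the same number of columns, using pandas read_html(). When I read a single link that actually contains several tables with the same number of columns, pandas effectively reads all of them as one table (something like a flat/normalized table). However, I also want to do the same for a list of links from a website (that is, one flat table covering several links), so I tried the following: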

In:

import multiprocessing
import pandas as pd

def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False) 
    return df_url

links = ['link1.com','link2.com','link3.com',...,'linkN.com']

pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df

However, I suspect I am not specifying the read_html() columns correctly, because I get this malformed list of lists:

Out:

[[                Form     Disponibility  \
  0  290090 01780-500-01)  Unavailable - no product available for release.   

                             Relation  \

     Relation drawbacks  
  0                  NaN                        Removed 
  1                  NaN                        Removed ],
 [                                        Form  \

                                   Relation  \
  0  American Regent is currently releasing the 0.4...   
  1  American Regent is currently releasing the 1mg...   

     drawbacks  
  0  Demand increase for the drug  
  1                         Removed ,
                                          Form  \
  0  0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...   

    Disponibility  Relation  \
  0                            Product available                  NaN   
  2                        Removed 
  3                        Removed ]]

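So my question is: which parameter should I change to get a single DataFrame out of the nested list above? I tried header=0, index_col=0 and match='"columns"', and none of them worked. Or do I need to flatten the list when I create the pandas DataFrame with pd.DataFrame()? My main goal is to end up with a DataFrame that has columns like these: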

form, Disponibility, Relation, drawbacks
1 
2
...
n

1 answer

User

Answer 1 · Posted 2024-09-30 01:21:13

You can do it this way:

First, return the concatenated DataFrame rather than the list of DataFrames (read_html() returns a list of DataFrames):

def process(url):
    return pd.concat(pd.read_html(url), ignore_index=False)

Then concatenate the results for all URLs:

df = pd.concat(pool.map(process, links), ignore_index=True)
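To see the two steps work end to end without touching the network, here is a minimal sketch. It assumes a hypothetical fake_read_html() helper that stands in for pd.read_html() and returns a list of DataFrames sharing the same columns. The sample rows and the plain list comprehension (used in place of pool.map) are illustrative assumptions, not part of the original question.

```python
import pandas as pd

COLUMNS = ["Form", "Disponibility", "Relation", "drawbacks"]


def fake_read_html(url):
    """Hypothetical stand-in for pd.read_html(url): one DataFrame per table on the page."""
    return [
        pd.DataFrame(
            [[f"{url} table A row 0", "Unavailable", "none listed", "Removed"]],
            columns=COLUMNS,
        ),
        pd.DataFrame(
            [
                [f"{url} table B row 0", "Available", "none listed", "Removed"],
                [f"{url} table B row 1", "Available", "none listed", "Removed"],
            ],
            columns=COLUMNS,
        ),
    ]


def process(url, reader=fake_read_html):
    # Step 1: collapse the list of tables for one URL into a single DataFrame.
    return pd.concat(reader(url), ignore_index=True)


links = ["link1.com", "link2.com"]

# Step 2: one DataFrame per URL, then one DataFrame overall.
# With a multiprocessing pool this line would be pool.map(process, links).
frames = [process(url) for url in links]
df = pd.concat(frames, ignore_index=True)

# Alternative when you already hold the nested list of lists:
# flatten it first, then concatenate once.
nested = [fake_read_html(url) for url in links]
flat_df = pd.concat([table for tables in nested for table in tables], ignore_index=True)

print(df.shape)
print(list(df.columns))
```

Passing ignore_index=True at the final step gives the combined frame a fresh 0..n-1 index instead of repeating each table's own 0, 1, ... labels. Flattening the nested list by hand is only needed if process() keeps returning the raw list from read_html().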
