I am retrieving some content from a website that has several tables with the same number of columns, using pandas read_html().
In:
import multiprocessing
import pandas as pd

def process(url):
    df_url = pd.read_html(url)
    df = pd.concat(df_url, ignore_index=False)
    return df_url

links = ['link1.com','link2.com','link3.com',...,'linkN.com']
pool = multiprocessing.Pool(processes=6)
df = pool.map(process, links)
df
However, I think I am not specifying the columns correctly in read_html(), so I get this malformed list of lists:
Out:
[[ Form Disponibility \
0 290090 01780-500-01) Unavailable - no product available for release.
Relation \
Relation drawbacks
0 NaN Removed
1 NaN Removed ],
[ Form \
Relation \
0 American Regent is currently releasing the 0.4...
1 American Regent is currently releasing the 1mg...
drawbacks
0 Demand increase for the drug
1 Removed ,
Form \
0 0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...
Disponibility Relation \
0 Product available NaN
2 Removed
3 Removed ]]
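So my question is: which parameter should I change to get a single DataFrame from the nested list above? I tried header=0, index_col=0, and match='"columns"', and none of them worked. Or do I need to flatten the list when I create the pandas DataFrame with pd.DataFrame()? My main goal is to create a DataFrame with columns like the following: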
form, Disponibility, Relation, drawbacks
1
2
...
n
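You can do it as follows. First, have process return the concatenated DataFrame rather than the list of DataFrames (read_html returns a list of DataFrames, one per table). Then concatenate the results for all URLs into a single DataFrame.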
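The two steps can be sketched roughly as below. This is a minimal sketch, not necessarily the exact code the answer had in mind: it assumes every table on every page shares the same column names, the helper names combine and scrape are hypothetical and were not in the original, and ignore_index=True is my own choice so that rows are renumbered instead of repeating 0, 1, ... for each table.

```python
import multiprocessing

import pandas as pd


def process(url):
    # read_html returns a list of DataFrames, one per table found on the page,
    # so merge them into a single DataFrame before returning.
    tables = pd.read_html(url)
    return pd.concat(tables, ignore_index=True)


def combine(frames):
    # Hypothetical helper: stack the per-URL DataFrames into one DataFrame
    # and renumber the rows from 0.
    return pd.concat(frames, ignore_index=True)


def scrape(links, processes=6):
    # Hypothetical helper: pool.map returns one DataFrame per URL,
    # in the same order as links.
    with multiprocessing.Pool(processes=processes) as pool:
        frames = pool.map(process, links)
    return combine(frames)
```

Calling scrape(links) from under an if __name__ == "__main__": guard should then give one DataFrame instead of a list of lists. If the tables do not all have the same column names, concat will still run but will fill the missing columns with NaN.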