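I'm trying to write a function that takes a list of URLs that pass through 301 redirect hops and flattens them for me. I want to save the resulting list as a CSV so I can hand it to the developers, who can implement it and get rid of the 301 hops.
For example, my crawler produces a list of 301 hops like this: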
URL1 | URL2 | URL3 | URL4
example.com/url1 | example.com/url2 | |
example.com/url3 | example.com/url4 | example.com/url5 |
example.com/url6 | example.com/url7 | example.com/url8 | example.com/10
example.com/url9 | example.com/url7 | example.com/url8 |
example.com/url23 | example.com/url10 | |
example.com/url24 | example.com/url45 | example.com/url46 |
example.com/url25 | example.com/url45 | example.com/url46 |
example.com/url26 | example.com/url45 | example.com/url46 |
example.com/url27 | example.com/url45 | example.com/url46 |
example.com/url28 | example.com/url45 | example.com/url46 |
example.com/url29 | example.com/url45 | example.com/url46 |
example.com/url30 | example.com/url45 | example.com/url46 |
The result I want is:
URL1 | URL2
example.com/url1 | example.com/url2
example.com/url3 | example.com/url5
example.com/url4 | example.com/url5
example.com/url6 | example.com/10
example.com/url7 | example.com/10
example.com/url8 | example.com/10
example.com/url23 | example.com/url10
...
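I have converted the Pandas DataFrame to a list of lists with the following code:
import pandas as pd
import numpy as np

# The sheet has no header row, so every line is treated as data.
csv1 = pd.read_csv('Example_301_sheet.csv', header=None)

def link_flat(csv):
    # Build the list inside the function so repeated calls don't accumulate rows.
    outlist = []
    for index, data in csv.iterrows():
        outlist.append(data.tolist())
    return outlist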
This returns each row as a list, all nested inside one outer list, like so:
[['example.com/url1', 'example.com/url2', nan, nan],
['example.com/url3', 'example.com/url4', 'example.com/url5', nan],
['example.com/url6',
'example.com/url7',
'example.com/url8',
'example.com/10'],
['example.com/url9', 'example.com/url7', 'example.com/url8', nan],
['example.com/url23', 'example.com/url10', nan, nan],
['example.com/url24', 'example.com/url45', 'example.com/url46', nan],
['example.com/url25', 'example.com/url45', 'example.com/url46', nan],
['example.com/url26', 'example.com/url45', 'example.com/url46', nan],
['example.com/url27', 'example.com/url45', 'example.com/url46', nan],
['example.com/url28', 'example.com/url45', 'example.com/url46', nan],
['example.com/url29', 'example.com/url45', 'example.com/url46', nan],
['example.com/url30', 'example.com/url45', 'example.com/url46', nan]]
How can I match each URL in each nested list with the last URL in the same list, to produce the list above?
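For reference, the pairing asked about here can be sketched in plain Python directly on the nested lists. This is only a minimal sketch: the names is_missing and flatten_chains are hypothetical, and it assumes missing cells arrive as nan (or None), as in the output shown above.

```python
import math

def is_missing(value):
    # Empty cells read by pandas show up as float nan; None is treated the same.
    return value is None or (isinstance(value, float) and math.isnan(value))

def flatten_chains(rows):
    """Pair every non-final URL in a row with the last valid URL of that row."""
    pairs = []
    seen = set()
    for row in rows:
        chain = [url for url in row if not is_missing(url)]
        if len(chain) < 2:
            continue  # nothing to redirect
        final = chain[-1]
        for url in chain[:-1]:
            if (url, final) not in seen:  # skip pairs already emitted
                seen.add((url, final))
                pairs.append((url, final))
    return pairs

nan = float("nan")
rows = [
    ["example.com/url1", "example.com/url2", nan, nan],
    ["example.com/url3", "example.com/url4", "example.com/url5", nan],
    ["example.com/url6", "example.com/url7", "example.com/url8", "example.com/10"],
]
pairs = flatten_chains(rows)
```

Each tuple in pairs is (source URL, final URL), in the order the chains appear.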
You need to use groupby + last to find the last valid item in each row, then reshape the DataFrame and use melt to build the two-column mapping.
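A rough sketch of that idea follows. Note that it swaps in a forward fill across columns (ffill(axis=1)) in place of groupby + last to find the last valid item per row, since the exact groupby call is not given here. The column names URL1..URL4 and the variable names final, long_form and mapping are assumptions for illustration, and the inline sample stands in for the CSV file.

```python
import pandas as pd

# Hypothetical sample standing in for Example_301_sheet.csv.
df = pd.DataFrame(
    [
        ["example.com/url1", "example.com/url2", None, None],
        ["example.com/url3", "example.com/url4", "example.com/url5", None],
        ["example.com/url6", "example.com/url7", "example.com/url8", "example.com/10"],
        ["example.com/url23", "example.com/url10", None, None],
    ],
    columns=["URL1", "URL2", "URL3", "URL4"],
)

# Last valid URL per row: forward-fill across the columns, then read the last column.
final = df.ffill(axis=1).iloc[:, -1]

# Reshape to long form: one row per (final URL, hop URL), dropping the empty cells.
long_form = (
    df.assign(final=final)
      .melt(id_vars=["final"], value_name="source")
      .dropna(subset=["source"])
)

# Keep only real redirects (source differs from final) as a two-column mapping.
mapping = (
    long_form.loc[long_form["source"] != long_form["final"], ["source", "final"]]
             .drop_duplicates()
             .reset_index(drop=True)
)
```

The mapping frame can then be written out with mapping.to_csv(...) using index=False. Rows come out grouped by original column rather than by chain, and a URL that appears in several chains with different endpoints (example.com/url7 in the full sample) would produce one row per endpoint.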