Given a list containing many junk links, how can I extract only the links that end in .pdf?

Published 2024-10-03 15:28:59


I have a DataFrame column in which each cell holds several links:

Name|COL
San Diego|'https://foo.com/energy_docs/tyv/2004/019787_S30_gasTOC.cfm https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/99/293-_9302SDFS 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf https://foo.com/energy_docs/tyv/98/019787-S16_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/019787-S15_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/019787-S14_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf https://foo.com/energy_docs/tyv/96/019787-S12_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/019787-S11_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/019787-S10_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S9_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S8_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/96/19-787s007_Amlodipine.cfm https://foo.com/energy_docs/tyv/pre96/019787-S6_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S5_gas 2.5 KM, 5.0 KM, & 10.0 KM GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S4_gas GAS_TPC.cfm https://foo.com/energy_docs/tyv/pre96/019787-S3_gas_toc.cfm https://foo.com/energy_docs/tyv/pre96/019787-S2_gas GAS_TPC.cfm'
Washington|'https://foo.com/energy_docs/a32/2007/022136.cfm'
Texas|'https://foo.com/energy/29380/no_ant/USA/2/2007.pdf'

How can I extract all links ending in .pdf, one per row, like this:

Name|COL
San Diego|https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf
San Diego|https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf
San Diego|https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
San Diego|https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf 
San Diego|https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf 
San Diego|https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf
Washington|NaN
Texas|https://foo.com/energy/29380/no_ant/USA/2/2007.pdf

I tried:


import re

def url_extractor(row):
    url = str(row)
    # match runs of non-whitespace that start with http and end in .pdf
    r = re.compile(r'(http[^\s]+\.pdf)')
    urls = r.findall(url)
    if len(urls) == 0:
        return 'NaN'
    else:
        return ' '.join(urls)

In:

df4['COL'] = df4['COL'].apply(url_extractor)
df4

Out:

    Name    COL
0   San Diego   https://foo.com/energy_docs/tyv/99/19787s022_g...
1   Washington  NaN
2   Texas   https://foo.com/energy/29380/no_ant/USA/2/2007...

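However, I can't work out how to do the stacking/row-splitting part so that I end up with one link/URL per row. For example, let's inspect the first row: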

In:

df4['COL'][0]

Out:

'https://foo.com/energy_docs/tyv/99/19787s022_gas.pdfhttps://foo.com/energy_docs/tyv/2000/19787s021_gas.pdfhttps://foo.com/energy_docs/tyv/2000/19787-s017_report.pdfhttps://foo.com/energy_docs/tyv/99/19787-s018_gas.pdfhttps://foo.com/energy_docs/tyv/2000/19787-s017_report.pdfhttps://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf'

Each link should be "mapped" to its name, San Diego.


2 answers

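If this is already loaded into a pandas DataFrame, you can use pandas' built-in string methods to split the strings in COL into lists, keep only the wanted elements of each list, reshape the column of lists into one long Series, and then merge that back with the original DataFrame: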

import pandas as pd

# split COL on whitespace and keep only the tokens that end in '.pdf'
COL_series = df.COL.str.split().apply(lambda x: [y for y in x if y.endswith('.pdf')])
# create a long-format series from the lists (rows with no match drop out here)
COL_series = COL_series.apply(pd.Series).stack().reset_index(level=1, drop=True)
COL_series.name = 'COL'

# outer-merge with df so that names without any .pdf link come back as NaN
pd.merge(df.Name.reset_index(),
         COL_series.reset_index(),
         how='outer',
         on='index').drop('index', axis=1)

# returns:
        Name                                                         COL
0  San Diego        https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf
1  San Diego      https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf
2  San Diego  https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
3  San Diego       https://foo.com/energy_docs/tyv/99/19787-s018_gas.pdf
4  San Diego  https://foo.com/energy_docs/tyv/2000/19787-s017_report.pdf
5  San Diego       https://foo.com/energy_docs/tyv/97/19787-S013_gas.pdf
6 Washington                                                         NaN
7      Texas          https://foo.com/energy/29380/no_ant/USA/2/2007.pdf
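
For comparison, here is a minimal sketch of the same reshaping using `Series.str.findall` followed by `DataFrame.explode`, assuming a pandas version that provides `explode` (0.25 or later). It is not the method from the answer above, just a shorter variant. The three-row `df` is a cut-down copy of the question's data, and the pattern `http\S+?\.pdf\b` is an assumption chosen for this sample.

```python
import pandas as pd

# Cut-down version of the question's data (only a few links per cell).
df = pd.DataFrame({
    "Name": ["San Diego", "Washington", "Texas"],
    "COL": [
        "https://foo.com/energy_docs/tyv/2004/019787_S30_gasTOC.cfm "
        "https://foo.com/energy_docs/tyv/99/19787s022_gas.pdf "
        "https://foo.com/energy_docs/tyv/2000/19787s021_gas.pdf",
        "https://foo.com/energy_docs/a32/2007/022136.cfm",
        "https://foo.com/energy/29380/no_ant/USA/2/2007.pdf",
    ],
})

result = (
    # findall turns each cell into a list of the .pdf links it contains
    df.assign(COL=df["COL"].str.findall(r"http\S+?\.pdf\b"))
      # explode gives one row per list element; an empty list becomes NaN
      .explode("COL")
      .reset_index(drop=True)
)
print(result)
```

Because `explode` turns an empty list into a single NaN row, Washington is kept without a separate merge step.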

You should use [^\s], or the shorter \S, instead of [^<]. Then follow it with \.pdf.

(http\S+\.pdf)


Edit:

Yes, you can also use word boundaries if you like.

(\bhttp.*?\.pdf\b)
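
The two patterns do not behave identically when a non-PDF link comes before a PDF link in the same string, so it may be worth checking both against your own data. The sketch below uses only the standard `re` module; the sample `text` joins two URLs taken from the question, and the variable names are hypothetical.

```python
import re

# A .cfm link followed by a .pdf link, separated by one space.
text = (
    "https://foo.com/energy_docs/a32/2007/022136.cfm "
    "https://foo.com/energy/29380/no_ant/USA/2/2007.pdf"
)

# \S+ cannot cross whitespace, so a match stays inside a single link.
no_space_pattern = re.compile(r"http\S+\.pdf")
# .*? can cross whitespace, so a match may start at an earlier link.
boundary_pattern = re.compile(r"\bhttp.*?\.pdf\b")

no_space_matches = no_space_pattern.findall(text)
boundary_matches = boundary_pattern.findall(text)

print(no_space_matches)   # only the .pdf link
print(boundary_matches)   # one match spanning both links
```

On this input the `\S` version returns just the PDF link, while the `.*?` version starts at the first `http` and runs through the `.cfm` link until it reaches `.pdf`. With data like the question's, the `\S` form appears to be the safer choice.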

