Appending iteratively matched patterns from a PDF to a new column

Posted 2024-10-01 22:27:54


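I have a dataset containing species names. I want to retrieve the author of each species from a PDF file and add the new name (with the author) to a new column.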

I am having trouble adding each new name to my dataset iteratively. I tried append and concat, without success.

The table looks like this:

>>> pandas.read_csv('data.csv')[0:10]
   id_ref  id_sp                     species
0   20053  60645          Species Subspecies
1   20053  61094  Acantholimon lycopodioides
2   20053  61095        Achillea millefolium
3   20053  61096        Aconitum chasmanthum
4   20053  61097      Aconitum heterophyllum
5   20053  61098              Aconitum laeve
6   20053  61099      Aconitum rotundifolium
7   20053  61100          Aconitum violaceum
8   20053  61101          Aconogonon alpinum
9   20053  61102     Aconogonon rumicifolium

Here is my code so far:

from PyPDF2 import PdfFileReader
import pandas
import regex

table = pandas.read_csv('mydata.csv')
table['full_name'] = ''

tmp = []

pdf = 'myfile.pdf'
pdf_r = PdfFileReader(pdf)
page_rg = range(29, 225)
for p in page_rg:
    page = pdf_r.getPage(p)
    text = page.extractText()
    tmp.append(text)

full_text = ''.join(tmp)

for sp in table.species:
    sp_re = sp + r'\s+[(A-Z][^:(\/]+(?=\s)'
    if regex.search(sp_re, full_text):
        full_name = regex.findall(sp_re, full_text)
    else:
        full_name = ''
    # line of code to add the matched string in the 'full_name' column

Printing full_name inside the loop gives the following:


['Acantholimon lycopodioides (Girard) Boiss.']
['Achillea millefolium L.']
['Aconitum chasmanthum Stapf ex Holmes']
['Aconitum heterophyllum Wall. ex Royle']
['Aconitum laeve Royle']
['Aconitum rotundifolium Kar. & Kir.']
['Aconitum violaceum Jacquem. ex Stapf']
['Aconogonon alpinum (All.) Schur']
['Aconogonon rumicifolium (Royle ex Bab.) Hara']

The desired output is:

   id_ref  id_sp                     species                                     full_name
0   20053  60645          Species Subspecies          
1   20053  61094  Acantholimon lycopodioides    Acantholimon lycopodioides (Girard) Boiss.
2   20053  61095        Achillea millefolium                       Achillea millefolium L.
3   20053  61096        Aconitum chasmanthum          Aconitum chasmanthum Stapf ex Holmes
4   20053  61097      Aconitum heterophyllum         Aconitum heterophyllum Wall. ex Royle
5   20053  61098              Aconitum laeve                          Aconitum laeve Royle
6   20053  61099      Aconitum rotundifolium            Aconitum rotundifolium Kar. & Kir.
7   20053  61100          Aconitum violaceum          Aconitum violaceum Jacquem. ex Stapf
8   20053  61101          Aconogonon alpinum               Aconogonon alpinum (All.) Schur
9   20053  61102     Aconogonon rumicifolium  Aconogonon rumicifolium (Royle ex Bab.) Hara

1 answer
User
#1 · Posted 2024-10-01 22:27:54

You can modify the loop with enumerate and iloc, and fill the full_name column as you go. I have modified your loop in the code below to do this:

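for i, sp in enumerate(table.species):
    sp_re = sp + r'\s+[(A-Z][^:(\/]+(?=\s)'
    if regex.search(sp_re, full_text):
        # findall returns a list; keep only the first match as a string
        full_name = regex.findall(sp_re, full_text)[0]
    else:
        full_name = ''
    # Assign by position on the DataFrame itself; chained indexing
    # (table.full_name.iloc[i] = ...) may write to a temporary copy
    table.iloc[i, table.columns.get_loc('full_name')] = full_name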

Note that regex.findall returns a list, so full_name is a list whenever there is a match. To store only the string, assign full_name[0] to the DataFrame rather than full_name.
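As a rough sketch of the same idea without an explicit index, the matching step could be wrapped in a small helper that returns a plain string. The helper name first_full_name and the sample text are made up for illustration. The standard-library re module stands in for regex here, on the assumption that the pattern uses no regex-only features.

```python
import re

# Same suffix pattern as in the question: whitespace, then an author string
# starting with "(" or a capital letter, followed by a run of characters
# other than ":", "(" or "/", ending just before whitespace.
AUTHOR_SUFFIX = r'\s+[(A-Z][^:(\/]+(?=\s)'


def first_full_name(species, text):
    """Return the first 'species + author' match in text, or '' if none."""
    # re.escape guards against regex metacharacters in the species name
    match = re.search(re.escape(species) + AUTHOR_SUFFIX, text)
    return match.group(0) if match else ''


# Made-up stand-in for the text extracted from the PDF
sample_text = (
    "Acantholimon lycopodioides (Girard) Boiss. : 31\n"
    "Achillea millefolium L. : 32\n"
    "Aconitum laeve Royle : 33\n"
)

species = ["Species Subspecies", "Achillea millefolium", "Aconitum laeve"]
full_names = [first_full_name(sp, sample_text) for sp in species]
print(full_names)
```

With the DataFrame from the question, something like table['full_name'] = table['species'].map(lambda sp: first_full_name(sp, full_text)) should fill the column in one step, with no loop or iloc needed.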
