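I have a dataset containing species names. I want to retrieve the author of each species from a PDF file and add the new name (with its author) in a new column.
I am having trouble adding each new name to my dataset iteratively. I tried append and concat, but without success.
The table looks like this: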
>>> pandas.read_csv('data.csv')[0:10]
id_ref id_sp species
0 20053 60645 Species Subspecies
1 20053 61094 Acantholimon lycopodioides
2 20053 61095 Achillea millefolium
3 20053 61096 Aconitum chasmanthum
4 20053 61097 Aconitum heterophyllum
5 20053 61098 Aconitum laeve
6 20053 61099 Aconitum rotundifolium
7 20053 61100 Aconitum violaceum
8 20053 61101 Aconogonon alpinum
9 20053 61102 Aconogonon rumicifolium
Here is my code so far:
from PyPDF2 import PdfFileReader
import pandas
import regex

table = pandas.read_csv('mydata.csv')
table['full_name'] = ''
tmp = []
pdf = 'myfile.pdf'
pdf_r = PdfFileReader(pdf)
page_rg = range(29, 225)
for p in page_rg:
    page = pdf_r.getPage(p)
    text = page.extractText()
    tmp.append(text)
full_text = ''.join(tmp)
for sp in table.species:
    sp_re = sp + r'\s+[(A-Z][^:(\/]+(?=\s)'
    if regex.search(sp_re, full_text):
        full_name = regex.findall(sp_re, full_text)
    else:
        full_name = ''
    # line of code to add the matched string in the 'full_name' column
Printing full_name inside the loop produces the following:
['Acantholimon lycopodioides (Girard) Boiss.']
['Achillea millefolium L.']
['Aconitum chasmanthum Stapf ex Holmes']
['Aconitum heterophyllum Wall. ex Royle']
['Aconitum laeve Royle']
['Aconitum rotundifolium Kar. & Kir.']
['Aconitum violaceum Jacquem. ex Stapf']
['Aconogonon alpinum (All.) Schur']
['Aconogonon rumicifolium (Royle ex Bab.) Hara']
The desired output is:
id_ref id_sp species full_name
0 20053 60645 Species Subspecies
1 20053 61094 Acantholimon lycopodioides Acantholimon lycopodioides (Girard) Boiss.
2 20053 61095 Achillea millefolium Achillea millefolium L.
3 20053 61096 Aconitum chasmanthum Aconitum chasmanthum Stapf ex Holmes
4 20053 61097 Aconitum heterophyllum Aconitum heterophyllum Wall. ex Royle
5 20053 61098 Aconitum laeve Aconitum laeve Royle
6 20053 61099 Aconitum rotundifolium Aconitum rotundifolium Kar. & Kir.
7 20053 61100 Aconitum violaceum Aconitum violaceum Jacquem. ex Stapf
8 20053 61101 Aconogonon alpinum Aconogonon alpinum (All.) Schur
9 20053 61102 Aconogonon rumicifolium Aconogonon rumicifolium (Royle ex Bab.) Hara
You can rewrite the loop with enumerate and iloc so that the full_name column is filled in as the loop runs.
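A minimal sketch of such a loop follows. It makes a few assumptions that are not in the question: a small inline DataFrame and a short text string stand in for mydata.csv and the text extracted from the PDF, the standard-library re module is used in place of regex, and re.escape is added so that any special characters in a species name are matched literally.

```python
import re
import pandas as pd

# Hypothetical stand-ins for pandas.read_csv('mydata.csv') and the PDF text.
table = pd.DataFrame({
    "id_ref": [20053, 20053, 20053],
    "id_sp": [60645, 61094, 61095],
    "species": [
        "Species Subspecies",
        "Acantholimon lycopodioides",
        "Achillea millefolium",
    ],
})
full_text = (
    "Acantholimon lycopodioides (Girard) Boiss. : 12\n"
    "Achillea millefolium L. : 13\n"
)

table["full_name"] = ""
# iloc is purely positional, so look up the column position once.
col = table.columns.get_loc("full_name")

for i, sp in enumerate(table["species"]):
    # Same pattern as in the question, with the species name escaped.
    sp_re = re.escape(sp) + r"\s+[(A-Z][^:(\/]+(?=\s)"
    matches = re.findall(sp_re, full_text)
    if matches:
        # findall returns a list; keep only the first matched string.
        table.iloc[i, col] = matches[0]

print(table)
```

Rows whose species name is not found in the text keep the empty string assigned before the loop.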
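Judging from your question, full_name is probably a list. In that case, assign full_name[0] instead of full_name to the table DataFrame, so that you get only the string inside the list.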
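To illustrate why the [0] is needed, here is a small sketch using the standard-library re module (an assumption; the question uses regex, whose findall and search behave the same way for this pattern). The sample text string is made up for the example.

```python
import re

# Hypothetical one-line sample of the extracted PDF text.
text = "Aconitum laeve Royle : 45"
pattern = re.escape("Aconitum laeve") + r"\s+[(A-Z][^:(\/]+(?=\s)"

# findall returns a list of every matched string, even for a single match.
matches = re.findall(pattern, text)
first = matches[0] if matches else ""

# search returns a match object (or None); group(0) is the matched string.
m = re.search(pattern, text)
via_search = m.group(0) if m else ""

print(matches, first, via_search)
```

Either form yields a plain string that can be stored in a single DataFrame cell.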