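I'm currently working on a program that extracts text from tens of thousands of court-opinion PDFs. I'm fairly new to Python and am trying to make this code as efficient as possible. From many posts on this site and elsewhere, I gathered that I should try to vectorize my code, but I have tried three approaches and none of them worked.
My reprex uses these packages and this sample data: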
import os
import pandas as pd
import pdftotext
import wget
df = pd.DataFrame({'OpinionText': [""], 'URLs': ["https://cases.justia.com/federal/appellate-courts/ca6/20-6226/20-6226-2021-09-17.pdf?ts=1631908842"]})
df = pd.concat([df]*50, ignore_index=True)
I first defined this function, which downloads the PDF, extracts the text, deletes the PDF, and then returns the text:
def Link2Text(Link):
    OpinionPDF = wget.download(Link, "Temporary_Opinion.pdf")
    with open(OpinionPDF, "rb") as f:
        pdf = pdftotext.PDF(f)
    OpinionText = "\n\n".join(pdf)
    if os.path.exists("Temporary_Opinion.pdf"):
        os.remove("Temporary_Opinion.pdf")
    return OpinionText
The first way I called the function was:
df['OpinionText'] = df['URLs'].apply(Link2Text)
Based on what I had read about vectorization, I then tried calling the function like this:
df['OpinionText'] = Link2Text(df['URLs'])
#and, alternatively:
df['OpinionText'] = Link2Text(df['URLs'].values)
Both returned the same error:
Traceback (most recent call last):
File "/Users/brendanbernicker/Downloads/Reprex for SO Vectorization Q.py", line 22, in <module>
df['OpinionText'] = Link2Text(df['URLs'])
File "/Users/brendanbernicker/Downloads/Reprex for SO Vectorization Q.py", line 10, in Link2Text
OpinionPDF = wget.download(Link, "Temporary_Opinion.pdf")
File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 505, in download
prefix = detect_filename(url, out)
File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 483, in detect_filename
if url:
File "/Applications/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py", line 1442, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
[Finished in 0.683s]
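The last frames of that traceback show where the call fails: wget evaluates `if url:`, and a Series refuses to be reduced to a single True or False. The failure can be reproduced without any download. This is a minimal sketch in which `AmbiguousColumn` and `fetch_one` are hypothetical stand-ins (they are not pandas or wget code) that only mimic the relevant behavior:

```python
class AmbiguousColumn:
    """Hypothetical stand-in for a pandas Series (not pandas code)."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        # Like a Series, refuse to collapse many values into one True/False.
        raise ValueError("The truth value of a column is ambiguous.")

    def __iter__(self):
        return iter(self.values)


def fetch_one(url):
    """Hypothetical stand-in for Link2Text: written for ONE url string."""
    if url:  # same kind of truth test as `if url:` in wget's detect_filename
        return "text of " + url
    return ""


column = AmbiguousColumn(["https://example.com/a.pdf", "https://example.com/b.pdf"])

# Passing the whole column calls the function once, with the container itself.
try:
    fetch_one(column)
    whole_column_error = None
except ValueError as exc:
    whole_column_error = str(exc)

# Calling the function once per element works; this is in effect what
# Series.apply does with a plain Python function.
per_element = [fetch_one(url) for url in column]
```

Passing a whole column to a function written for one value does not make the function loop over the column; the function simply receives the container as its argument.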
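I assumed this meant that Python did not know how to handle the input because it is a vector, so I tried replacing the call with the one below and got this traceback: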
df['OpinionText'] = Link2Text(df['URLs'].item)
Traceback (most recent call last):
File "/Users/brendanbernicker/Downloads/Reprex for SO Vectorization Q.py", line 22, in <module>
df['OpinionText'] = Link2Text(df['URLs'].item)
File "/Users/brendanbernicker/Downloads/Reprex for SO Vectorization Q.py", line 10, in Link2Text
OpinionPDF = wget.download(Link, "Temporary_Opinion.pdf")
File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 505, in download
prefix = detect_filename(url, out)
File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 484, in detect_filename
names["url"] = filename_from_url(url) or ''
File "/Applications/anaconda3/lib/python3.8/site-packages/wget.py", line 230, in filename_from_url
fname = os.path.basename(urlparse.urlparse(url).path)
File "/Applications/anaconda3/lib/python3.8/urllib/parse.py", line 372, in urlparse
url, scheme, _coerce_result = _coerce_args(url, scheme)
File "/Applications/anaconda3/lib/python3.8/urllib/parse.py", line 124, in _coerce_args
return _decode_args(args) + (_encode_result,)
File "/Applications/anaconda3/lib/python3.8/urllib/parse.py", line 108, in _decode_args
return tuple(x.decode(encoding, errors) if x else '' for x in args)
File "/Applications/anaconda3/lib/python3.8/urllib/parse.py", line 108, in <genexpr>
return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'function' object has no attribute 'decode'
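The `'function' object` in that message points at the argument: `df['URLs'].item` without parentheses is the method itself, not a URL string, so the URL parser ends up trying to decode a callable. Here is a minimal sketch of the same mistake using only the standard library; `FakeSeries` and `describe` are hypothetical names introduced for illustration:

```python
from urllib.parse import urlparse


class FakeSeries:
    """Hypothetical stand-in for a one-element pandas Series."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def describe(arg):
    """Return the URL path, or the error text if parsing fails."""
    try:
        return urlparse(arg).path
    except AttributeError as exc:
        return "AttributeError: " + str(exc)


urls = FakeSeries("https://example.com/opinion.pdf")

without_call = describe(urls.item)   # passes the bound method itself
with_call = describe(urls.item())    # passes the string the method returns
```

Adding the parentheses would not vectorize the real call either: `Series.item()` only works on a Series with exactly one element, so on the 50-row reprex it would raise a `ValueError` instead.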
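I tried adding .decode('utf-8') to my function call, and to the input inside the function, but got the same traceback. At this point I don't know what else to try to speed up my code.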
I also tried numpy.vectorize with the working .apply version, but it slowed execution down considerably. I assume the two are not meant to be used together.
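NumPy's documentation notes that numpy.vectorize is provided for convenience rather than performance and is essentially a for loop, which fits the slowdown described above. Assuming the download dominates the runtime of each call, one commonly suggested alternative is to overlap the waiting with a thread pool. The sketch below is illustrative only: `fake_link_to_text` is a hypothetical stand-in that sleeps instead of downloading, and the URLs and worker count are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_link_to_text(url):
    """Hypothetical stand-in for Link2Text: sleeps instead of downloading."""
    time.sleep(0.05)  # simulated network wait
    return "text from " + url


urls = ["https://example.com/opinion-%d.pdf" % i for i in range(20)]

# Baseline: one call after another, as .apply would do.
start = time.perf_counter()
sequential = [fake_link_to_text(u) for u in urls]
sequential_seconds = time.perf_counter() - start

# Thread pool: several simulated downloads wait at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    threaded = list(pool.map(fake_link_to_text, urls))  # map keeps input order
threaded_seconds = time.perf_counter() - start
```

Because `pool.map` returns results in input order, the list could be assigned straight back to a DataFrame column. The real Link2Text would need one change before running in threads: every call writes to the same "Temporary_Opinion.pdf", so concurrent calls would overwrite each other's file unless each one used a unique temporary filename.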
For completeness, based on some excellent answers here, I also tried:
from numba import njit
@njit
def Link2Text(Link, Opinion):
    res = np.empty(Link.shape)
    for i in range(length(Link)):
        OpinionPDF = wget.download(Link[i], "Temporary_Opinion.pdf")
        with open(OpinionPDF, "rb") as f:
            pdf = pdftotext.PDF(f)
        OpinionText = "\n\n".join(pdf)
        if os.path.exists("Temporary_Opinion.pdf"):
            os.remove("Temporary_Opinion.pdf")
        Opinion[i] = OpinionText
Link2Text(df['URLs'].values, df['OpinionText'].values)
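I assume this did not work because numba does not support the packages I call inside the function and is meant more for mathematical operations. If that is incorrect and I should be using numba for this, please let me know.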