How can I write Python code to extract specific text and an integer on the same line from a PDF file?

Posted 2024-07-03 07:18:12

Below is the data in one of my PDF files. Using "US stock price" as the keyword, I want to use Python to extract the integer 100 from the line "US stock price 100". How can I do this?

**** Lines from the PDF file ****

sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. 
Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? 
Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur
US stock price     100
"Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, 
totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. 
Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. 
Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, 
Abb price     50

Here is the code I am using for text extraction:

import PyPDF2
pdfFileObject = open(path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    Text=page.extractText()
    print(Text)

3 Answers

You can try the tika package.

from tika import parser

raw = parser.from_file('test.pdf')
# tika returns a dict; the extracted text is stored under the 'content' key
print(raw['content'])

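I see you are using PyPDF2, so I have provided an example for that module. I have also provided an example using the tika module. I decided to use regex to extract the requested text.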

import re as regex
import PyPDF2

pdfFileObject = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
  page = pdfReader.getPage(i)
  text = page.extractText()

  # joining lines, because PyPDF2 
  # output isn't formatted correctly 
  pdf_text = ''.join(text.splitlines())

  find_stock_price = regex.findall(r'us stock price\s{2,}\d{2,4}\s', pdf_text, regex.IGNORECASE)
  if find_stock_price:
    # attempt to clean the output
    reformat_price = [regex.sub(r'\s\s+' , ' ', str(x).strip()) for x in find_stock_price]
    print(reformat_price)
    # output: ['US stock price 100']

import re as regex
from tika import parser

parsedPDF = parser.from_file("test.pdf")
pdf = parsedPDF["content"]
pdf = pdf.replace('\n\n', '\n')

# joining lines, because tika 
# output isn't formatted correctly
pdf_text = ''.join(pdf.splitlines())

find_stock_price = regex.findall(r'us stock price\s{2,}\d{2,4}\s', pdf_text, regex.IGNORECASE)
if find_stock_price:
   # attempt to clean the output
   reformat_price = [regex.sub(r'\s\s+' , ' ', str(x).strip()) for x in find_stock_price]
   print(reformat_price)
   # output: ['US stock price 100']
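
The regex above returns the whole matched string rather than the number itself. If the integer is needed as an `int`, a capture group can pull it out directly. The following is only a minimal sketch, assuming the extracted text keeps the keyword and the number on the same line; the function name `extract_value` and the `sample` string are hypothetical and stand in for the text returned by PyPDF2 or tika.

```python
import re

def extract_value(text, keyword):
    """Return the first integer that follows `keyword` on the same line, or None."""
    # re.escape treats the keyword literally; [ \t]+ matches spaces or tabs
    # only, so the match cannot run over a line break.
    pattern = re.compile(re.escape(keyword) + r"[ \t]+(\d+)\b", re.IGNORECASE)
    match = pattern.search(text)
    return int(match.group(1)) if match else None

# Hypothetical stand-in for text extracted from the PDF
sample = (
    "fugiat quo voluptas nulla pariatur\n"
    "US stock price     100\n"
    "Neque porro quisquam est\n"
    "Abb price     50\n"
)

print(extract_value(sample, "US stock price"))  # 100
print(extract_value(sample, "Abb price"))       # 50
```

Working line by line this way avoids joining all lines into one string, which can glue the number to the first word of the next line.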

Here is code that searches for a keyword in a PDF file.

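import PyPDF2
import re

# avoid naming the reader "object", which would shadow a Python builtin
reader = PyPDF2.PdfFileReader("test.pdf")
num_pages = reader.getNumPages()
keyword = "US stock price"
for i in range(num_pages):
    page = reader.getPage(i)
    print("this is page " + str(i))
    txt = page.extractText()
    # re.search returns a match object, or None if the keyword is not on this page
    res_search = re.search(keyword, txt)
    print(res_search)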
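
Printing the result of `re.search` only shows a match object or `None`, not the number next to the keyword. As a rough sketch of how the per-page search could also return the value, the function below takes a list of page texts (for example, the strings returned by `extractText()` for each page) and collects every integer that follows the keyword. The name `find_values_by_page` and the `pages` list are hypothetical and used only for illustration.

```python
import re

def find_values_by_page(page_texts, keyword):
    """Return (page_index, value) pairs for each integer following `keyword`."""
    pattern = re.compile(re.escape(keyword) + r"[ \t]+(\d+)\b", re.IGNORECASE)
    results = []
    for page_index, text in enumerate(page_texts):
        # finditer yields every occurrence on the page, not just the first
        for match in pattern.finditer(text):
            results.append((page_index, int(match.group(1))))
    return results

# Hypothetical stand-in for the text of three PDF pages
pages = [
    "Lorem ipsum\nUS stock price     100\nAbb price     50\n",
    "no prices on this page\n",
    "US stock price 250\n",
]

print(find_values_by_page(pages, "US stock price"))  # [(0, 100), (2, 250)]
```

Page indexes are zero-based here, matching the loop counter in the code above.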
