Web scraping in Python: converting PDF files to txt files

    import csv
    from bs4 import BeautifulSoup
    import requests

    source = requests.get('https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm').text
    soup = BeautifulSoup(source, 'lxml')
    for b in soup.find_all("a", href=True):
        if b.text == 'Press Conference':
            lnk = 'https://www.federalreserve.gov' + b['href']
            source2 = requests.get(lnk).text
            soup2 = BeautifulSoup(source2, 'lxml')
            for c in soup2.find_all("a", href=True):
                if 'Press Conference Transcript' in c.text:
                    lnk2 = 'https://www.federalreserve.gov' + c['href']
                    source3 = requests.get(lnk2).text
                    soup3 = BeautifulSoup(source3, 'lxml')
                    for d in soup3.find_all("div", attrs={"id", "content"}):
                        print(d)
                        fileout = open('conf.txt', 'a')
                        fileout.write(d)

2 Answers

User

#1 · Edited 2024-09-22 16:33:17

For the PDF scraping part, I suggest the following:

    import io

    import requests
    import PyPDF2

    # Download the PDF
    URL = 'https://www.federalreserve.gov/monetarypolicy/files/monetary20200129a1.pdf'
    pdf_bytes = requests.get(URL).content
    # The PDF reader expects a file-like object, so wrap the raw bytes
    pdf_stream = io.BytesIO(pdf_bytes)
    # Note: PdfFileReader, getPage and extractText are the legacy PyPDF2 names;
    # PyPDF2 3.x replaced them with PdfReader, reader.pages[0] and extract_text()
    reader = PyPDF2.PdfFileReader(pdf_stream)
    # Read the first page
    page = reader.getPage(0)
    page_content = page.extractText()
    print(page_content.encode('utf-8'))

It may also be worth looking at "How to extract text from a PDF file?"
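Since the question asks for a txt file rather than a printout of the first page, here is a minimal sketch that extends the snippet above to every page and saves the result. The helper names `pdf_bytes_to_text` and `save_text` are made up for illustration, and the sketch assumes a PyPDF2 release that provides `PdfReader` (2.0 or later), not the legacy `PdfFileReader` used above.

```python
import io
from pathlib import Path


def pdf_bytes_to_text(pdf_bytes):
    """Return the text of every page in a PDF given as raw bytes."""
    # Imported here so the file-writing helper below works without PyPDF2.
    # Assumes a PyPDF2 release that provides PdfReader (2.0 or later).
    from PyPDF2 import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() can come back empty for scanned pages; keep those as ""
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def save_text(text, out_path):
    """Write text to out_path as UTF-8 and return the path written."""
    path = Path(out_path)
    path.write_text(text, encoding="utf-8")
    return path
```

With `pdf_bytes` downloaded as in the snippet above, `save_text(pdf_bytes_to_text(pdf_bytes), 'conf.txt')` should produce the text file. Extraction quality depends on how the PDF was generated, so the output may need some cleanup.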

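User

#2 · Edited 2024-09-22 16:33:17

Just a suggestion, if you want to check out the PyPDF2 library: it is very easy to use as long as your PDF is well-formed. A code example is as simple as this: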

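    from PyPDF2 import PdfFileReader

    def extract_information(pdf_path):
        # Open in binary mode; PdfFileReader is the legacy (pre-3.0) PyPDF2 name
        with open(pdf_path, 'rb') as f:
            pdf = PdfFileReader(f)
            information = pdf.getDocumentInfo()
            number_of_pages = pdf.getNumPages()
        # Return the results so the caller can use them
        return information, number_of_pages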

PDFMiner is also a good option.

This article from the RealPython blog is a bit dated, but it is also a good source of information.
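For the PDFMiner route, a rough sketch could look like the following. It assumes the `pdfminer.six` package, whose `pdfminer.high_level.extract_text` function reads a whole PDF in one call. The helper names `txt_path_for` and `pdf_to_txt` are hypothetical and only there to show the PDF-to-txt step.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path of the .txt file that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def pdf_to_txt(pdf_path):
    """Extract all text from pdf_path and write it to a sibling .txt file."""
    # Assumes the pdfminer.six package, which provides pdfminer.high_level
    from pdfminer.high_level import extract_text

    out_path = txt_path_for(pdf_path)
    out_path.write_text(extract_text(str(pdf_path)), encoding="utf-8")
    return out_path
```

Calling `pdf_to_txt('monetary20200129a1.pdf')` on a PDF that has already been downloaded would write `monetary20200129a1.txt` next to it.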
