Web scraping in Python: converting PDF files to txt files

Posted 2024-09-22 16:33:17


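I have tried several ways to fetch the Federal Reserve press conference transcripts (PDF format) and convert them to .txt files, but without success. My original code is below. Any suggestions would be greatly appreciated.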

import csv
from bs4 import BeautifulSoup
import requests

source=requests.get('https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm').text
soup=BeautifulSoup(source,'lxml')

for b in soup.find_all("a",href=True):
    if b.text=='Press Conference':
        lnk='https://www.federalreserve.gov'+b['href']
        source2=requests.get(lnk).text
        soup2=BeautifulSoup(source2,'lxml')
        for c in soup2.find_all("a",href=True):
            if 'Press Conference Transcript'in c.text:
                lnk2='https://www.federalreserve.gov'+c['href']
                source3=requests.get(lnk2).text
                soup3=BeautifulSoup(source3,'lxml')
                for d in soup3.find_all("div",attrs={"id","content"}):
                    print(d)
                    fileout = open('conf.txt', 'a')
                    fileout.write(d)

2 Answers

For scraping the PDF itself, here is what I came up with:

import requests
import io
import PyPDF2

# Download the PDF
URL = 'https://www.federalreserve.gov/monetarypolicy/files/monetary20200129a1.pdf'
pdf_bytes = requests.get(URL).content
# The PDF reader expects a file-like object
pdf_stream = io.BytesIO(pdf_bytes)
# PyPDF2 3.0 removed the older PdfFileReader / getPage / extractText names
reader = PyPDF2.PdfReader(pdf_stream)
# Read the first page
page = reader.pages[0]
page_content = page.extract_text()
print(page_content)

It may also be worth taking a look at "How to extract text from a PDF file?"
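To get from a single page to a whole .txt file, the same idea can be looped over every page. The following is only a minimal sketch, assuming a recent PyPDF2 (with `PdfReader`) and `requests` are installed; the helper names `txt_name_for`, `pdf_bytes_to_text` and `save_pdf_as_txt` are made up for illustration and are not part of any library.

```python
import io
from pathlib import PurePosixPath
from urllib.parse import urlparse


def txt_name_for(pdf_url):
    """Derive a local .txt file name from a PDF URL (hypothetical helper)."""
    stem = PurePosixPath(urlparse(pdf_url).path).stem
    return stem + ".txt"


def pdf_bytes_to_text(pdf_bytes):
    """Join the extracted text of every page, one page after another."""
    # Imported lazily so the rest of the module works without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() can come back empty for image-only pages, hence `or ""`.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def save_pdf_as_txt(pdf_url):
    """Download one PDF and write its text next to the script."""
    import requests

    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()  # fail loudly on 404 and similar
    out_name = txt_name_for(pdf_url)
    with open(out_name, "w", encoding="utf-8") as fileout:
        fileout.write(pdf_bytes_to_text(response.content))
    return out_name
```

Calling `save_pdf_as_txt(URL)` with the URL from the snippet above should leave a `monetary20200129a1.txt` file in the working directory. Note that the response is read through `.content` (bytes) rather than `.text`, since a PDF is binary data and cannot be parsed with BeautifulSoup.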

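If you insist, one suggestion is to check out the PyPDF2 library. It is very easy to use as long as your PDF is well-formed. A code sample looks as simple as this: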

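    from PyPDF2 import PdfReader  # PyPDF2 3.0 removed the old PdfFileReader name

    def extract_information(pdf_path):
        # Open in binary mode, then read the document metadata and page count
        with open(pdf_path, 'rb') as f:
            pdf = PdfReader(f)
            information = pdf.metadata
            number_of_pages = len(pdf.pages)
        return information, number_of_pages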

PDFMiner is also a good option.
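For comparison, here is a rough sketch of the same extraction with PDFMiner, assuming the maintained `pdfminer.six` package and its `pdfminer.high_level.extract_text` function. The `collapse_blank_lines` cleanup step and both function names are hypothetical additions for illustration only.

```python
import io
import re


def collapse_blank_lines(text):
    """Squeeze runs of 3+ newlines into a single blank line (hypothetical cleanup)."""
    return re.sub(r"\n{3,}", "\n\n", text).strip() + "\n"


def pdf_bytes_to_text_pdfminer(pdf_bytes):
    """Extract the text of a whole PDF held in memory using pdfminer.six."""
    # Imported lazily; requires `pip install pdfminer.six`.
    from pdfminer.high_level import extract_text

    # extract_text accepts a path or a binary file-like object.
    return collapse_blank_lines(extract_text(io.BytesIO(pdf_bytes)))
```

The bytes passed in would be the `.content` of a `requests` response for the PDF URL; the returned string can be written straight to a .txt file.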

This article from the RealPython blog is a bit old, but it is also a good source of information.
