Unable to extract only the names from a table in a PDF file on a web page

import io

import PyPDF2
import requests

URL = 'https://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/CertificationandComplianc/Downloads/SFFList.pdf'

# Download the PDF and wrap the bytes in a file-like object
res = requests.get(URL)
f = io.BytesIO(res.content)

# Page indices are zero-based, so getPage(3) is the fourth page
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(3).extractText()
print(contents)

Facilit y Name Address City State Zip Phone Number Months as an SFFWillows Center 320 North Crawford Street Willows CA95988530-934-2834 5Winter Park Care & Rehabilitation Center 2970 Scarlett Rd Winter Park FL32792407-671-8030 and so on -----

1 answer

User

#1 · Posted 2024-05-08 19:52:08

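Unfortunately, a PDF is not a structured document. It is just strings and images placed at coordinates so that the page looks exactly as it did when it was created, regardless of which program renders it. This means you cannot simply parse it the way you would HTML, because the table is not wrapped in a <table> element; its pieces are scattered across the page.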

Take a look at https://github.com/atlanhq/camelot, it may help you.
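As a rough sketch of how camelot might be used here: the snippet below assumes the PDF has already been saved locally (the path is hypothetical), that the table is on page 4 (camelot counts pages from 1, so this corresponds to getPage(3) in the question), and that the facility name lands in the first column. The helper names and the header string "Facility Name" are made up for illustration, and whether the default "lattice" flavor or "stream" works better depends on how the table is drawn.

```python
def names_from_rows(rows, header="Facility Name"):
    """Return the non-empty first cell of every row, skipping the header row.

    The header text is an assumption about how the table is extracted.
    """
    names = []
    for row in rows:
        cell = str(row[0]).strip()
        if cell and cell != header:
            names.append(cell)
    return names


def extract_names(pdf_path, pages="4", flavor="lattice"):
    """Read tables from a local PDF with camelot and collect the first column.

    camelot is a third-party package; it is imported here so that the helper
    above can be used without it. Try flavor="stream" if the table has no
    ruling lines.
    """
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    names = []
    for i in range(len(tables)):
        # Each table exposes its cells as a pandas DataFrame via .df
        names.extend(names_from_rows(tables[i].df.values.tolist()))
    return names


# Example call (hypothetical local file name):
# print(extract_names("SFFList.pdf"))
```

Keeping the row filtering separate from the camelot call makes it easy to adjust which column or header to use once you have seen what the extracted table actually looks like.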

(There are at most 10 pages of tables here, so copying the names out by hand may be the faster option unless you have many PDFs like this.)
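If you would rather stay with the text that extractText() already returns, one possible workaround is a regular expression keyed on the fields that have a predictable shape (two-letter state, 5-digit zip, phone number). This is only a heuristic sketch based on the sample output in the question: it assumes facility names contain no digits, that every street address starts with a number, and that the header ends with "Months as an SFF". The function and constant names are made up for illustration.

```python
import re

# Assumed end of the table header in the extracted text
HEADER_END = "Months as an SFF"

RECORD = re.compile(
    r"(?P<name>\D+?)\s+"                  # name: no digits, runs up to the street number
    r"(?P<address>\d.*?)\s*"              # street address and city, starting with a digit
    r"(?P<state>[A-Z]{2})(?P<zip>\d{5})"  # state and zip are fused in the output
    r"(?P<phone>\d{3}-\d{3}-\d{4})"       # phone number follows the zip directly
    r"(?:\s*(?P<months>\d+))?",           # months as an SFF, if present
    re.DOTALL,
)


def facility_names(text):
    """Return the facility names found in flattened table text."""
    # Drop the header if it is present; otherwise scan the whole text
    _, found, rest = text.partition(HEADER_END)
    body = rest if found else text
    return [m.group("name").strip() for m in RECORD.finditer(body)]


sample = (
    "Facilit y Name Address City State Zip Phone Number Months as an SFF"
    "Willows Center 320 North Crawford Street Willows CA95988530-934-2834 5"
    "Winter Park Care & Rehabilitation Center 2970 Scarlett Rd Winter Park "
    "FL32792407-671-8030"
)

print(facility_names(sample))
# ['Willows Center', 'Winter Park Care & Rehabilitation Center']
```

Rows that break these assumptions (a name containing a digit, an address without a street number) will be split incorrectly, so it is worth spot-checking the result against the PDF.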
