How do I import tables from multiple PDFs into a single DataFrame using Python?

!pip install -q tabula-py
!pip install pandas

import pandas as pd
import tabula
from tabula import read_pdf

pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"
data = read_pdf(pdf, output_format='dataframe', pages="all")
data

[       Community  Sales  Dollar Volume  ...  Active Listings  Avg. SP/LP  Avg. DOM
0            Ajax    391   $265,999,351  ...               73        100%        21
1    Central East     32    $21,177,488  ...                3         99%        26
2  Northeast Ajax     70    $50,713,199  ...               18        100%        21
3      South East    105    $68,203,487  ...               15        100%        20

[4 rows x 9 columns]]

1 answer

User

#1 · Posted 2024-09-30 00:27:59

The following code almost works:

import pandas as pd
from tabula import convert_into

pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"

# extract the tables from all pages into one CSV file;
# lattice takes a boolean, not the string "true"
convert_into(pdf, "test.csv", pages="all", lattice=True)

# copy the CSV to a new file without its first line
with open("test.csv", 'r') as f:
    with open("updated_test.csv", 'w') as f1:
        next(f)  # skip header line
        for line in f:
            f1.write(line)

data = pd.read_csv("updated_test.csv")

# rename first column, drop unwanted rows

data.rename(columns={'Unnamed: 0': 'Community'}, inplace=True)
data.dropna(inplace=True)

data

and it gives this output:

        Community  Year  Quarter  Sales  Dollar Volume  Average Price  Median Price  New Listings  Active Listings  Avg. SP/LP
1         Central  2019       Q4   44.0    $27,618,950       $627,703      $630,500          67.0              8.0         99%
2    Central East  2019       Q4   32.0    $21,177,488       $661,797      $627,450          34.0              3.0         99%
3    Central West  2019       Q4   57.0    $40,742,450       $714,780      $675,000          65.0              7.0         99%
4  Northeast Ajax  2019       Q4   70.0    $50,713,199       $724,474      $716,500          82.0             18.0        100%
5  Northwest Ajax  2019       Q4   49.0    $37,192,790       $759,037      $765,000          63.0             14.0         99%
6      South East  2019       Q4  105.0    $68,203,487       $649,557      $640,000         117.0             15.0        100%
7      South West  2019       Q4   34.0    $20,350,987       $598,558      $590,000          36.0             15.0        100%
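As a side note, the intermediate updated_test.csv file is probably not needed, since pandas can skip the first line while reading. The sketch below shows the same cleanup done in one pass. The raw_csv text is a made-up stand-in for test.csv (I am assuming one extra line on top and an unnamed first column, which is what the code above implies), and load_clean is just a name I picked for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of test.csv: one extra line on top,
# then the real header row whose first column has no name.
raw_csv = (
    "Ajax Q4 2019\n"
    ",Year,Quarter,Sales\n"
    "Central,2019,Q4,44\n"
    ",,,\n"
    "Central East,2019,Q4,32\n"
)


def load_clean(csv_source):
    """Read the CSV without its first line, then tidy it up."""
    # skiprows=1 replaces the manual copy loop that drops the first line
    df = pd.read_csv(csv_source, skiprows=1)
    # the unnamed first column is read as 'Unnamed: 0'
    df = df.rename(columns={"Unnamed: 0": "Community"})
    # drop rows that contain any missing value
    return df.dropna()


data = load_clean(io.StringIO(raw_csv))
print(data)
```

With the real file, load_clean("test.csv") should behave the same way, as long as the file really has that shape.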

The only problem with this output is that the convert_into command did not find the last column, "Avg. DOM".

The missing column does not matter for my analysis, but it will certainly be a problem for others trying to pull tables in a similar way.
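For the "multiple PDFs" part of the question, one possible approach is to read each file with read_pdf, which returns a list of DataFrames, and stack everything with pd.concat. This is only a sketch: combine_pdf_tables, its reader and expected_columns parameters, and the added "source" column are my own names, not part of tabula-py. The expected_columns check is there to flag a file where a column such as "Avg. DOM" was not detected; it reports the gap but does not fix it.

```python
import pandas as pd


def combine_pdf_tables(sources, reader=None, expected_columns=None):
    """Read every table from every PDF in `sources` into one DataFrame.

    `reader` is any callable that takes a path or URL and returns a list of
    DataFrames. If omitted, tabula.read_pdf is used with pages="all".
    """
    if reader is None:
        # imported lazily so the helper can be tried without Java/tabula
        from tabula import read_pdf

        def reader(src):
            return read_pdf(src, pages="all", lattice=True)

    frames = []
    for src in sources:
        for table in reader(src):
            if expected_columns is not None:
                missing = [c for c in expected_columns if c not in table.columns]
                if missing:
                    print(f"{src}: missing columns {missing}")
            # remember which file each row came from
            frames.append(table.assign(source=src))

    if not frames:
        return pd.DataFrame()
    # one continuous index across all files
    return pd.concat(frames, ignore_index=True)


# Example call (not executed here, since it needs tabula-py, Java and the files):
# data = combine_pdf_tables(list_of_pdf_paths_or_urls,
#                           expected_columns=["Community", "Avg. DOM"])
```

Tables whose columns differ between files will still be concatenated, with NaN filling the gaps, so the cleanup steps above may need to run on each table before combining.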
