Extracting information from HTML tags into

Posted 2024-10-01 07:38:11


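I have a folder full of HTML files. I am trying to select the right HTML tags so that I can print the citations correctly; the only output I need is the publication number and the title. This is what I have put together so far, with the help of various posts over the years: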

import pandas as pd
from bs4 import BeautifulSoup

with open(filename, 'r', encoding='utf-8') as f:  # start loop to read HTML files in folder
    patent = f.read()
    # print(filename)
    soup = BeautifulSoup(patent, 'html.parser')
    x = soup.select('tr[itemprop="backwardReferencesOrig"]')
    backorigdf = pd.read_html(str(x))  # this is the line that raises the ValueError
    print(backorigdf.loc[:, ['Publication number', 'Title']])

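But I get the error message ValueError: No tables found. I would like to output the citations from multiple HTML files as a pandas DataFrame so that the data is easier to analyze. Can anyone tell me what I am doing wrong? This is the link to the HTML file: https://patents.google.com/patent/US4458945?oq=US4458945A. The file is saved as an HTML file on my computer, and I do not want to read it from the URL; I want to extract the information from the saved HTML document.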


2 answers

You can do this with the pd.read_html() function:

import pandas as pd

url = 'https://patents.google.com/patent/US4458945?oq=US4458945A'
tables = pd.read_html(url, match='Publication number')  # keep only tables containing the match string
print(len(tables))

7  # it found 7 tables

You can display the tables like this:

# display() is available in Jupyter/IPython; use print() in a plain script
display(tables[0])

# or

for table in tables:
    display(table)

A sample table from the result set:

    Publication number  Priority date  Publication date  Assignee             Title
0   US1722679A (en) *   1927-05-11     1929-07-30        Standard Oil Dev Co  Pressure method of working oil sands
1   US1884859A (en) *   1930-02-12     1932-10-25        Standard Oil Dev Co  Method of and apparatus for installing mine wells
2   US2193219A (en) *   1938-01-04     1940-03-12        Bowie                Drilling wells through heaving or sloughing fo...
3   US2989294A (en) *   1956-05-10     1961-06-20        Alfred M Coker       Method and apparatus for developing oil fields...
4   US4165903A (en) *   1978-02-06     1979-08-28        Cobbs James H        Mine enhanced hydrocarbon recovery technique

Note: you can make the match argument more specific to find exactly what you need. If you omit it, read_html returns every table on the page.
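
To see what match does without hitting the network, here is a minimal sketch on a made-up two-table page. The HTML below is invented for illustration (the second table is deliberately unrelated) and is far simpler than the real Google Patents markup. Only the table whose text contains the match string comes back. Recent pandas versions prefer a file-like object over a raw HTML string, hence the StringIO wrapper.

```python
from io import StringIO

import pandas as pd

# Made-up HTML for illustration only: one citation-style table, one unrelated table.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Publication number</th><th>Title</th></tr></thead>
  <tbody>
    <tr><td>US1722679A (en) *</td><td>Pressure method of working oil sands</td></tr>
    <tr><td>US1852717A (en) *</td><td>Gas lift appliance for oil wells</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Code</th><th>Description</th></tr></thead>
  <tbody><tr><td>A1</td><td>Unrelated row</td></tr></tbody>
</table>
"""

# Without match, every table on the page is returned.
all_tables = pd.read_html(StringIO(SAMPLE_HTML))

# With match, only tables whose text contains the string are kept.
matched = pd.read_html(StringIO(SAMPLE_HTML), match="Publication number")

print(len(all_tables), len(matched))
print(matched[0][["Publication number", "Title"]])
```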

Result of combining tables 2 and 3:

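try:
    tables = pd.read_html('page.html', match='Publication number')
    # stack the two citation tables into a single DataFrame
    result_df = pd.concat([tables[2], tables[3]], axis=0, ignore_index=True)
    display(result_df)
except (ValueError, IndexError):
    # ValueError: no matching table; IndexError: fewer than four tables matched
    print('No tables found')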

    Publication number  Priority date  Publication date  Assignee                           Title
0   US1722679A (en) *   1927-05-11     1929-07-30        Standard Oil Dev Co                Pressure method of working oil sands
1   US1884859A (en) *   1930-02-12     1932-10-25        Standard Oil Dev Co                Method of and apparatus for installing mine wells
2   US2193219A (en) *   1938-01-04     1940-03-12        Bowie                              Drilling wells through heaving or sloughing fo...
3   US2989294A (en) *   1956-05-10     1961-06-20        Alfred M Coker                     Method and apparatus for developing oil fields...
4   US4165903A (en) *   1978-02-06     1979-08-28        Cobbs James H                      Mine enhanced hydrocarbon recovery technique
5   US1884858A (en) *   1929-03-22     1932-10-25        Standard Oil Dev Co                Apparatus for simultaneously controlling oil m...
6   US1852717A (en) *   1930-09-08     1932-04-05        Union Oil Co                       Gas lift appliance for oil wells
7   US1910762A (en) *   1932-03-08     1933-05-23        Union Oil Co                       Gas lift apparatus
8   US2148327A (en) *   1937-12-14     1939-02-21        Gray Tool Co                       Oil well completion apparatus
9   US3207221A (en) *   1963-03-21     1965-09-21        Brown Oil Tools                    Automatic blow-out preventor means
10  US3227229A (en) *   1963-08-28     1966-01-04        Richfield Oil Corp                 Bit guide
11  US3613806A (en) *   1970-03-27     1971-10-19        Shell Oil Co                       Drilling mud system
12  US3884261A (en) *   1973-11-26     1975-05-20        Frank Clynch                       Remotely activated valve
13  US4046191A (en) *   1975-07-07     1977-09-06        Exxon Production Research Company  Subsea hydraulic choke
14  US4106562A (en) *   1977-05-16     1978-08-15        Union Oil Company Of California    Wellhead apparatus
15  US4224988A (en) *   1978-07-03     1980-09-30        A. C. Co.                          Device for and method of sensing conditions in...
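
Since the question is about a folder of saved HTML files and only two columns, the same idea can be wrapped in a loop. This is a sketch under assumptions: the helper name collect_citations, the extra source_file column, and the *.html glob pattern are illustrative choices, not part of the answer above. Files in which pandas finds no matching table are skipped.

```python
from io import StringIO
from pathlib import Path

import pandas as pd

WANTED = ["Publication number", "Title"]


def collect_citations(folder, match="Publication number"):
    """Gather the two wanted columns from every *.html file in a folder.

    Hypothetical helper for illustration; not part of the original answer.
    """
    frames = []
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8")
        try:
            tables = pd.read_html(StringIO(html), match=match)
        except ValueError:
            # pandas raises ValueError when no table matches
            continue
        for table in tables:
            # keep only tables that actually carry both wanted columns
            if set(WANTED) <= set(table.columns):
                subset = table[WANTED].copy()
                subset["source_file"] = path.name  # remember where each row came from
                frames.append(subset)
    if not frames:
        return pd.DataFrame(columns=WANTED + ["source_file"])
    return pd.concat(frames, ignore_index=True).drop_duplicates(ignore_index=True)
```

Calling collect_citations('patents') on a folder named patents (a placeholder path) would return one DataFrame covering all files, which can then be filtered or grouped by source_file.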

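It would help to know the total number of expected results. Below, I retrieve 25 unique results by using :contains to target the citation h2 elements and then moving to the adjacent table.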

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://patents.google.com/patent/US4458945?oq=US4458945A')
soup = bs(r.content, 'lxml')
# :-soup-contains is the current spelling of the deprecated :contains selector;
# StringIO avoids the pandas deprecation warning for literal HTML strings
df = pd.concat([pd.read_html(StringIO(str(t.find_next('table'))))[0]
                for t in soup.select('h2:-soup-contains("Citations", "Family Cites")')])

df.drop_duplicates(inplace=True)
df.sort_values(by=['Priority date'], inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
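
Neither answer spells out why the snippet in the question fails. A likely reason: selecting tr elements returns bare rows, so the string handed to pd.read_html contains no table element at all, and pandas reports that it found no tables. One way around this, sketched below on a made-up fragment, is to read the cells of each selected row directly. The sample HTML is invented and much simpler than the real page, and the mapping of cells to columns is an assumption (first cell is the publication number, last cell is the title).

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented fragment for illustration; the real page markup is more complex.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Publication number</th><th>Priority date</th><th>Title</th></tr></thead>
  <tbody>
    <tr itemprop="backwardReferencesOrig"><td>US1722679A (en) *</td><td>1927-05-11</td><td>Pressure method of working oil sands</td></tr>
    <tr itemprop="backwardReferencesOrig"><td>US1852717A (en) *</td><td>1930-09-08</td><td>Gas lift appliance for oil wells</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.select('tr[itemprop="backwardReferencesOrig"]')

# str(rows) holds only <tr> elements, so read_html would find no <table> in it.
has_table_tag = "<table" in str(rows)

records = []
for row in rows:
    cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
    if cells:
        # Assumption: first cell is the publication number, last cell is the title.
        records.append({"Publication number": cells[0], "Title": cells[-1]})

citations = pd.DataFrame(records, columns=["Publication number", "Title"])
print(has_table_tag)
print(citations)
```

Building the DataFrame from a list of dictionaries sidesteps read_html entirely, at the cost of hard-coding which cell holds which value.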
