Extracting the contents of ids with BeautifulSoup in Python

Posted 2024-10-03 21:28:09


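I have some HTML pages that should each contain one table with some rows and two columns, and I am trying to convert them to a CSV table. I want to loop over the rows and get the columns, but I cannot get just the part stored in the id attribute (for example id="(-) Additional deductions of AT1 Capital due to Article 3 CRR"). Is there a way to extract just the content of id for each row?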


import csv
from bs4 import BeautifulSoup

file = '/Users/tom/Downloads/Capitalresourceitemlevel1.html'

with open(file) as fh:
    soup = BeautifulSoup(fh, "html.parser")

# Collect (first cell text, second cell text) for every two-cell row.
definitions = []
for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        row_tds = row.find_all('td')
        if len(row_tds) == 2:
            definitions.append((row_tds[0].text, row_tds[1].text))

# One way to write the collected rows to a CSV file.
with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(definitions)


<html xmlns="http://www.w3.org/1999/xhtml">

<title>Bank of England</title>
<div>Stress Test Data Framework Dictionary 2021: Version 02</div>
<p id="breadcrumbs">&gt;<a href="..\..\..\Home.html" xmlns="">Home</a>&gt;<a href="..\..\..\Contents\Contents.html" xmlns="">Contents</a>&gt;<a href="..\..\..\Contents\Enumerations\Enumerations.html" xmlns="">Enumerations</a>&gt;<a href="..\..\..\Contents\Enumerations\Capitalresourceitemlevel1\Capitalresourceitemlevel1.html" xmlns="">Capitalresourceitemlevel1</a></p>
<h1 id="..\..\..\Contents\Enumerations\Capitalresourceitemlevel1\Capitalresourceitemlevel1.html" xmlns="">Capitalresourceitemlevel1</h1>
<p class="TableContainer" xmlns="">
<table id="table1">
<td id="(-) Additional deductions of AT1 Capital due to Article 3 CRR" text="">(-) Additional deductions of AT1 Capital due to Article 3 CRR</td>
<td id="" text="">COREP CA1 ID1.1.2.11</td>
<td id="(-) Additional deductions of CET1 Capital due to Article 3 CRR" text="">(-) Additional deductions of CET1 Capital due to Article 3 CRR</td>
<td id="" text="">COREP CA1 ID1.1.1.27</td>
<td id="(-) Additional deductions of T2 Capital due to Article 3 CRR" text="">(-) Additional deductions of T2 Capital due to Article 3 CRR</td>
<td id="" text="">COREP CA1 ID1.2.12</td>
<td id="(-) Amount exceeding the 17.65% threshold" text="">(-) Amount exceeding the 17.65% threshold</td>
<td id="" text="">COREP CA1 ID1.1.1.25</td>
<td id="(-) AT1 instruments of financial sector entities where the institution does not have a significant investment" text="">(-) AT1 instruments of financial sector entities where the institution does not have a significant investment</td>
<td id="" text="">COREP CA1 ID1.1.2.6</td>
<td id="(-) AT1 instruments of financial sector entities where the institution has a significant investment" text="">(-) AT1 instruments of financial sector entities where the institution has a significant investment</td>
<td id="" text="">COREP CA1 ID1.1.2.7</td>
<td id="(-) CET1 instruments of financial sector entities where the institution does not have a significant investment" text="">(-) CET1 instruments of financial sector entities where the institution does not have a significant investment</td>
<td id="" text="">COREP CA1 ID1.1.1.22</td>
<td id="(-) CET1 instruments of financial sector entities where the institution has a significant investment" text="">(-) CET1 instruments of financial sector entities where the institution has a significant investment</td>
<td id="" text="">COREP CA1 ID1.1.1.24</td>
<td id="(-) Deductible deferred tax assets that rely on future profitability and arise from temporary differences" text="">(-) Deductible deferred tax assets that rely on future profitability and arise from temporary differences</td>
<td id="" text="">COREP CA1 ID1.1.1.23</td>
<td id="(-) Deferred tax assets that rely on future profitability and do not arise from temporary differences net of associated tax liabilities" text="">(-) Deferred tax assets that rely on future profitability and do not arise from temporary differences net of associated tax liabilities</td>
<td id="" text="">COREP CA1 ID1.1.1.12</td>
<td id="(-) Equity exposures under an internal models approach which can alternatively be subject to a 1250% risk weight" text="">(-) Equity exposures under an internal models approach which can alternatively be subject to a 1250% risk weight</td>
<td id="" text="">COREP CA1 ID1.1.1.21</td>
<td id="(-) Excess of deduction from AT1 items over AT1 Capital" text="">(-) Excess of deduction from AT1 items over AT1 Capital</td>
<td id="" text="">COREP CA1 ID1.1.1.16</td>
<td id="(-) Excess of deduction from T2 items over T2 Capital" text="">(-) Excess of deduction from T2 items over T2 Capital</td>
<td id="" text="">COREP CA1 ID1.1.2.8</td>
<td id="(-) Free deliveries which can alternatively be subject to a 1250% risk weight" text="">(-) Free deliveries which can alternatively be subject to a 1250% risk weight</td>
<td id="" text="">COREP CA1 ID1.1.1.19</td>
<td id="(-) Goodwill" text="">(-) Goodwill</td>
<td id="" text="">COREP CA1 ID1.1.1.10</td>
<td id="(-) IRB shortfall of credit risk adjustments to expected losses" text="">(-) IRB shortfall of credit risk adjustments to expected losses</td>
<td id="" text="">COREP CA1 ID1.1.1.13</td>
<td id="(-) Other intangible assets" text="">(-) Other intangible assets</td>
<td id="" text="">COREP CA1 ID1.1.1.11</td>
<td id="(-) Positions in a basket for which an institution cannot determine the risk weight under the IRB approach, and can alternatively be subject to a 1250% risk weight" text="">(-) Positions in a basket for which an institution cannot determine the risk weight under the IRB approach, and can alternatively be subject to a 1250% risk weight</td>
<td id="" text="">COREP CA1 ID1.1.1.20</td>
<td id="(-) Qualifying holdings outside the financial sector which can alternatively be subject to a 1250% risk weight" text="">(-) Qualifying holdings outside the financial sector which can alternatively be subject to a 1250% risk weight</td>
<td id="" text="">COREP CA1 ID1.1.1.17</td>
<td id="(-) Reciprocal cross holdings in AT1 Capital" text="">(-) Reciprocal cross holdings in AT1 Capital</td>
<td id="" text="">COREP CA1 ID1.1.2.5</td>
<td id="(-) Reciprocal cross holdings in CET1 Capital" text="">(-) Reciprocal cross holdings in CET1 Capital</td>
<td id="" text="">COREP CA1 ID1.1.1.15</td>
<td id="(-) Reciprocal cross holdings in T2 Capital" text="">(-) Reciprocal cross holdings in T2 Capital</td>
<td id="" text="">COREP CA1 ID1.2.7</td>
<td id="(-) Securitisation positions which can alternatively be subject to a 1250% risk weight" text="">(-) Securitisation positions which can alternatively be subject to a 1250% risk weight</td>
<td id="" text="">COREP CA1 ID1.1.1.18</td>
<td id="(-) T2 instruments of financial sector entities where the institution does not have a significant investment" text="">(-) T2 instruments of financial sector entities where the institution does not have a significant investment</td>
<td id="" text="">COREP CA1 ID1.2.8</td>
<td id="(-) T2 instruments of financial sector entities where the institution has a significant investment" text="">(-) T2 instruments of financial sector entities where the institution has a significant investment</td>
<td id="" text="">COREP CA1 ID1.2.9</td>
<td id="(-)Defined benefit pension fund assets" text="">(-)Defined benefit pension fund assets</td>
<td id="" text="">COREP CA1 ID</td>
<td id="10% CET1 threshold" text="">10% CET1 threshold</td>
<td id="" text="">COREP CA4 ID9</td>
<td id="17.65% CET1 threshold" text="">17.65% CET1 threshold</td>
<td id="" text="">COREP CA4 ID10</td>
<td id="Accumulated other comprehensive income" text="">Accumulated other comprehensive income</td>
<td id="" text="">COREP CA1 ID1.1.1.3</td>
<td id="Additional Tier 1 Capital" text="">Additional Tier 1 Capital</td>
<td id="" text="">COREP CA1 ID1.1.2</td>
<td id="Adjustments to CET1 due to prudential filters" text="">Adjustments to CET1 due to prudential filters</td>
<td id="" text="">COREP CA1 ID1.1.1.9</td>
<td id="Adjustments to total own funds" text="">Adjustments to total own funds</td>
<td id="" text=""></td>
<td id="AT1 capital elements or deductions - other" text="">AT1 capital elements or deductions - other</td>
<td id="" text="">COREP CA1 ID1.1.2.12</td>
<td id="Capital instruments and subordinated loans eligible as T2 Capital" text="">Capital instruments and subordinated loans eligible as T2 Capital</td>
<td id="" text="">COREP CA1 ID1.2.1</td>
<td id="Capital instruments eligible as AT1 Capital" text="">Capital instruments eligible as AT1 Capital</td>
<td id="" text="">COREP CA1 ID1.1.2.1</td>
</table>
</p>
<div class="footer" xmlns=""><span class="copyright">
        Copyright: Bank Of England</span><span class="float_center">
        PageNo:298</span><span class="date">
        Created: 2021-01-18 00:00:00</span></div>

Answer 1 · Posted 2024-10-03 21:28:09


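  • definitions.append([td.text for td in i.find_all('td')]) collects the visible text of each cell.
  • definitions.append([td['id'] for td in i.find_all('td')]) collects the id attribute of each cell instead (per Justin Ezequiel's comment).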
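
A minimal sketch of the id-based variant, assuming each row is a `<tr>` holding two `<td>` cells as the question describes. The inline snippet `sample_html` and the helper `extract_ids` are made up for illustration and are not from the original post. `td.get('id')` is used rather than `td['id']` so that a cell without an id attribute yields `None` instead of raising `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample mirroring the structure described in the question.
sample_html = """
<table id="table1">
  <tr><td id="(-) Goodwill" text="">(-) Goodwill</td><td id="" text="">COREP CA1 ID1.1.1.10</td></tr>
  <tr><td id="10% CET1 threshold" text="">10% CET1 threshold</td><td id="" text="">COREP CA4 ID9</td></tr>
</table>
"""


def extract_ids(html):
    """Return the id attribute of the first cell in every two-cell row."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 2:
            # .get() returns None when the attribute is absent.
            ids.append(cells[0].get("id"))
    return ids


ids = extract_ids(sample_html)
print(ids)
```

In the sample page the second cell of each row has an empty id, so only the first cell's id is worth keeping; the second column still has to come from `.text`.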


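This is how you can do it: basically, for each row you need to store a tuple with the text of its two td cells. The second part is one way to write the result to a file.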

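import csv

definitions = []
for i in rows:  # `rows` is the list of <tr> elements from the question's code
    row_tds = i.find_all('td')
    if len(row_tds) == 2:
        # Keep the text of both cells as one tuple.
        definitions.append((row_tds[0].text, row_tds[1].text))
# One way to write the rows: csv.writer quotes fields that contain commas.
with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(definitions)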
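
One detail matters for the file-writing step: at least one id in the sample table contains a comma, so joining fields with `','` by hand would split that row into three columns. The sketch below, using two rows copied from the sample table, shows the standard library `csv` module quoting such a field and reading it back intact. The helper name `rows_to_csv` and the in-memory buffer are assumptions for illustration; a real script would write to `output.csv` instead.

```python
import csv
import io

# Two rows copied from the sample table; the second name contains a comma.
definitions = [
    ("(-) Goodwill", "COREP CA1 ID1.1.1.10"),
    (
        "(-) Positions in a basket for which an institution cannot determine "
        "the risk weight under the IRB approach, and can alternatively be "
        "subject to a 1250% risk weight",
        "COREP CA1 ID1.1.1.20",
    ),
]


def rows_to_csv(rows):
    """Serialise (name, reference) tuples to CSV text."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(definitions)
print(csv_text)

# Reading the text back recovers exactly two columns per row.
round_trip = [tuple(r) for r in csv.reader(io.StringIO(csv_text, newline=""))]
```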
