Extracting values from a table using Python and BeautifulSoup?

Published 2024-07-04 14:32:45


I'm trying to write a Python script that extracts some of the values from the tables on this page: https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/

I've included a screenshot of the HTML source, but I can't work out how to extract the price data in columns 6, 7, 8, and 9. Below is the code I've written so far.

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table1 = soup.find_all('table', class_='sd-table')

# writing the first few columns to a text file
with open('examplefile.txt', 'w') as r:
    for row in table1.find_all('tr'):
        for cell in row.find_all('td'):
            r.write(cell.text.ljust(5))
        r.write('\n')

Ultimately I'm trying to extract all the values from each row and save them to a Pandas dataframe or a CSV. Thanks.


Tags: data, https, import, com, windows, virtual, all, find
3 Answers
soup.find_all('table', {'class': 'sd-table'}) returns a ResultSet (a list of tags), not a single tag, so the question's table1.find_all('tr') fails with an AttributeError. Use soup.find('table', {'class': 'sd-table'}) (or index the result, table1[0]) to get a single table first.
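A minimal, self-contained sketch of that fix, using a toy HTML snippet in place of the live page:

```python
from bs4 import BeautifulSoup

html = """
<table class="sd-table">
  <tr><td>A1</td><td>A2</td></tr>
  <tr><td>B1</td><td>B2</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns a single Tag (or None), so .find_all('tr') works on it;
# find_all() would return a ResultSet, which has no .find_all of its own
table = soup.find('table', class_='sd-table')
rows = [[cell.text for cell in row.find_all('td')]
        for row in table.find_all('tr')]
print(rows)  # [['A1', 'A2'], ['B1', 'B2']]
```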

Pandas can handle this on its own with read_html, which returns an array of the matching tables; you can then clean up the data types and so on in the resulting frames. Roughly like this:

import pandas as pd

url = 'https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/'

dfs = pd.read_html(url, attrs={'class':'sd-table'})

print(dfs[0])

Hope that helps!
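As a sketch of the dtype cleanup mentioned above (the '$0.012/hour' price format here is an assumption for illustration; check what read_html actually returns from the live page):

```python
import pandas as pd

# toy frame standing in for one of the read_html results
df = pd.DataFrame({'Instance': ['B1S', 'B2S'],
                   'Price': ['$0.012/hour', '$0.047/hour']})

# strip the '$' and '/hour' decoration, then convert to float
df['Price'] = (df['Price']
               .str.replace('$', '', regex=False)
               .str.replace('/hour', '', regex=False)
               .astype(float))
print(df['Price'].tolist())  # [0.012, 0.047]
```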

The table values appear to be embedded in a JSON string, which can be parsed with json.loads. We can then pull the values out through the "regional" key, which is indexed by country/region.
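To illustrate the idea in isolation, assuming a data-amount attribute shaped roughly like the string below (the key names follow the answer's code; the exact contents on the live page may differ):

```python
import json

# hypothetical data-amount value as found on a <span> in the table cell
data_text = '{"regional": {"us-west-2": 0.154, "asia-pacific-southeast": 0.171}}'

# parse the JSON string and index into the per-region prices
prices = json.loads(data_text)['regional']
print(prices['us-west-2'])  # 0.154
```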

It's a bit involved, but at least it gets the values we want into a dataframe, like this:

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import numpy as np

# force maximum dataframe column width
pd.set_option('display.max_colwidth', 0)

url = 'https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('div', {'class': 'row row-size3 column'})

region = 'us-west-2' # Adjust your region here

def parse_table_as_dataframe(table):
    header = []
    columns = []

    name = table.h3.text

    try:
        # This part gets the first word in each column header so the table
        # fits reasonably in the display, adjust to your preference 
        header = [h.text.split()[0].strip() for h in table.thead.find_all('th')][1::]
    except AttributeError:
        return 'N/A'

    for row in table.tbody.find_all('tr'):
        for c in row.find_all('td')[1::]:
            if c.text.strip() not in (u'', u'$-') :
                if 'dash' in c.text.strip():
                    columns.append('-') # replace "&dash; &dash:" with a `-`
                else:
                    columns.append(c.text.strip())  
            else:
                try:
                    data_text = c.span['data-amount']
                    # data = json.loads(data_text)['regional']['asia-pacific-southeast']
                    data = json.loads(data_text)['regional'][region]
                    columns.append(data)
                except (KeyError, TypeError):
                    columns.append('N/A')



    num_rows = len(table.tbody.find_all('tr'))
    num_columns = len(header)

    # For debugging
    # print(len(columns), columns)
    # print(num_rows, num_columns)

    df = pd.DataFrame(np.array(columns).reshape(num_rows, num_columns), columns=header)
    return df

for n, table in enumerate(tables):
    print(n, table.h3.text)
    print(parse_table_as_dataframe(table))
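Since the question asks for a CSV, the frames can also be written out. A hypothetical helper (skipping the 'N/A' results that parse_table_as_dataframe returns for tables without a header):

```python
import pandas as pd

def save_frames(frames, prefix='azure_prices'):
    """Write each DataFrame to its own CSV file, skipping 'N/A' placeholders."""
    written = []
    for n, df in enumerate(frames):
        if isinstance(df, pd.DataFrame):
            path = f'{prefix}_{n}.csv'
            df.to_csv(path, index=False)
            written.append(path)
    return written

# e.g. save_frames(parse_table_as_dataframe(t) for t in tables)
```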

This yields 24 dataframes from the page, one per table (output omitted here).
