How to convert range strings (bins) to numeric values for use with Seaborn visualizations

'Age': {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'}, 'Ethnicity': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'}, 'Approximate Household Income': {0: '$175,000 - $199,999', 1: '$75,000 - $99,999', 2: '$25,000 - $49,999', 3: '$50,000 - $74,999', 4: nan}, 'Highest Level of Education Completed': {0: 'Four Year College Degree', 1: 'Four Year College Degree', 2: 'Jr College/Associates Degree', 3: 'Jr College/Associates Degree', 4: 'Four Year College Degree'}, '2020 Candidate Choice': {0: 'Joe Biden', 1: 'Joe Biden', 2: 'Donald Trump', 3: 'Joe Biden', 4: 'Donald Trump'}, '2016 Candidate Choice': {0: 'Hillary Clinton', 1: 'Third Party', 2: 'Donald Trump', 3: 'Hillary Clinton', 4: 'Third Party'}, 'Party Registration 2020': {0: 'Independent', 1: 'No Party', 2: 'No Party', 3: 'Independent', 4: 'Independent'}, 'Registered State for Voting': {0: 'Colorado', 1: 'Virginia', 2: 'California', 3: 'North Carolina', 4: 'Oregon'}

2 Answers

User

#1 · edited 2024-09-30 01:33:12

You can use a few of the pandas Series.str methods.

A smaller example dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Age": {0: "45-54", 1: "35-44", 2: "45-54", 3: "45-54", 4: "55-64"},
        "Ethnicity": {0: "White", 1: "White", 2: "White", 3: "White", 4: "White"},
        "Approximate Household Income": {
            0: "$175,000 - $199,999",
            1: "$75,000 - $99,999",
            2: "$25,000 - $49,999",
            3: "$50,000 - $74,999",
            4: np.nan,
        },
    }
)
#      Age Ethnicity Approximate Household Income
# 0  45-54     White          $175,000 - $199,999
# 1  35-44     White            $75,000 - $99,999
# 2  45-54     White            $25,000 - $49,999
# 3  45-54     White            $50,000 - $74,999
# 4  55-64     White                          NaN

We can loop over a list of columns and apply these methods to parse every range in the pandas.DataFrame:

The methods we will use, in order:

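pandas.Series.str.replace - removes the commas (replaces them with an empty string)
pandas.Series.str.extract - extracts the two numbers from each string with a regex
pandas.DataFrame.astype - converts the extracted numbers to floats
pandas.DataFrame.rename - renames the new columns
pandas.DataFrame.join - adds the extracted numbers back to the original DataFrame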
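To see what the regex captures on its own, here is a minimal sketch that uses the standard-library `re` module instead of pandas. The helper name `parse_range` and the `"65+"` input are hypothetical and only illustrate a string that does not match; the pattern itself is the one used in this answer.

```python
import re

# Same pattern as in the answer; commas are stripped before matching.
PATTERN = re.compile(r"^[$]*(\d+)[-\s$]*(\d+)$")


def parse_range(text):
    """Return (lower, upper) as floats, or None if text is not a range.

    Hypothetical helper, for illustration only.
    """
    match = PATTERN.match(text.replace(",", ""))
    if match is None:
        return None
    lower, upper = match.groups()
    return float(lower), float(upper)


print(parse_range("45-54"))                # (45.0, 54.0)
print(parse_range("$175,000 - $199,999"))  # (175000.0, 199999.0)
print(parse_range("65+"))                  # None
```

The first group captures the lower bound and the second the upper bound; `[-\s$]*` skips the hyphen, spaces, and dollar sign between them. A string with only one number, such as an open-ended bin, does not match, and in the pandas version it would come out as NaN in both new columns.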

for col in ["Age", "Approximate Household Income"]:
    df = df.join(
        df[col]
        .str.replace(",", "", regex=False)
        .str.extract(pat=r"^[$]*(\d+)[-\s$]*(\d+)$")
        .astype("float")
        .rename({0: f"{col}_lower", 1: f"{col}_upper"}, axis="columns")
    )
#      Age Ethnicity Approximate Household Income  Age_lower  Age_upper  \
# 0  45-54     White          $175,000 - $199,999       45.0       54.0   
# 1  35-44     White            $75,000 - $99,999       35.0       44.0   
# 2  45-54     White            $25,000 - $49,999       45.0       54.0   
# 3  45-54     White            $50,000 - $74,999       45.0       54.0   
# 4  55-64     White                          NaN       55.0       64.0   
# 
#    Approximate Household Income_lower  Approximate Household Income_upper  
# 0                            175000.0                            199999.0  
# 1                             75000.0                             99999.0  
# 2                             25000.0                             49999.0  
# 3                             50000.0                             74999.0  
# 4                                 NaN                                 NaN
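For plotting, a single number per bin is often more convenient than two bounds. The following is a sketch, not part of the original answer, that derives a midpoint column from the `_lower` and `_upper` columns. The small `parsed` frame is a hypothetical stand-in for the result of the loop above, and the `_mid` suffix is an assumed naming choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed columns produced by the loop above.
parsed = pd.DataFrame(
    {
        "Age_lower": [45.0, 35.0, 55.0],
        "Age_upper": [54.0, 44.0, 64.0],
        "Approximate Household Income_lower": [175000.0, 75000.0, np.nan],
        "Approximate Household Income_upper": [199999.0, 99999.0, np.nan],
    }
)

for col in ["Age", "Approximate Household Income"]:
    # Midpoint of each bin; NaN bounds propagate to a NaN midpoint.
    parsed[f"{col}_mid"] = (parsed[f"{col}_lower"] + parsed[f"{col}_upper"]) / 2

print(parsed[["Age_mid", "Approximate Household Income_mid"]])
```

A numeric column such as `Age_mid` could then be passed to a Seaborn function, for example `sns.histplot(data=parsed, x="Age_mid")`, assuming seaborn is installed and imported as `sns`. Whether the midpoint is a fair summary of a bin depends on the data, so treat it as one possible choice.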

User

#2 · edited 2024-09-30 01:33:12

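In this case, I would suggest writing a "manual" conversion for each type of category, based on the format of its strings. For example, for the age bins: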

age = {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'}
age_bins = {key: [int(age[key].split('-')[0]), int(age[key].split('-')[1])] for key in age}

{0: [45, 54], 1: [35, 44], 2: [45, 54], 3: [45, 54], 4: [55, 64]}
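The same manual approach could be extended to the income column, which also needs the dollar signs and commas removed and contains a missing value. This is a sketch under those assumptions; the helper name `income_to_bin` is hypothetical, and returning `None` for a missing value is just one possible convention.

```python
import math

income = {
    0: "$175,000 - $199,999",
    1: "$75,000 - $99,999",
    2: "$25,000 - $49,999",
    3: "$50,000 - $74,999",
    4: math.nan,
}


def income_to_bin(value):
    """Return [lower, upper] as ints, or None for a missing value.

    Hypothetical helper, for illustration only.
    """
    if not isinstance(value, str):
        # NaN is a float, so non-string values are treated as missing.
        return None
    lower, upper = value.replace("$", "").replace(",", "").split("-")
    # int() tolerates the spaces left around each number.
    return [int(lower), int(upper)]


income_bins = {key: income_to_bin(value) for key, value in income.items()}
print(income_bins)
```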
