How to convert range strings (bins) to numeric values for use with Seaborn visualizations

Published 2024-09-30 01:33:12


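I'm working in a Jupyter notebook with Python 3.7. I'm currently exploring some survey data as a pandas DataFrame imported from a .CSV file. I'd like to explore it further with some Seaborn visualizations, but the numeric data was collected as bins (age ranges, for example) and is stored as string values.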

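Is there a way to convert these columns (Age and Approximate Household Income) to numeric values that I can then use with Seaborn? I've tried searching, but my phrasing only seems to return ways to create bins for columns that already hold numeric values. What I'm really looking for is how to convert the string values to numbers.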

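Also, can anyone tell me how to improve my search approach? What would be the ideal phrasing when looking for a solution to something like this?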

Here is a sample of the DataFrame, produced with df.head(5).to_dict(); the values have been changed for anonymity.

{'Age': {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'},
 'Ethnicity': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
 'Approximate Household Income': {0: '$175,000 - $199,999',
  1: '$75,000 - $99,999',
  2: '$25,000 - $49,999',
  3: '$50,000 - $74,999',
  4: nan},
 'Highest Level of Education Completed': {0: 'Four Year College Degree',
  1: 'Four Year College Degree',
  2: 'Jr College/Associates Degree',
  3: 'Jr College/Associates Degree',
  4: 'Four Year College Degree'},
 '2020 Candidate Choice': {0: 'Joe Biden',
  1: 'Joe Biden',
  2: 'Donald Trump',
  3: 'Joe Biden',
  4: 'Donald Trump'},
 '2016 Candidate Choice': {0: 'Hillary Clinton',
  1: 'Third Party',
  2: 'Donald Trump',
  3: 'Hillary Clinton',
  4: 'Third Party'},
 'Party Registration 2020': {0: 'Independent',
  1: 'No Party',
  2: 'No Party',
  3: 'Independent',
  4: 'Independent'},
 'Registered State for Voting': {0: 'Colorado',
  1: 'Virginia',
  2: 'California',
  3: 'North Carolina',
  4: 'Oregon'}}

2 Answers

You can use a few pandas Series.str methods.

A smaller sample dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Age": {0: "45-54", 1: "35-44", 2: "45-54", 3: "45-54", 4: "55-64"},
        "Ethnicity": {0: "White", 1: "White", 2: "White", 3: "White", 4: "White"},
        "Approximate Household Income": {
            0: "$175,000 - $199,999",
            1: "$75,000 - $99,999",
            2: "$25,000 - $49,999",
            3: "$50,000 - $74,999",
            4: np.nan,
        },
    }
)
#      Age Ethnicity Approximate Household Income
# 0  45-54     White          $175,000 - $199,999
# 1  35-44     White            $75,000 - $99,999
# 2  45-54     White            $25,000 - $49,999
# 3  45-54     White            $50,000 - $74,999
# 4  55-64     White                          NaN

We can loop over a list of columns and apply these methods to parse every range in the pandas.DataFrame:

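The methods we'll use, in order:

  • Series.str.replace - removes the commas (replaces them with an empty string)
  • Series.str.extract - extracts the two numbers from each string; the regex allows an optional leading $, captures the first run of digits, skips any mix of hyphens, whitespace and $, then captures the second run of digits
  • DataFrame.astype - converts the extracted numbers to floats
  • DataFrame.rename - renames the new columns
  • DataFrame.join - adds the extracted numbers back to the original DataFrame
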
for col in ["Age", "Approximate Household Income"]:
    df = df.join(
        df[col]
        .str.replace(",", "", regex=False)
        .str.extract(pat=r"^[$]*(\d+)[-\s$]*(\d+)$")
        .astype("float")
        .rename({0: f"{col}_lower", 1: f"{col}_upper"}, axis="columns")
    )
#      Age Ethnicity Approximate Household Income  Age_lower  Age_upper  \
# 0  45-54     White          $175,000 - $199,999       45.0       54.0   
# 1  35-44     White            $75,000 - $99,999       35.0       44.0   
# 2  45-54     White            $25,000 - $49,999       45.0       54.0   
# 3  45-54     White            $50,000 - $74,999       45.0       54.0   
# 4  55-64     White                          NaN       55.0       64.0   
# 
#    Approximate Household Income_lower  Approximate Household Income_upper  
# 0                            175000.0                            199999.0  
# 1                             75000.0                             99999.0  
# 2                             25000.0                             49999.0  
# 3                             50000.0                             74999.0  
# 4                                 NaN                                 NaN  
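For plotting, a single number per row is often easier to work with than a pair of bounds. The following is a minimal sketch, assuming the midpoint of each bin is an acceptable stand-in for the true value (that is an assumption, not something stated in the question). The column names `Age_mid` and `Income_mid` are hypothetical, and the frame is rebuilt by hand here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Bounds as produced by the extraction step above (rebuilt here for a standalone example).
df = pd.DataFrame(
    {
        "Age_lower": [45.0, 35.0, 45.0, 45.0, 55.0],
        "Age_upper": [54.0, 44.0, 54.0, 54.0, 64.0],
        "Approximate Household Income_lower": [175000.0, 75000.0, 25000.0, 50000.0, np.nan],
        "Approximate Household Income_upper": [199999.0, 99999.0, 49999.0, 74999.0, np.nan],
    }
)

# Hypothetical column names: midpoint of each bin; rows with a missing bin stay NaN.
df["Age_mid"] = (df["Age_lower"] + df["Age_upper"]) / 2
df["Income_mid"] = (
    df["Approximate Household Income_lower"] + df["Approximate Household Income_upper"]
) / 2

print(df[["Age_mid", "Income_mid"]])
```

A numeric column such as `Age_mid` can then be passed to Seaborn, for example `sns.histplot(data=df, x="Age_mid")`. Keep in mind that midpoints only approximate the underlying values, so the resulting plots show the bins rather than the real distribution within them.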

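In this case, I'd suggest setting up a "manual" conversion for each type of category, based on the format of its strings. For example, for the age bins: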

age = {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'}
age_bins = {key: [int(age[key].split('-')[0]), int(age[key].split('-')[1])] for key in age}
# age_bins is now:
# {0: [45, 54], 1: [35, 44], 2: [45, 54], 3: [45, 54], 4: [55, 64]}
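The same manual approach could be extended to the income strings, which also contain dollar signs, commas, spaces and a missing value. This is a rough sketch only: `parse_range` is a hypothetical helper name, and it assumes every non-missing value has exactly the "low-high" shape shown in the question.

```python
import math


def parse_range(value):
    """Return [lower, upper] as ints for strings like '45-54' or '$25,000 - $49,999'.

    Missing values (None or a float NaN) are returned as None.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Drop the characters that are not part of the numbers, then split on the dash.
    cleaned = value.replace("$", "").replace(",", "").replace(" ", "")
    lower, upper = cleaned.split("-")
    return [int(lower), int(upper)]


income = {0: "$175,000 - $199,999", 1: "$75,000 - $99,999", 4: float("nan")}
income_bins = {key: parse_range(val) for key, val in income.items()}
print(income_bins)
# {0: [175000, 199999], 1: [75000, 99999], 4: None}
```

Open-ended categories (a "65+" style bin, for instance) do not appear in the sample and would need their own branch.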
