Splitting a column that contains both state and region names in Python

Published 2024-10-06 08:54:57


I am trying to create two separate columns from the following single column of a DataFrame:

0                         State_1
1                            Auburn
2                          Florence
3                      Jacksonville
4                        Livingston
5                        Montevallo
6                              Troy
7                        Tuscaloosa
8                          Tuskegee
9                            state_2
10                        Fairbanks
11                          state_3
12                        Flagstaff
13                            Tempe
14                           Tucson
15                         state_4
16                      Arkadelphia
17                           Conway
18                     Fayetteville
19                        Jonesboro
20                         Magnolia
21                       Monticello
22                     Russellville
23                           Searcy

I would like the df above to look like this:

0  state_1                 Auburn
2  state_1                 Florence
3  state_1                 Jacksonville
4  state_1                Livingston
5  state_1                Montevallo
6  state_1                Troy
7  state_1                Tuscaloosa
8  state_1                Tuskegee
 ...

16 state_4                   Arkadelphia
17 state_4                   Conway
18 state_4                   Fayetteville
19 state_4                   Jonesboro
20 state_4                   Magnolia
21 state_4                   Monticello
22 state_4                   Russellville
23 state_4                   Searcy

As you can see, I want to unpivot the data. I looked through the documentation for pd.pivot but made no progress. Here is a dictionary of states:

states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}

Here is the code I tried. Please note that this is an embarrassingly wrong attempt (I am close to a Python novice):

#create new column for states only
df['State'] = 0

#Duplicate above combined column
df['Column_duplicate'] = df['Column']

for i in range(len(df)):
    if (dfl['Column_duplicate'].iloc[i+1] == df['Column'].iloc[i]):
           dfl['State'].iloc[i] = dfl['Column'].iloc[i]

2 answers
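import numpy as np
import pandas as pd

# Read every non-empty line of the file into a single column
with open('university_towns.txt') as f:
    dfl = pd.DataFrame({'datamain': [line.strip() for line in f if line.strip()]})

# Keep only the text before the first "(" (drops the university names)
dfl = dfl['datamain'].str.split("(", n=1, expand=True)
dfl = dfl.loc[:, [0]].rename(columns={0: 'State'})
dfl['RegionName'] = dfl['State'].str.strip()

# Strip the bracketed marker from the state name; RN1 flags the header rows
dfl['State'] = dfl['State'].str.replace(r"\[.*\]", "", regex=True).str.strip()
dfl['RN1'] = dfl['RegionName'].str.contains(r"\[.*\]", regex=True)

# Blank out State on non-header rows, then forward-fill from the header above
for i in range(len(dfl)):
    if not dfl['RN1'].iloc[i]:
        dfl.loc[dfl.index[i], 'State'] = np.nan

dfl = dfl.ffill(axis=0)
dfl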

The data is here: https://en.wikipedia.org/wiki/List_of_college_towns#College_towns_in_the_United_States

Note that I am sure this is a rather laborious way to do it. In short: ffill() is the function I needed to create the State column.
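The loop in the code above is not strictly needed: the same idea can be sketched with where and ffill() alone. The sample lines below are made up for illustration (the real file may differ), and the bracketed "[edit]" marker on state rows is an assumption based on the regex used above.

```python
import pandas as pd

# Hypothetical sample lines in the style of university_towns.txt
lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)",
    "Troy (Troy University)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)",
]
raw = pd.Series(lines)

# True for state header rows, i.e. rows that carry a bracketed marker
is_state = raw.str.contains(r"\[.*\]", regex=True)

# State name on header rows, NaN elsewhere, then forward-filled downwards
state = (
    raw.str.replace(r"\[.*\]", "", regex=True)
    .str.strip()
    .where(is_state)
    .ffill()
)

# Region name is the text before the first "("
region = raw.str.split("(", n=1).str[0].str.strip()

# Drop the header rows, which are now redundant
result = pd.DataFrame({"State": state, "RegionName": region})[~is_state]
result = result.reset_index(drop=True)
print(result)
```

Because where leaves NaN on every non-header row, ffill() has exactly one value to carry down per state block, which is what makes the explicit loop unnecessary.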

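You can use where to mask the rows that contain state_, then use ffill() to fill a new column with those values. After that, drop every row where both columns hold the same state_ value.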

import pandas as pd

df = pd.read_csv("data.txt", header=None)
print(df)

mark = df[0].where(df[0].str.contains("state_", case=False))
df[1] = mark.ffill()
df = df[df.iloc[:, 0] != df.iloc[:, 1]]

df.columns = ['State', 'StateNum']
df = df[df.columns[::-1]].reset_index(drop=True)

print(df)

Output of df:

   StateNum         State
0   State_1        Auburn
1   State_1      Florence
2   State_1  Jacksonville
3   State_1    Livingston
4   State_1    Montevallo
5   State_1          Troy
6   State_1    Tuscaloosa
7   State_1      Tuskegee
8   state_2     Fairbanks
9   state_3     Flagstaff
10  state_3         Tempe
11  state_3        Tucson
12  state_4   Arkadelphia
13  state_4        Conway
14  state_4  Fayetteville
15  state_4     Jonesboro
16  state_4      Magnolia
17  state_4    Monticello
18  state_4  Russellville
19  state_4        Searcy
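If the header rows hold real state names instead of the state_N placeholders, the states dictionary from the question could serve as the mask in place of str.contains. This is only a sketch under that assumption: the dictionary is shortened, the rows are made up, and the column name "Column" is borrowed from the question's attempt.

```python
import pandas as pd

# Shortened version of the question's dictionary, for illustration only
states = {"AL": "Alabama", "AK": "Alaska", "AZ": "Arizona"}

# Hypothetical column that mixes state names and town names
df = pd.DataFrame(
    {"Column": ["Alabama", "Auburn", "Troy", "Alaska", "Fairbanks", "Arizona", "Tempe"]}
)

# Rows whose value is a known state name are treated as headers
is_state = df["Column"].isin(list(states.values()))

# Carry each header down to the rows below it, then drop the header rows
df["State"] = df["Column"].where(is_state).ffill()
out = (
    df.loc[~is_state, ["State", "Column"]]
    .rename(columns={"Column": "RegionName"})
    .reset_index(drop=True)
)
print(out)
```

One caveat with this variant: a town that shares its name with a state would be mistaken for a header row, so a marker-based mask is safer when the data provides one.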
