Creating a new column using a list

Posted 2024-10-06 12:16:59


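I am trying to create a new column containing city names. I have a list containing the required city names, as well as a CSV file that contains city names under different column names.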

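What I am trying to do is check whether a city name from the list exists in a particular column of the CSV file, and fill that city name into a new column, 'City'.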

My code is:

 
 
import pandas as pd
import numpy as np
 
City_Name_List = ['Amsterdam', 'Antwerp', 'Brussels', 'Ghent', 'Asheville', 'Austin', 'Boston', 'Broward County', 
                  'Cambridge', 'Chicago', 'Clark County Nv', 'Columbus', 'Denver', 'Hawaii', 'Jersey City', 'Los Angeles', 
                  'Nashville', 'New Orleans', 'New York City', 'Oakland', 'Pacific Grove', 'Portland', 'Rhode Island', 'Salem Or', 'San Diego']
 
 
data = {'host_identity_verified':['t','t','t','t','t','t','t','t','t','t'],
      'neighbourhood':['Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands', 'NaN',
                       'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands',
                        'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands', 'NaN',
                        'Amsterdam, North Holland, Netherlands', 'Amsterdam, North Holland, Netherlands'],
      'neighbourhood_cleansed':['Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West',
                                'Oostelijk Havengebied - Indische Buurt', 'Centrum-Oost', 'Centrum-West', 'Centrum-West', 'Centrum-West'],
     'neighbourhood_group_cleansed': ['NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN','NaN'],
      'latitude':[ 52.36575, 52.36509, 52.37297, 52.38761, 52.36719, 52.36575, 52.36509, 52.37297, 52.38761, 52.36719]}
 
df = pd.DataFrame(data)
 
 
df['City']  = [x for x in City_Name_List if x in df.loc[:,'host_identity_verified':'latitude'].values][0]

When I run the code, I get the following message:

Traceback (most recent call last):
  File "C:/Users/YAZAN/PycharmProjects/Yazan_Work/try.py", line 63, in <module>
    df['City'] = [x for x in City_Name_List if x in df.loc[:,'host_identity_verified':'latitude'].values][0]
IndexError: list index out of range

This is because, in the data, the city name Amsterdam is followed by other words.

I would like my output to look like this:

0    Amsterdam
1    Amsterdam
2    Amsterdam
3    Amsterdam
4    Amsterdam
5    Amsterdam
6    Amsterdam
7    Amsterdam
8    Amsterdam
9    Amsterdam
Name: City, dtype: object
 
 

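I have tried persistently to solve this. I tried using endswith, startswith and regex, but to no avail. I may have used these methods incorrectly. I hope someone can help me.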


3 answers

Use DataFrame.apply:

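df['City'] = df.apply(
    # first city from City_Name_List that occurs in this row's 'neighbourhood';
    # None if no listed city is found
    lambda row: next((x for x in City_Name_List if x in row.loc['neighbourhood']), None),
    axis=1
)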

After running the above, df['City'] will hold a city (one of those listed in City_Name_List) for every row whose 'neighbourhood' column contains one.

Improved solution

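You can be more explicit and specify that City should be populated from the first substring of each row's 'neighbourhood' field, that is, the text before the first occurrence of ','. If the 'neighbourhood' column is reliably uniform in structure, this may be a good idea, because it helps mitigate unwanted behaviour caused by similarly named cities, cities whose names are substrings of other cities in City_Name_List, and so on.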

df['City'] = df.apply(
    # only look at the text before the first comma; None if no listed city is found
    lambda row: next((x for x in City_Name_List if x in row.loc['neighbourhood'].split(',')[0]), None),
    axis=1
)

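Note: the solutions above are only examples of how you might approach the problem you ran into. They do not take proper handling of exceptions, edge cases and so on into account; you should take care of those in your own code.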
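As one possible way to cover the edge cases mentioned in the note, here is a minimal sketch. The helper name `first_city` and the small sample frame are hypothetical and not part of the original answer. It assumes the city is always the text before the first comma, and it compares for an exact match rather than a substring, which is a stricter test than the lambdas above.

```python
import pandas as pd

# Hypothetical helper (not from the original answer): return the listed city
# that equals the text before the first comma, or None if there is none.
def first_city(value, cities):
    # Real missing values (None, float NaN) are not strings, so skip them
    if not isinstance(value, str):
        return None
    head = value.split(',')[0].strip()
    return head if head in cities else None

cities = ['Amsterdam', 'Antwerp', 'New York City']
df = pd.DataFrame({'neighbourhood': [
    'Amsterdam, North Holland, Netherlands',
    'NaN',                                   # the literal string used in the question
    None,                                    # a genuinely missing value
    'New York City, New York, United States',
]})

df['City'] = df['neighbourhood'].apply(lambda v: first_city(v, cities))
print(df['City'])
```

Rows without a recognisable city end up as missing values, so they can be inspected afterwards with `df['City'].isna()` instead of raising an IndexError.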

df['City'] = df['neighbourhood'].apply(lambda x: [i for i in x.split(',') if i in City_Name_List])
df['City'] = df['City'].apply(lambda x: "" if len(x) == 0 else x[0])
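Since the question mentions trying regex, the same lookup can probably also be written without apply, using the pandas string accessor. This is only a sketch under assumptions: the shortened city list and sample frame are made up for illustration, and the pattern matches a city name anywhere in the string, not only before the first comma.

```python
import re
import pandas as pd

City_Name_List = ['Amsterdam', 'Antwerp', 'New York City']
df = pd.DataFrame({'neighbourhood': [
    'Amsterdam, North Holland, Netherlands',
    'NaN',
    'New York City, New York, United States',
]})

# Build one alternation from the list; escape each name so it is matched
# literally, and try longer names first so they win over shorter ones.
names = sorted(City_Name_List, key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(name) for name in names) + ')'

# With a single capture group and expand=False, str.extract returns a Series;
# rows where nothing matches become NaN.
df['City'] = df['neighbourhood'].str.extract(pattern, expand=False)
print(df['City'])
```

Unlike the two-step version above, unmatched rows come back as NaN rather than an empty string; `df['City'].fillna('')` would give the same result if that is preferred.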

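The problem is that when you write x in df.loc[], you are not checking whether the city name occurs inside each individual string. You are checking whether the city name is itself one of the values in the whole selection, which it is not. What you need is something like this: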

df['City'] = [parts[0] if parts[0] in City_Name_List else '' for parts in df['neighbourhood'].str.split(',')]

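This splits each row of df['neighbourhood'] on commas and takes the first value, then checks whether that value is in the list of city names; if it is, it is placed in the 'City' series.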
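To make the difference between the two kinds of `in` concrete, here is a small illustration. The two-row array is a made-up stand-in for what df.loc[...].values returns in the question, not the actual data.

```python
import numpy as np

# Stand-in for df.loc[:, ...].values: a 2-D object array of cell values
values = np.array([
    ['t', 'Amsterdam, North Holland, Netherlands'],
    ['t', 'NaN'],
], dtype=object)

# `in` on a NumPy array asks whether any cell EQUALS the value
whole_match = 'Amsterdam' in values
exact_match = 'Amsterdam, North Holland, Netherlands' in values

# `in` on a single string asks whether the value is a SUBSTRING of it
substring_match = 'Amsterdam' in values[0, 1]

print(whole_match, exact_match, substring_match)
```

In the question's code the first kind of test fails for every name in the list, so the comprehension yields an empty list and indexing it with [0] raises the IndexError shown in the traceback.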
