How do I count the comma-separated values in a column of a pandas table?

import pandas as pd

# Column names for the Yelp business data
businessdata = ['Name of Location', 'Address', 'City', 'Zip Code', 'Website', 'Yelp',
                '# Reviews', 'Yelp Rating Stars', 'BarRestStore', 'Category',
                'Price Range', 'Alcohol', 'Ambience', 'Latitude', 'Longitude']
business = pd.read_table('FL_Yelp_Data_v2.csv', sep=',', header=1, names=businessdata)
print('\n\nBusiness\n')
print(business[:6])

Category                                             # Categories
French                                               1
Adult Entertainment , Lounges , Music Venues         3
American (New) , Steakhouses                         2
American (New) , Beer, Wine & Spirits , Gastropubs   4
Chicken Wings , Sports Bars , American (New)         3
Japanese                                             1

business = pd.read_table('FL_Yelp_Data_v2.csv', sep=',', header=1, names=businessdata, skip_blank_lines=True)
#business = pd.read_csv('FL_Yelp_Data_v2.csv')
business['Category'].str.split(',').apply(len)
#not sure where to declare the df part in the suggestions that use it.
print(business[:6])

3 Answers

User

Answer 1 · edited on 2024-10-04 01:29:28

Assuming Category is actually a list, you can use apply (as @EdChum suggested):

business['# Categories'] = business.Category.apply(len)

If it is not, you first need to parse it and convert it into a list.

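A minimal sketch of that parsing step, assuming each Category cell is a plain comma-separated string. The small `business` frame is made-up sample data, and the list comprehension mirrors the one timed later in this answer:

```python
import pandas as pd

# Made-up sample rows standing in for the real Yelp file.
business = pd.DataFrame({
    'Category': ['French',
                 'Adult Entertainment , Lounges , Music Venues',
                 'American (New) , Steakhouses'],
})

# Split each string on commas and strip the whitespace around every piece.
business['Category'] = business.Category.map(
    lambda x: [i.strip() for i in x.split(',')]
)

# Each cell is now a list, so len gives the number of categories.
business['# Categories'] = business.Category.apply(len)
print(business)
```

Note that a plain split on ',' also splits a category name that itself contains a comma, such as "Beer, Wine & Spirits".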

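Can you show a sample of this column's output (including the correct quoting)?

P.S. @EdChum, thank you for your suggestions. I appreciate them. I believe the list comprehension approach may be faster, based on some sample text data with 30k+ rows that I tested: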

%%timeit
df.Category.str.strip().str.split(',').apply(len)
10 loops, best of 3: 44.8 ms per loop

%%timeit
df.Category.map(lambda x: [i.strip() for i in x.split(",")])
10 loops, best of 3: 28.4 ms per loop

Even when the len function call is included:

%%timeit
df.Category.map(lambda x: len([i.strip() for i in x.split(",")]))
10 loops, best of 3: 30.3 ms per loop

User

Answer 2 · edited on 2024-10-04 01:29:28

Use pd.read_csv to make reading the input easier:

business = pd.read_csv('FL_Yelp_Data_v2.csv')

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

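Once the DataFrame is created, you can write a function that splits the Category column on ',' and counts the length of the resulting list. Use a lambda with apply.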
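As a rough sketch of that idea, assuming the file has a header row and that multi-category cells are quoted so that read_csv keeps them in a single field. The in-memory CSV text and the helper name `count_categories` are made up for illustration; a lambda would work the same way as the named function:

```python
import io
import pandas as pd

# Made-up CSV text standing in for FL_Yelp_Data_v2.csv.
csv_text = (
    'Name of Location,Category\n'
    'Cafe A,French\n'
    'Bar B,"Chicken Wings , Sports Bars , American (New)"\n'
)
business = pd.read_csv(io.StringIO(csv_text))

def count_categories(cell):
    # Hypothetical helper: split on commas and count the pieces.
    return len(cell.split(','))

business['# Categories'] = business['Category'].apply(count_categories)
print(business[['Category', '# Categories']])
```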

User

Answer 3 · edited on 2024-10-04 01:29:28

This works:

business['# Categories'] = business['Category'].apply(lambda x: len(x.split(',')))

If you need to handle NA values and the like, you can pass a more elaborate function instead of the lambda.
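One possible shape for such a function, as a sketch only. It assumes that a missing Category should count as 0 and that empty pieces should be ignored, neither of which the question specifies. The name `count_categories_safe` and the sample data are made up:

```python
import pandas as pd

def count_categories_safe(cell):
    # Treat missing values (NaN/None) as zero categories (an assumption).
    if pd.isna(cell):
        return 0
    # Count only the non-empty pieces after splitting on commas.
    return len([part for part in cell.split(',') if part.strip()])

# Made-up sample data, including a missing value and an empty string.
business = pd.DataFrame(
    {'Category': ['Japanese', None, 'American (New) , Steakhouses', '']}
)
business['# Categories'] = business['Category'].apply(count_categories_safe)
print(business)
```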
