I have a very large DataFrame of cars. It looks like this:
   Text                                   Terms
0  Car's model porche year in data        [tech, window, tech]
1  we’re simply making fossil fuel cars   [brakes, engine, Italy, nice]
2  Year of cars Ferrari to make           [Detroit, window, seats, engine]
3  reading the specs of Ferrari file      [tech, window, engine, v8, window]
4  likelihood Porche in the car list      [from, wheel, tech]
I also have these:
term_list = ['tech', 'engine', 'window']
cap_list = ['Ferrari', 'porche']
term_cap_dict = {'Ferrari': ['engine', 'window'], 'Porche': ['tech']}
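I want a result DataFrame that counts how many times each term (in term_list) appears in the 'Terms' column, counting only when the 'Text' column contains the corresponding key (from term_cap_dict). For example, the conditional count of the term 'tech' given Porche is 3, because 'Porche' appears in the corresponding 'Text', even though 'tech' appears 4 times in total. If the count is 0 or the conditioning text is absent, the conditional count should default to 0. Desired output: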
Terms Cap ConditionalCount
0 engine Ferrari 2
1 engine porche 0
2 tech Ferrari 0
3 tech porche 3
4 window Ferrari 3
5 window porche 1
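To make the 'tech' given Porche example concrete, here is a minimal plain-Python sketch of the counting rule for a single (term, cap) pair. The helper name `conditional_count` and the case-insensitive substring match are assumptions for illustration, not part of the original code; only the three rows whose Terms contain 'tech' are listed, since the other two contribute nothing to either count.

```python
# (Text, Terms) pairs for the rows that mention 'tech'
rows = [
    ("Car's model porche year in data", ["tech", "window", "tech"]),
    ("reading the specs of Ferrari file", ["tech", "window", "engine", "v8", "window"]),
    ("likelihood Porche in the car list", ["from", "wheel", "tech"]),
]


def conditional_count(rows, term, cap):
    """Count occurrences of `term`, but only in rows whose text mentions `cap`.

    Hypothetical helper; matching is a case-insensitive substring test.
    """
    return sum(
        terms.count(term)
        for text, terms in rows
        if cap.lower() in text.lower()
    )


total_tech = sum(terms.count("tech") for _, terms in rows)
tech_given_porche = conditional_count(rows, "tech", "porche")

print(total_tech)         # 4
print(tech_given_porche)  # 3
```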
Here is what I have so far (it only computes TotalCount, not the conditional count):
from collections import Counter
from itertools import chain, product
import pandas as pd

# Normalise the keys and values of the mapping to lower case
term_cap_dict = {k.lower(): list(map(str.lower, v)) for k, v in term_cap_dict.items()}
# Count every term across all rows of the 'Terms' column
terms_counter = Counter(chain.from_iterable(df['Terms']))
terms_series = pd.Series(terms_counter)
terms_df = pd.DataFrame({'Term': terms_series.index, 'TotalCount': terms_series.values})
df1 = terms_df[terms_df['Term'].isin(term_list)]
# Build every (term, capability) pair and attach the total counts
product_terms = product(term_list, cap_list)
df_cp = pd.DataFrame(product_terms, columns=['Terms', 'Capability'])
dff = df_cp.set_index('Terms').combine_first(df1.set_index('Term')).reset_index()
dff.rename(columns={'index': 'Terms'}, inplace=True)
It gives the TotalCount:
Terms Capability TotalCount
0 engine Ferrari 3.0
1 engine porche 3.0
2 tech Ferrari 4.0
3 tech porche 4.0
4 window Ferrari 4.0
5 window porche 4.0
From this point on, I don't know how to compute the conditional count. Any suggestions would be greatly appreciated.
df.to_dict()
{'Title': {0: "Car's model porche year in data",
1: 'we’re simply making fossil fuel cars',
2: 'Year of cars Ferrari to make',
3: 'reading the specs of Ferrari file',
4: 'likelihood Porche in the car list'},
'Terms': {0: ['tech', 'window', 'tech'],
1: ['brakes', 'engine', 'Italy', 'nice'],
2: ['Detroit', 'window', 'seats', 'engine'],
3: ['tech', 'window', 'engine', 'v8', 'window'],
4: ['from', 'wheel', 'tech']}}
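One possible way to get the conditional counts from this data is sketched below; it is an illustration under stated assumptions, not the code from any answer. It assumes the text column is named 'Text' as in the question (the to_dict() output calls it 'Title'), matches caps by case-insensitive substring, and counts a (term, cap) pair only when term_cap_dict lists that term under that cap. Under that reading the window/porche pair comes out as 0, whereas the desired output above lists 1, so the rule may need adjusting to the intended semantics.

```python
import pandas as pd
from itertools import product

df = pd.DataFrame({
    "Text": [
        "Car's model porche year in data",
        "we’re simply making fossil fuel cars",
        "Year of cars Ferrari to make",
        "reading the specs of Ferrari file",
        "likelihood Porche in the car list",
    ],
    "Terms": [
        ["tech", "window", "tech"],
        ["brakes", "engine", "Italy", "nice"],
        ["Detroit", "window", "seats", "engine"],
        ["tech", "window", "engine", "v8", "window"],
        ["from", "wheel", "tech"],
    ],
})
term_list = ["tech", "engine", "window"]
cap_list = ["Ferrari", "porche"]
term_cap_dict = {"Ferrari": ["engine", "window"], "Porche": ["tech"]}

# Lower-case the mapping so 'Porche' and 'porche' are treated alike
allowed = {cap.lower(): {t.lower() for t in terms}
           for cap, terms in term_cap_dict.items()}

# One row per (Text, single term); duplicates inside a list are kept
exploded = df.explode("Terms").reset_index(drop=True)
text_lower = exploded["Text"].str.lower()
terms_lower = exploded["Terms"].str.lower()

records = []
for term, cap in product(sorted(term_list), cap_list):
    if term.lower() in allowed.get(cap.lower(), set()):
        # Term matches AND the row's text mentions the cap
        mask = terms_lower.eq(term.lower()) & text_lower.str.contains(
            cap.lower(), regex=False
        )
        count = int(mask.sum())
    else:
        # Pair not listed in term_cap_dict: default to 0
        count = 0
    records.append({"Terms": term, "Cap": cap, "ConditionalCount": count})

result = pd.DataFrame(records)
print(result)
```

Exploding the list column first keeps repeated terms (such as the two 'tech' entries in row 0) as separate rows, so a boolean mask sum counts occurrences rather than rows of the original frame.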
Update:
Output:
IIUC, try this:
Output: