合并Pandas数据框,如果字符串df2.domain出现在df.u中

2024-07-05 11:24:41 发布

您现在位置:Python中文网/ 问答频道 /正文

我有两个数据帧:df

ID   url
111   vk.com/audio/12353546
222   twitter.com/lenad
333   avito.ru/phones
333   facebook.ru/chats

还有另一个df2

domain   Maincategory   Subcategory
vk.com    Entertainment   Social Network
twitter.com    Entertainment   Social Network
facebook.com   Entertainment   Social Network
avito.com     Online shop      Buys
vk.com/audio    Entertainment   Social Network Music

我需要做以下工作: 如果df2.domain.values.tolist()中的某个字符串出现在df.url中,请添加df2MaincategorySubcategory)中的df列并获得:

ID   url    Maincategory    Subcategory
111   vk.com/audio/12353546    Entertainment   Social Network Music
222   twitter.com/lenad    Entertainment   Social Network
333   avito.ru/phones    Online shop      Buys
333   facebook.ru/chats    Entertainment   Social Network

我有办法做,但时间太长了。你知道吗

mapping = dict(df2.set_index('domain')['Maincategory'])
def map_to_substring(x):
    for key in mapping.keys():
        if key in x:
            return mapping[key]
    return 'None'

df['Maincategory'] = df.url.apply(lambda x: map_to_substring(x))

mapping1 = dict(df2.set_index('domain')['Subcategory'])
def map_to_substring1(x):
    for key in mapping1.keys():
        if key in x:
            return mapping1[key]
    return 'None'

df['Subcategory'] = df.url.apply(lambda x: map_to_substring1(x))

我该如何改进?你知道吗


Tags: tokeycomurlmapdfdomainru
2条回答

我的理解是你想从df得到id并把它放在df2上?你知道吗

有很多方法可以做到这一点。您可以合并df:

pd.merge(df2,df, on='url')

但是,如果“url”上有重复的值,则可能会导致问题

您还可以映射

df2['id'] = df2['url'].map(lambda x: df.loc[df['url']==x, 'ID'].iloc[0])

希望有帮助,和平

import pandas as pd
df1 = pd.DataFrame({'ID': ['111', '222', '333', '333'],'url':['vk.com/audio/12353546','twitter.com/lenad','avito.ru/phones','facebook.ru/chats']})
print "  original df1  "
print df1
df2 = pd.DataFrame({
                'Maincaregory':['Entertainment','Entertainment','Entertainment','Online shop','Entertainment'],
                'Subcategory':['Social Network','Social Network','Social Network','Buys','Social Network Music'],
               'domain':  ['vk.com','twitter.com','facebook.com','avito.com','vk.com/audio']})
print "\n  original df2  "                    
print df2
row = df1.shape[0]
dname =[]
for x in range(row):
    name = df1.iloc[x]['url'].split(".")
    if name[0] == 'vk':
        cat = df1.iloc[x]['url'].split("/")
        dname.append(cat[0]+"/"+cat[1])
    else:
        dname.append(name[0]+".com")

df1['domain']=dname
print "\n  for merge df1  "
print df1

df3 = pd.merge(df1,df2, how='inner',on=['domain'])
df3= df3.drop('domain',1)
print "\n  what you want  "
print df3

结果:

  original df1  
    ID                    url
0  111  vk.com/audio/12353546
1  222      twitter.com/lenad
2  333        avito.ru/phones
3  333      facebook.ru/chats

  original df2  
    Maincaregory           Subcategory        domain
0  Entertainment        Social Network        vk.com
1  Entertainment        Social Network   twitter.com
2  Entertainment        Social Network  facebook.com
3    Online shop                  Buys     avito.com
4  Entertainment  Social Network Music  vk.com/audio

  for merge df1  
    ID                    url        domain
0  111  vk.com/audio/12353546  vk.com/audio
1  222      twitter.com/lenad   twitter.com
2  333        avito.ru/phones     avito.com
3  333      facebook.ru/chats  facebook.com

  what you want  
    ID                    url   Maincaregory           Subcategory
0  111  vk.com/audio/12353546  Entertainment  Social Network Music
1  222      twitter.com/lenad  Entertainment        Social Network
2  333        avito.ru/phones    Online shop                  Buys
3  333      facebook.ru/chats  Entertainment        Social Network

相关问题 更多 >