Summing values in pandas DataFrame columns based on the row index value

In [1]: import pandas as pd

In [2]: from pandas import DataFrame

In [3]: df = DataFrame({'domain1.com/url1': [True, False, False, True, False],
   ...:                 'domain2.com/url2': [False, True, False, True, True],
   ...:                 'domain1.com/url3': [False, False, False, True, False],
   ...:                 'domain3.com/url4': [False, True, False, True, False],
   ...:                 'domain2.com/url5': [False, True, False, True, True]},
   ...:                index=['domain1.com/url1', 'domain2.com/url2', 'domain1.com/url3',
   ...:                       'domain3.com/url4', 'domain2.com/url5'])

In [4]: df
Out[4]:
                 domain1.com/url1 domain1.com/url3 domain2.com/url2  \
domain1.com/url1             True            False            False
domain2.com/url2            False            False             True
domain1.com/url3            False            False            False
domain3.com/url4             True             True             True
domain2.com/url5            False            False             True

                 domain2.com/url5 domain3.com/url4
domain1.com/url1            False            False
domain2.com/url2             True             True
domain1.com/url3            False            False
domain3.com/url4             True             True
domain2.com/url5             True            False

In [9]: df_t = df.T

In [10]: df_t[ filter(lambda x: x.split('/')[0] != df_t.index.map(lambda x: x.split('/')[0]), list(df_t)) ].sum(axis=0)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-279439127551> in <module>()
----> 1 df_t[ filter(lambda x: x.split('/')[0] != df_t.index.map(lambda x: x.split('/')[0]), list(df_t)) ].sum(axis=0)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

2 Answers

User

Answer 1 · edited 2024-10-04 05:28:40

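Create (index, column) pairs. Filter out the pairs whose two URLs share the same domain and fill those positions with False. Then add the sum along axis 1 to the sum along axis 0.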

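def domain(x):
    # Everything before the first '/' is the domain
    return x.str.extract(r'([^/]+)', expand=False)

# One row per (index, column) pair, split into columns u1 and u2
dfi = df.stack().index.to_series().apply(lambda x: pd.Series(x, ['u1', 'u2']))
# True where the two URLs belong to different domains
keep = domain(dfi.u1) != domain(dfi.u2)
# .ix has been removed from pandas; .loc accepts the same boolean mask
df1 = df.stack().loc[keep].unstack().fillna(False).astype(bool)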

df1.sum(0) + df1.sum(1)

domain1.com/url1    1
domain1.com/url3    1
domain2.com/url2    2
domain2.com/url5    1
domain3.com/url4    5
dtype: int64
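
The steps described in this answer can also be sketched as a self-contained script. This is only an illustration under stated assumptions: the helper name `domain` mirrors the answer, but here it maps over index labels, and the same-domain pairs are masked with `Series.where` rather than dropped and refilled, so it is a variant of the approach and not the answerer's exact code.

```python
import pandas as pd

urls = ['domain1.com/url1', 'domain2.com/url2', 'domain1.com/url3',
        'domain3.com/url4', 'domain2.com/url5']
df = pd.DataFrame({
    'domain1.com/url1': [True, False, False, True, False],
    'domain2.com/url2': [False, True, False, True, True],
    'domain1.com/url3': [False, False, False, True, False],
    'domain3.com/url4': [False, True, False, True, False],
    'domain2.com/url5': [False, True, False, True, True],
}, index=urls)


def domain(labels):
    # Assumption: the domain is everything before the first '/'.
    return labels.map(lambda url: url.split('/')[0])


# Series indexed by (row URL, column URL) pairs.
stacked = df.stack()
row_dom = domain(stacked.index.get_level_values(0))
col_dom = domain(stacked.index.get_level_values(1))

# Boolean array: True where the pair links two different domains.
keep = row_dom != col_dom

# Force same-domain pairs to False, then restore the square layout.
cleaned = stacked.where(keep, False).unstack()

# Column sums plus row sums, aligned on the URL labels.
totals = cleaned.sum(axis=0) + cleaned.sum(axis=1)
print(totals)
```

Because the two sums are added by label, the order in which `unstack` arranges rows and columns does not affect the totals.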

User

Answer 2 · edited 2024-10-04 05:28:40

Not very pandas-like, but... you can iterate over the elements:

In [40]: def same_domain(url1, url2):
   ....:     return url1.split('/')[0] == url2.split('/')[0]
   ....:

In [41]: def clear_inner_links(df):
   ....:     for r in df.index:
   ....:         for c in df.columns:
   ....:             if same_domain(r, c):
   ....:                 df.loc[r, c] = False
   ....:     return df
   ....:

Then:

^{pr2}$
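
As a hedged sketch of this step (the exact call is not shown here), one might run the function on a copy of the frame and then total each URL's row and column counts, as in the first answer. The three-URL frame and the names `urls`, `cleaned`, and `totals` are hypothetical and chosen only to keep the example small.

```python
import pandas as pd


def same_domain(url1, url2):
    return url1.split('/')[0] == url2.split('/')[0]


def clear_inner_links(df):
    # Set every cell whose row URL and column URL share a domain to False.
    for r in df.index:
        for c in df.columns:
            if same_domain(r, c):
                df.loc[r, c] = False
    return df


# Hypothetical toy data: every URL links to every URL.
urls = ['a.com/1', 'a.com/2', 'b.com/1']
df = pd.DataFrame(True, index=urls, columns=urls)

# Pass a copy, because clear_inner_links modifies its argument in place.
cleaned = clear_inner_links(df.copy())

# Column sums plus row sums, aligned on the URL labels.
totals = cleaned.sum(axis=0) + cleaned.sum(axis=1)
print(totals)
```

Passing `df.copy()` keeps the original frame intact, which matters if the uncleaned link matrix is needed later.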

Benchmark:

In [35]: new_df.shape
Out[35]: (500, 500)

In [36]: %timeit clear_inner_links(new_df)
1 loop, best of 3: 956 ms per loop

A more pandas-like way:

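In [102]: def same_domain(url1, url2):
   .....:     return url1.split('/')[0] == url2.split('/')[0]
   .....:

In [103]: def apply_criterion(s):
   .....:     # s is one column; s.name is that column's URL
   .....:     s = s.copy()
   .....:     mask = s.index.map(lambda x: same_domain(x, s.name)).to_numpy(dtype=bool)
   .....:     s[mask] = False
   .....:     # Return the column so apply() can assemble the result
   .....:     return s
   .....:

In [104]: def clear_inner_links2(df):
   .....:     return df.apply(apply_criterion, axis=0)
   .....: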

In [105]: new_df.shape
Out[105]: (500, 500)

In [106]: %timeit clear_inner_links2(new_df)
1 loop, best of 3: 929 ms per loop

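For DataFrames with more than 1000 rows and columns, the second solution performs much better than the first one (or than piRSquared's solution, which is about 50 times slower).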
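
Neither answer uses it, but the same-domain mask can also be built in one step with NumPy broadcasting, which avoids Python-level iteration over cells or columns. The following is a minimal sketch of that alternative, not a method from the thread; the function name `clear_inner_links_vectorized` and the toy frame are hypothetical, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd


def clear_inner_links_vectorized(df):
    # Assumption: the domain is everything before the first '/'.
    row_dom = np.array([url.split('/')[0] for url in df.index])
    col_dom = np.array([url.split('/')[0] for url in df.columns])
    # Broadcasting compares every row domain with every column domain,
    # giving a boolean matrix with the same shape as df.
    same = row_dom[:, None] == col_dom[None, :]
    # Return a new frame with same-domain cells replaced by False.
    return df.mask(same, False)


# Hypothetical toy data: every URL links to every URL.
urls = ['a.com/1', 'a.com/2', 'b.com/1']
df = pd.DataFrame(True, index=urls, columns=urls)

cleaned = clear_inner_links_vectorized(df)
totals = cleaned.sum(axis=0) + cleaned.sum(axis=1)
print(totals)
```

Unlike the loop-based version, this function returns a new frame and leaves its input unchanged.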
