Grouping by dimensions according to some rules in Python

Posted 2024-09-30 20:38:03

The data looks like this:

datas = [
    ['/page_1', 1],
    ['/page_1?x=123', 2],
    ['/page_1/subpage_1', 1],
    ['/page_2', 10],
]

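I want to apply a custom groupby operation to it that merges every path sharing the same first path segment. The result should be: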

datas = [
    ['/page_1', 4],
    ['/page_2', 10],
]

How can I implement this efficiently in plain Python, or conveniently with pandas?

Thanks a lot.

More importantly, it should also be possible to group by two dimensions, like this:

#-- raw data
datas = [
    ['/page_1', 'China', 1],
    ['/page_1?x=123', 'China', 2],
    ['/page_1/subpage_1', 'US', 1],
    ['/page_2', 'Britain', 10],
]

#-- expected result
datas = [
    ['/page_1', 'China', 3],
    ['/page_1', 'US', 1],
    ['/page_2', 'Britain', 10],
]

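I have already implemented a version for the one-dimensional groupby (here each row is [path, pv, uv], and rows sharing a prefix are assumed to be adjacent):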

def mergeRowWithSameSuffix(datas):
    curPrefix = None
    curPrefixPV = 0
    curPrefixUV = 0

    rtn = []

    for data in datas:
        pagePathLevel2 = data[0].replace("'", "")  # strip stray quotes from the path
        pv = int(data[1])
        uv = int(data[2])

        if not curPrefix:
            curPrefix = pagePathLevel2
            curPrefixPV = pv
            curPrefixUV = uv
        elif pagePathLevel2.startswith(curPrefix+"?") or pagePathLevel2.startswith(curPrefix+"/"):
            curPrefixPV += pv
            curPrefixUV += uv
        else:
            rtn.append([curPrefix, curPrefixPV, curPrefixUV])
            curPrefix = pagePathLevel2
            curPrefixPV = pv
            curPrefixUV = uv

    rtn.append([curPrefix, curPrefixPV, curPrefixUV])

    return rtn

But this obviously does not work for a two-dimensional groupby, so I think there must be a way to do it with pandas.


2 Answers

Combining DataFrame methods with a regular expression that extracts the root of the page path should do the trick.

# Do imports
import re
import pandas as pd

# Define regular expression to pull out root
xpr = re.compile('/([^/?]+)')
# Define initial dataframe, assuming your 3-column example above
df = pd.DataFrame(datas,columns=['Page','Country','Count'])
# Create a column for the root of the page column by applying a regular expression
df['Root'] = df['Page'].apply(lambda v:re.match(xpr,v).groups(0)[0])

# At this point, dataframe looks like:
#                 Page  Country  Count    Root
# 0            /page_1    China      1  page_1
# 1      /page_1?x=123    China      2  page_1
# 2  /page_1/subpage_1       US      1  page_1
# 3            /page_2  Britain     10  page_2

# Sum Count over the Root & Country groups
# (Count is selected explicitly so the text column Page is not aggregated)
results = df.groupby(['Root', 'Country'])[['Count']].sum()
#                 Count
# Root   Country       
# page_1 China        3
#        US           1
# page_2 Britain     10
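
The question asks for the result as a list of rows whose first field keeps the leading slash ('/page_1'). A minimal sketch of one way to get there with the same regex-plus-groupby idea follows. The pattern `^(/[^/?]+)` and the names `grouped` and `result` are assumptions made for this illustration and are not part of the answer above.

```python
import pandas as pd

datas = [
    ['/page_1', 'China', 1],
    ['/page_1?x=123', 'China', 2],
    ['/page_1/subpage_1', 'US', 1],
    ['/page_2', 'Britain', 10],
]

df = pd.DataFrame(datas, columns=['Page', 'Country', 'Count'])

# Assumed pattern: capture the leading slash plus everything up to the
# next '/' or '?', so the root has the '/page_1' form from the question.
df['Root'] = df['Page'].str.extract(r'^(/[^/?]+)', expand=False)

# Sum only the Count column; as_index=False keeps Root and Country
# as ordinary columns instead of moving them into the index.
grouped = df.groupby(['Root', 'Country'], as_index=False)['Count'].sum()

# Convert back to a plain list of [root, country, total] rows.
result = grouped.values.tolist()
print(result)
```

Dropping 'Country' from the groupby key list should give the one-dimensional variant of the same grouping.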

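If the part of the path before the first ? or / always has the same length, you can group by a column selected with iloc and sliced by indexing with str: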

print(df.iloc[:, 0].str[:7])
0    /page_1
1    /page_1
2    /page_1
3    /page_2
Name: 0, dtype: object

print(df.groupby(df.iloc[:, 0].str[:7])[1].sum().reset_index())
         0   1
0  /page_1   4
1  /page_2  10

Or, for two dimensions:

print(df.groupby([df.iloc[:, 0].str[:7], df.iloc[:, 1]])[2].sum().reset_index())
         0        1   2
0  /page_1    China   3
1  /page_1       US   1
2  /page_2  Britain  10
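
Fixed-width slicing silently produces wrong groups when the roots differ in length, so it may be worth checking that assumption first. A small sketch of such a check, where the pattern and the names `roots` and `same_width` are hypothetical and chosen only for this example:

```python
import pandas as pd

datas = [
    ['/page_1', 'China', 1],
    ['/page_1?x=123', 'China', 2],
    ['/page_1/subpage_1', 'US', 1],
    ['/page_2', 'Britain', 10],
]
df = pd.DataFrame(datas)

# Root of each path: the leading slash plus everything up to the
# next '/' or '?' (assumed pattern for this illustration).
roots = df[0].str.extract(r'^(/[^/?]+)')[0]

# Slicing with a fixed width is only safe if all roots share one length.
same_width = roots.str.len().nunique() == 1
print(same_width)
```

If `same_width` is False, slicing with a single width would mix or split groups, and a regex-based extraction is the safer choice.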

If the lengths differ, use str.extract on the column selected with iloc:

print(df)
                   0        1   2
0         /paaaage_1    China   1
1   /paaaage_1?x=123    China   2
2  /page_1/subpage_1       US   1
3            /page_2  Britain  10

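xpr = re.compile('/([^/?]+)')
# str.extract returns a DataFrame; [0] picks the single capture group as a Series
print(df.iloc[:, 0].str.extract(xpr)[0])
0    paaaage_1
1    paaaage_1
2       page_1
3       page_2
Name: 0, dtype: object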

print(df.groupby([df.iloc[:, 0].str.extract(xpr)[0], df.iloc[:, 1]])[2].sum().reset_index())
           0        1   2
0  paaaage_1    China   3
1     page_1       US   1
2     page_2  Britain  10
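
The question also asks about plain Python. As a rough sketch under the assumption that the first field of every row is the path and the last field is the count, a dictionary keyed by the path root plus any fields in between handles both the one- and two-dimensional cases. The names `ROOT` and `group_by_root` are made up for this example.

```python
import re
from collections import defaultdict

# Assumed pattern: leading slash plus everything up to the next '/' or '?'.
ROOT = re.compile(r'^(/[^/?]+)')


def group_by_root(rows):
    """Sum the last field of each row, grouped by path root and any middle fields."""
    totals = defaultdict(int)
    for path, *dims, count in rows:
        match = ROOT.match(path)
        # Fall back to the raw path if it does not look like '/something'.
        root = match.group(1) if match else path
        totals[(root, *dims)] += int(count)
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return [[*key, total] for key, total in totals.items()]


one_dim = group_by_root([
    ['/page_1', 1],
    ['/page_1?x=123', 2],
    ['/page_1/subpage_1', 1],
    ['/page_2', 10],
])

two_dim = group_by_root([
    ['/page_1', 'China', 1],
    ['/page_1?x=123', 'China', 2],
    ['/page_1/subpage_1', 'US', 1],
    ['/page_2', 'Britain', 10],
])

print(one_dim)
print(two_dim)
```

Unlike the prefix-scanning loop in the question, this does not require rows with the same root to be adjacent, at the cost of holding one dictionary entry per group.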
