Counting elements based on their position in a DataFrame

Posted 2024-09-24 22:32:34


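Below I have a table in which the columns TST1 to TST5 can either be empty or take one of the following values: NOT_DONE, UNTESTED, INCOMP, 30, 35, 40, 45, 50.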

I need to count the number of validated elements (rows) in the table below.

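A row counts as validated when its rightmost value is a number between 30 and 50 (in steps of 5, so 30, 35, 40, ...). This means a row with no value in any of TST1 to TST5 is not counted, and a numeric value found to the left of NOT_DONE, INCOMP or UNTESTED is not validated.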

In other words, I need to read each row from right to left.
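
To make the rule concrete, here is a minimal plain-Python sketch of the right-to-left check for a single row. `is_validated` is a hypothetical helper name, not part of pandas, and the sketch assumes empty cells are `None` and all other cells are strings or integers.

```python
# Allowed numeric results: 30 to 50 in steps of 5 (taken from the rule above).
ALLOWED = {30, 35, 40, 45, 50}


def is_validated(row):
    """Return True if the rightmost non-empty cell is one of the allowed numbers.

    `row` is assumed to be a sequence of TST1..TST5 cells, with None for empty.
    """
    for cell in reversed(row):      # scan from right to left
        if cell is None:
            continue                # skip empty cells
        try:
            # The first non-empty cell from the right decides the result.
            return int(cell) in ALLOWED
        except ValueError:
            # NOT_DONE, UNTESTED, INCOMP (or any other text) is not a number.
            return False
    return False                    # the whole row was empty


print(is_validated([None, "NOT_DONE", None, None, "50"]))      # True
print(is_validated(["UNTESTED", None, "50", "INCOMP", None]))  # False
print(is_validated([None, None, None, None, None]))            # False
```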

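For example, in the table below only 6 elements are considered validated.

Finally, I need to count how many of them belong to group A and how many to group B.

My initial idea for solving this was to create a new column that flags the validated elements, but I really don't know how to do that.

I am using Python 2.7 and pandas 0.24.2. I am a beginner, so any help and guidance would be much appreciated.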

+-------+----------+----------+----------+--------+----------+
| Group | TST1     | TST2     | TST3     | TST4   | TST5     |
+-------+----------+----------+----------+--------+----------+
| A     |          | NOT_DONE |          |        | 50       |
+-------+----------+----------+----------+--------+----------+
| A     |          |          | 35       |        |          |
+-------+----------+----------+----------+--------+----------+
| B     |          |          |          |        |          |
+-------+----------+----------+----------+--------+----------+
| A     |          |          | INCOMP   |        |          |
+-------+----------+----------+----------+--------+----------+
| B     | UNTESTED |          | 50       | INCOMP |          |
+-------+----------+----------+----------+--------+----------+
| B     |          |          |          |        |          |
+-------+----------+----------+----------+--------+----------+
| B     |          | 30       |          |        |          |
+-------+----------+----------+----------+--------+----------+
| A     |          | INCOMP   | 40       |        |          |
+-------+----------+----------+----------+--------+----------+
| B     |          |          |          |        | UNTESTED |
+-------+----------+----------+----------+--------+----------+
| A     |          |          |          |        |          |
+-------+----------+----------+----------+--------+----------+
| B     |          | INCOMP   |          |        |          |
+-------+----------+----------+----------+--------+----------+
| A     |          |          |          |        |          |
+-------+----------+----------+----------+--------+----------+
| B     |          | 50       |          |        |          |
+-------+----------+----------+----------+--------+----------+
| B     |          |          | UNTESTED | 35     | NOT_DONE |
+-------+----------+----------+----------+--------+----------+
| B     |          |          |          |        |          |
+-------+----------+----------+----------+--------+----------+
| A     |          | 40       |          | INCOMP |          |
+-------+----------+----------+----------+--------+----------+
| A     |          |          |          | 30     |          |
+-------+----------+----------+----------+--------+----------+
| B     |          |          |          |        |          |
+-------+----------+----------+----------+--------+----------+
| B     |          | NOT_DONE |          | 30     | NOT_DONE |
+-------+----------+----------+----------+--------+----------+

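Edit: This is what I have tried, but it counts every row that contains a numeric value anywhere, not only the rows whose rightmost value is numeric. I really don't know how to make the selection start from the right.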

    filter1 = df.loc[:, 'TST1':'TST5']\
        .apply(lambda x: x.astype(str).str.match(r'\d+\.*\d*'), axis=0)\
        .any(axis=1)
    number_validated = filter1.sum()
    print "Number of validated items: ", number_validated

The expected output is just a short text summary:

Number of validated items: 6
Number of group A validated items: 4
Number of group B validated items: 2

1 Answer
Answer #1 · Posted 2024-09-24 22:32:34

Another option, tested on Python 2.7.18 and pandas 0.24.2 (it also works fine on Python 3):

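  1. Use `ffill` along the columns to pull the rightmost non-empty value into the last column, then use `pd.to_numeric` with `errors='coerce'` to turn it into a number (non-numeric values become NaN):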

    rightmost = df.filter(like='TST').ffill(axis='columns').iloc[:, -1]
    rightmost = pd.to_numeric(rightmost, errors='coerce')
    
    # 0      NaN
    # 1     35.0
    # 2      NaN
    # 3      NaN
    # 4      NaN
    # 5      NaN
    # 6     30.0
    # 7     40.0
    # 8      NaN
    # 9      NaN
    # 10     NaN
    # 11     NaN
    # 12    50.0
    # 13     NaN
    # 14     NaN
    # 15     NaN
    # 16    30.0
    # 17     NaN
    # 18     NaN
    # Name: TST5, dtype: float64
    
  2. Then `groupby` on Group and count how many values are `between` 30 and 50 (inclusive):

    valid = rightmost.groupby(df.Group).apply(
        lambda g: g.between(30, 50, inclusive='both').sum()
    ).to_frame('Valid')
    
    #        Valid
    # Group       
    # A          3
    # B          2
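
For reference, the two steps above can be put together into a self-contained sketch that rebuilds the table from the question and prints the requested summary. Several things here are assumptions rather than part of the answer: empty cells are taken to be missing values (`None`), all other cells are stored as strings, the `Validated` column name is hypothetical, and the extra multiple-of-5 check comes from the rule stated in the question. The `print` calls use Python 3 syntax; on Python 2.7 they would need `from __future__ import print_function`.

```python
import pandas as pd

columns = ["Group", "TST1", "TST2", "TST3", "TST4", "TST5"]
rows = [
    ("A", None, "NOT_DONE", None, None, "50"),
    ("A", None, None, "35", None, None),
    ("B", None, None, None, None, None),
    ("A", None, None, "INCOMP", None, None),
    ("B", "UNTESTED", None, "50", "INCOMP", None),
    ("B", None, None, None, None, None),
    ("B", None, "30", None, None, None),
    ("A", None, "INCOMP", "40", None, None),
    ("B", None, None, None, None, "UNTESTED"),
    ("A", None, None, None, None, None),
    ("B", None, "INCOMP", None, None, None),
    ("A", None, None, None, None, None),
    ("B", None, "50", None, None, None),
    ("B", None, None, "UNTESTED", "35", "NOT_DONE"),
    ("B", None, None, None, None, None),
    ("A", None, "40", None, "INCOMP", None),
    ("A", None, None, None, "30", None),
    ("B", None, None, None, None, None),
    ("B", None, "NOT_DONE", None, "30", "NOT_DONE"),
]
# dtype=object keeps the mixed text/number cells as plain Python objects.
df = pd.DataFrame(rows, columns=columns, dtype=object)

# Step 1: forward-fill across the columns so the last column holds the
# rightmost non-empty value of each row, then coerce it to a number.
rightmost = df.filter(like="TST").ffill(axis="columns").iloc[:, -1]
rightmost = pd.to_numeric(rightmost, errors="coerce")

# Step 2: a row is validated if that value is 30..50 (inclusive) in steps of 5.
# NaN (empty rows or text such as NOT_DONE) fails both tests.
valid = rightmost.between(30, 50) & (rightmost % 5 == 0)
df["Validated"] = valid  # hypothetical column name

# Count validated rows overall and per group.
per_group = valid.groupby(df["Group"]).sum().astype(int)

print("Number of validated items: {}".format(int(valid.sum())))
for group, count in per_group.items():
    print("Number of group {} validated items: {}".format(group, count))
```

Keeping the flag as a column on `df` also covers the idea from the question of marking validated rows, so they can be filtered later with `df[df["Validated"]]`.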
    
