Finding and counting string values in a Pandas DataFrame

Published 2024-05-20 02:44:16


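I have a pandas DataFrame that contains string values I want to count. The strings I want to count are the "synonymous" and "non-synonymous" ones. I found that these strings occur in columns 23, 24, 25, 29, and 31.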

Column 23 looks like this:

15392                                               OAnc=C
15393                                                  114
15394    EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc...
15395                                       0/0:30:90.29:0
15396                                            pSC=0.441
15397                                            pSC=0.030
15398                                              bSC=884
...

Column 24 looks like this:

(sample output for column 24 is missing from the source)

Column 25 looks like this:

13062                                                      C
13063                                                      C
13064    EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13065    EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING...
13066                                                 CAnc=G
13067                                                      C
13068                                                      G

Column 29 looks like this:

15688                                                  0:0
15689                                                  0:0
15690                                                  NaN
15691    EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|...
15692                                                  0:0
15693                                                  NaN
15694                                                  0:1

Column 31 looks like this:

3081                                                   45
3082                                               1432:0
3083                                                  0:0
3084    SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T473|482|...
3085                                                    9
3086                                                  0:0
3087                                                  0:0

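I would like to know how to count, across these five columns, the number of times the string "SYNONYMOUS_CODING" or "NON_SYNONYMOUS_CODING" occurs, without double counting, because in some rows these strings may appear in two or more different columns.

Thanks.

Rodrigo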


2 Answers

Here is what I did; I have included the code used to create the DataFrame. You can follow the algorithm by looking at the main() function.

import numpy as np
import pandas as pd

def create_df():
    grid = (
        {'A': ["EXON(MODIFIER||||870|RSPH10B|protein_coding|CO)",
               "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
               "INTERGENIC(MODIFIER||||||||||1)",
               "DOWNSTREAM(MODIFIER||489|||PMS2||CODING|NR_003)",
               "DOWNSTREAM(MODIFIER||408|||PMS2||CODING|NR_003)"],
         'B': ["FOO",
               "EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc",
               "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
               "pSC=0.441",
               "bSC=884"],
         'C': ["BAR",
               "BAR",
               "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
               "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
         'D': ["EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
               "0:0",
               "0:0",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|",
               "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
        }
    )
    return pd.DataFrame(grid)

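def get_masks(df):
    non_syn = pd.DataFrame(index=df.index, columns=df.columns)
    synonymous = pd.DataFrame(index=df.index, columns=df.columns)

    for i in df:
        # na=False treats missing values (NaN) as "no match"
        non_syn[i] = df[i].str.contains("NON_SYNONYMOUS_CODING", na=False)
        # "SYNONYMOUS_CODING" is a substring of "NON_SYNONYMOUS_CODING",
        # so only test the cells that did not match above
        synonymous[i] = df[i][~non_syn[i]].str.contains("SYNONYMOUS_CODING", na=False)

    # rows that hold a NON_SYNONYMOUS_CODING cell contain NaN here and are dropped
    return non_syn, synonymous.dropna()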

def count_unique_truths(df):
    # count each row at most once, even if the string matches in several columns
    return df.astype(bool).any(axis=1).sum()

def main():
    df = create_df()
    non_syn, synonymous = get_masks(df)
    non_syn_count = count_unique_truths(non_syn)
    synonymous_count = count_unique_truths(synonymous)
    print(df)
    print("Synonymous Count = {:d}\nNon_Synonymous Count = {:d}".format(int(synonymous_count), int(non_syn_count)))

if __name__ == '__main__':
    main()
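As a more compact alternative, the per-row counting can be sketched with vectorized string matching. This is only a minimal sketch under stated assumptions: the helper name `count_rows` and the sample frame are hypothetical, and every cell is cast to a string so that numbers and missing values do not break the match. A negative lookbehind keeps "SYNONYMOUS_CODING" from also matching inside "NON_SYNONYMOUS_CODING".

```python
import pandas as pd


def count_rows(df, columns):
    """Count rows containing each annotation in any of `columns`, once per row.

    Hypothetical helper, not part of pandas.
    """
    # Cast to str so numeric cells and missing values do not break .str.contains
    cells = df[columns].astype(str)
    non_syn = cells.apply(
        lambda col: col.str.contains("NON_SYNONYMOUS_CODING", regex=False)
    )
    # Negative lookbehind: match SYNONYMOUS_CODING only when not preceded by NON_
    syn = cells.apply(lambda col: col.str.contains(r"(?<!NON_)SYNONYMOUS_CODING"))
    # any(axis=1) collapses the columns, so a row counts once however often it matches
    return int(syn.any(axis=1).sum()), int(non_syn.any(axis=1).sum())


# Hypothetical sample data, loosely modelled on the columns shown in the question
sample = pd.DataFrame({
    23: ["EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc", "pSC=0.441", "114"],
    29: ["NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aCg/aT)",
         "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|", None],
    31: ["0:0", "SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T473|482|", 9],
})

syn_count, non_syn_count = count_rows(sample, [23, 29, 31])
print("Synonymous rows:", syn_count)
print("Non-synonymous rows:", non_syn_count)
```

Unlike the code above, this sketch does not exclude a row from the synonymous count just because another column of the same row holds a non-synonymous annotation; whether that is desirable depends on the data.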

I can get the number of columns in which the strings "SYNONYMOUS_CODING" and "NON_SYNONYMOUS_CODING" appear like this:

column23 = str(df_test[23])
column24 = str(df_test[24])
column25 = str(df_test[25])
column29 = str(df_test[29])
column31 = str(df_test[31])

count = 0

if "SYNONYMOUS_CODING" in column23:
    print("YES Syn in Column 23")
    count += 1

    print("Count value:")
    print(count)

if "SYNONYMOUS_CODING" in column24:
    print("YES Syn in Column 24")
    count += 1

    print("Count value:")
    print(count)

if "SYNONYMOUS_CODING" in column25:
    print("YES Syn in Column 25")
    count += 1

    print("Count value:")
    print(count)

if "SYNONYMOUS_CODING" in column29:
    print("YES Syn in Column 29")
    count += 1

    print("Count value:")
    print(count)

if "SYNONYMOUS_CODING" in column31:
    print("YES Syn in Column 31")
    count += 1

    print("Count value:")
    print(count)

if "NON_SYNONYMOUS_CODING" in column23:
    print("YES Non_Syn in Column 23")
    count += 1

    print("Count value:")
    print(count)

if "NON_SYNONYMOUS_CODING" in column24:
    print("YES Non_Syn in Column 24")
    count += 1

    print("Count value:")
    print(count)

if "NON_SYNONYMOUS_CODING" in column25:
    print("YES Non_Syn in Column 25")
    count += 1

    print("Count value:")
    print(count)

if "NON_SYNONYMOUS_CODING" in column29:
    print("YES Non_Syn in Column 29")
    count += 1

    print("Count value:")
    print(count)

if "NON_SYNONYMOUS_CODING" in column31:
    print("YES Non_Syn in Column 31")
    count += 1

    print("Count value:")
    print(count)

But this is highly repetitive and not as Pythonic as I would like...
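The repeated `if` blocks above can be folded into a loop over the column labels and the two search strings. The following is a rough sketch, not a definitive solution: the helper `columns_with` is hypothetical, and the small `df_test` frame is made-up stand-in data. It checks the cells themselves rather than `str(df_test[col])`, because the printed form of a long Series is truncated and may hide matches.

```python
import pandas as pd


def columns_with(df, columns, needle):
    """Return the labels in `columns` whose cells contain `needle` at least once.

    Hypothetical helper, not part of pandas.
    """
    hits = []
    for col in columns:
        # astype(str) lets the check run on numeric cells and NaN as well
        if df[col].astype(str).str.contains(needle, regex=False).any():
            hits.append(col)
    return hits


# Made-up stand-in for the real df_test
df_test = pd.DataFrame({
    23: ["OAnc=C", "EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gc"],
    24: ["114", "0:0"],
    25: ["C", "EFF=SYNONYMOUS(MODIFIER|||||DKFZp434L192||CODING"],
    29: ["0:0", "EFF=SYNONYMOUS_CODING(LOW|SILENT|tcC/tcG|S782|"],
    31: [45, "SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T473|482|"],
})

for needle in ("SYNONYMOUS_CODING", "NON_SYNONYMOUS_CODING"):
    found = columns_with(df_test, [23, 24, 25, 29, 31], needle)
    print(needle, "found in columns", found, "count:", len(found))
```

As in the repeated version, a plain substring test for "SYNONYMOUS_CODING" also matches cells that contain "NON_SYNONYMOUS_CODING", and this counts columns rather than rows, so it does not yet answer the original question about avoiding double counting.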