Find a string in a DataFrame and store a new value in a new column

Posted 2024-10-03 17:17:12


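I am writing a script that takes a csv file whose column layout and column names are unknown. What I do know is that exactly one column contains values in which the strings 'rs' and 'del' appear.

I need to create an extra column (called "Type") and store "dbsnp" in the rows where "rs" is found and "deletion" in the rows where "del" is found. If neither string is found, that row of the Type column should stay empty.

As an example, I provide the following df: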

import pandas as pd

Data = {'Name': ['Mukul', 'Rohan', 'Mayank',
                 'Shubham', 'Aakash'],

        'Location': ['Saharsanpur', 'MERrs', 'rsAdela',
                     'aaaadelaa', 'aaa'],

        'Pay': [25000, 30000, 35000, 40000, 45000]}

df = pd.DataFrame(Data)
print(df)

      Name     Location    Pay
0    Mukul  Saharsanpur  25000
1    Rohan        MERrs  30000
2   Mayank      rsAdela  35000
3  Shubham    aaaadelaa  40000
4   Aakash          aaa  45000

I have been trying something like this:

# str.extract requires a capture group, hence "(rs)" rather than "rs"
df["type"] = df["Name"].str.extract("(rs)")[0]
# and then do some replace

But one of my problems is that I know neither the name of the column nor its position.

Expected output:

      Name     Location    Pay      type
0    Mukul  Saharsanpur  25000     dbsnp
1    Rohan        MERrs  30000     dbsnp
2   Mayank      rsAdela  35000     dbsnp
3  Shubham    aaaadelaa  40000  deletion
4   Aakash          aaa  45000

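The following for loop solves the unknown-column problem, but now I need to solve the problem of identifying the string inside the values.

How can I use str.contains("rs") in an if condition?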

for index, row in df[:3].iterrows():
    for i in range(len(df.columns)):
        # row.iloc[i] reads the cell by position, whatever the column is called
        if row.iloc[i] == 5:
            print(row.index[i])


3 answers

You can do this without a loop. Here is one way: use applymap to search every column.

import pandas as pd
data = {'Number': ['Mukul', 'Rohan', 'Mayank', 
                  'Shubham', 'Aakash'], 
          
        'Location': ['Saharsanpur', 'MERrs', 'rsAdela', 
                     'aaaadelaa', 'aaa'], 
          
        'Pay': [25000, 30000, 35000, 40000, 45000]} 
  
df = pd.DataFrame(data)

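# Cast every cell to str, test each cell for the substring, then reduce per row.
# (applymap is deprecated since pandas 2.1 in favor of DataFrame.map.)
df['rs'] = df.astype(str).applymap(lambda x: 'rs' in x).any(axis=1)
df['del'] = df.astype(str).applymap(lambda x: 'del' in x).any(axis=1)

df['type'] = ''
# 'del' is assigned last, so it wins when a row contains both strings
df.loc[df['rs'], 'type'] = 'dbsnp'
df.loc[df['del'], 'type'] = 'deletion'

# drop the temporary helper columns
df = df.drop(columns=['rs', 'del'])
print(df)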

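In the sample data, rsAdela contains both rs and del. Because I apply rs first and del second, that row ends up tagged as deletion. You can swap the order to decide whether such a value is kept as dbsnp or as deletion.

The code processes all columns regardless of their data type.

The output for the data above is: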

    Number     Location    Pay      type
0    Mukul  Saharsanpur  25000     dbsnp
1    Rohan        MERrs  30000     dbsnp
2   Mayank      rsAdela  35000  deletion
3  Shubham    aaaadelaa  40000  deletion
4   Aakash          aaa  45000          
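
The point about ordering can also be made explicit instead of relying on which assignment runs last. The following is only a sketch, not part of the original answer: it swaps in numpy.select, which takes the first matching condition for each row, so listing rs before del gives rsAdela the dbsnp tag that the question's expected output shows. The variable names (text, has_rs, has_del) are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Number": ["Mukul", "Rohan", "Mayank", "Shubham", "Aakash"],
    "Location": ["Saharsanpur", "MERrs", "rsAdela", "aaaadelaa", "aaa"],
    "Pay": [25000, 30000, 35000, 40000, 45000],
})

# Cast everything to str so that numeric columns can be searched as well.
text = df.astype(str)

# Row-wise masks: True if any cell in the row contains the substring.
# regex=False treats the pattern as a literal string.
has_rs = text.apply(lambda col: col.str.contains("rs", regex=False)).any(axis=1)
has_del = text.apply(lambda col: col.str.contains("del", regex=False)).any(axis=1)

# np.select uses the first condition that is True for each row,
# so the order of the two lists sets the priority (here: rs before del).
df["type"] = np.select(
    [has_rs.to_numpy(), has_del.to_numpy()],
    ["dbsnp", "deletion"],
    default="",
)
print(df)
```

Reversing the two lists would reproduce the behaviour of the answer above, where deletion takes precedence.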

This example may help you:

import pandas as pd
import random

inp = [{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}]
df = pd.DataFrame(inp)

df['newColumn'] = ""
yourCondition = True
for i in range(len(df)):
    # put your condition here, for example:
    #
    # if "rs" in str(df.at[i, 'Name']):
    #    df.at[i, 'newColumn'] = "Found!"
    # else:
    #    df.at[i, 'newColumn'] = "Not Found!"
    if yourCondition:
        # now you can update what you want
        df.at[i, 'newColumn'] = random.randint(0, 9)

print(df)

Output (the values are random, so they differ from run to run)

   c1   c2 newColumn
0  10  100         5
1  11  110         7
2  12  120         2

You can add a new column like this: df['newColumn'] = ""
Then iterate over the rows of the DataFrame like this: for i in range(len(df)): and access each element like this: df.at[i, 'newColumn'] (the row labels here are the default 0, 1, 2, so i is also the label).
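
As for the question of how to test for a substring inside an if condition: within a loop each cell is a plain Python value, so the in operator on str(value) is enough and Series.str.contains is not needed. Below is a minimal sketch of the loop approach applied to the question's sample data; the names cells and types are illustrative only, and rs is checked first here purely as an assumption about the desired priority.

```python
import pandas as pd

df = pd.DataFrame({
    "Number": ["Mukul", "Rohan", "Mayank", "Shubham", "Aakash"],
    "Location": ["Saharsanpur", "MERrs", "rsAdela", "aaaadelaa", "aaa"],
    "Pay": [25000, 30000, 35000, 40000, 45000],
})

types = []
for row in df.itertuples(index=False):
    # Convert every cell of the row to str so numbers can be searched too.
    cells = [str(value) for value in row]
    if any("rs" in cell for cell in cells):
        types.append("dbsnp")
    elif any("del" in cell for cell in cells):
        types.append("deletion")
    else:
        types.append("")

# Assign the collected labels in one step instead of cell by cell.
df["type"] = types
print(df)
```

On large files a vectorized approach will usually be faster, but the loop keeps the if/elif logic easy to read and extend.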

You can use str.contains, as @Joe Ferndz said:

# create filter based on your criteria
msk1 = df['Location'].str.contains('rs')
msk2 = df['Location'].str.contains('del')

# only make changes to those that fit the criteria
df.loc[msk1, 'Type'] = 'dbsnp'
df.loc[msk2, 'Type'] = 'deletion'

# if you wish to fill NaN with empty string
df['Type'] = df['Type'].fillna('')
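
The snippet above hard-codes 'Location', whereas the question states that the column name is unknown. One possible way to bridge that gap, sketched here under the question's assumption that only one column contains the markers, is to scan the columns and take the first one in which any value matches. The helper find_marker_column is hypothetical and not a pandas function.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Number": ["Mukul", "Rohan", "Mayank", "Shubham", "Aakash"],
    "Location": ["Saharsanpur", "MERrs", "rsAdela", "aaaadelaa", "aaa"],
    "Pay": [25000, 30000, 35000, 40000, 45000],
})


def find_marker_column(frame, markers=("rs", "del")):
    """Hypothetical helper: name of the first column containing any marker."""
    # Build a regex alternation such as "rs|del"; escape in case of special chars.
    pattern = "|".join(re.escape(m) for m in markers)
    for name in frame.columns:
        if frame[name].astype(str).str.contains(pattern).any():
            return name
    return None


col = find_marker_column(df)

df["Type"] = ""
if col is not None:
    values = df[col].astype(str)
    # Same order as the answer above: 'del' is applied last and therefore wins.
    df.loc[values.str.contains("rs", regex=False), "Type"] = "dbsnp"
    df.loc[values.str.contains("del", regex=False), "Type"] = "deletion"

print(col)
print(df)
```

If several columns could match, the helper would need a rule for choosing between them; returning the first match is only a simplification.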
