Encoding genomic data in a DataFrame with Python

import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','g','c','t','z']))

def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded

dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath)
df = ordinal_encoder(string_to_array(dataframe[['Genome']].values.tostring()))
print(df)

  Antibiotic  ...                                             Genome
0  isoniazid  ...  ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1  isoniazid  ...  gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2  isoniazid  ...  aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3  isoniazid  ...  gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4  isoniazid  ...  ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...

1 Answer

User

#1 · Posted 2024-09-26 20:51:55

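I don't think LabelEncoder is what you want here. This is a simple transformation, so I'd suggest doing it directly. Start with a lookup that maps each base to its value: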

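# Values match the question's code: LabelEncoder sorts its classes
# alphabetically (a, c, g, t, z), so c is 0.50 and g is 0.75.
lookup = {
  'a': 0.25,
  'c': 0.50,
  'g': 0.75,
  't': 1.00
  # anything else ('z' in the question): 0.00
}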
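As a quick check of the mapping on a single sequence, here is a minimal sketch that encodes one string with a plain dictionary lookup. The helper name `encode_sequence` and the sample string are made up for illustration, and the values follow the comments in the question's code (A=0.25, C=0.50, G=0.75, T=1.00).

```python
import numpy as np

# Mapping taken from the comments in the question's code;
# any other character falls back to 0.0.
LOOKUP = {'a': 0.25, 'c': 0.50, 'g': 0.75, 't': 1.00}

def encode_sequence(seq):
    """Hypothetical helper: encode one DNA string as a float array."""
    return np.array([LOOKUP.get(bp, 0.0) for bp in seq.lower()])

# Sample input invented for illustration; 'N' is not a known base.
encoded = encode_sequence("ACGTN")
print(encoded)
```

Using `dict.get` with a default covers the "anything else" case without the regex substitution step from the question.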

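Then apply the lookup to the values in the 'Genome' column. Because the lambda returns a pd.Series for each row, apply produces a DataFrame, and the values attribute returns it as an ndarray: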

dataframe['Genome'].apply(
    lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])
).values
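To see what that expression returns, here is a small self-contained sketch on a toy DataFrame. The two short sequences are invented for illustration (real genomes are far longer), and the mapping follows the comments in the question's code. One thing worth noticing, which the answer does not mention: when rows have different lengths, pandas aligns the per-row Series by position and pads the shorter rows with NaN.

```python
import numpy as np
import pandas as pd

# Mapping from the question's code comments; unknown characters become 0.0.
lookup = {'a': 0.25, 'c': 0.50, 'g': 0.75, 't': 1.00}

# Toy data, invented for illustration only.
dataframe = pd.DataFrame({
    'Antibiotic': ['isoniazid', 'isoniazid'],
    'Genome': ['ACGT', 'ggn'],
})

# One pd.Series per row, so apply returns a DataFrame: one row per genome,
# one column per base position.
encoded = dataframe['Genome'].apply(
    lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])
).values

print(encoded.shape)
print(encoded)
```

If the NaN padding is unwanted, one option is to keep one array per row instead of a single 2-D array, or to pad with 0.0 explicitly; which is appropriate depends on what the downstream model expects.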
