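<p>OK, first of all, you don't want to loop over the rows of a dataframe. Dataframes are designed to operate on all rows at once. It takes a while to get your head around this, but once you've defined a few row-level operations and applied them to large dataframes, it gets smoother. (The problem with looping over rows is one of <em>speed</em>. A loop is sometimes useful for debugging or toy problems, but modern computing hardware tries to parallelize computation as much as possible. Dataframes take advantage of this by processing all rows at once rather than handling them one by one in a loop.)</p>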
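<p>As a rough illustration of the difference, here is a minimal sketch that computes the same thing twice: once with an explicit row loop and once with a whole-column operation. The toy dataframe and the string-length task are hypothetical, chosen only to keep the example small.</p>

```python
import pandas as pd

# Hypothetical toy data; the column name "Peptide" is just an example.
toy_dataframe = pd.DataFrame({"Peptide": ["ACG", "GGTA", "TU"]})

# Row-by-row loop: works, but handles one row at a time in Python.
lengths_loop = []
for _, row in toy_dataframe.iterrows():
    lengths_loop.append(len(row["Peptide"]))

# Whole-column operation: one call that covers every row.
lengths_column = toy_dataframe["Peptide"].str.len()

print(lengths_loop)
print(lengths_column.tolist())
```

<p>Both approaches give the same lengths; the second leaves the iteration to pandas and is the style that scales to large dataframes.</p>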
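<p>To do the transformation, you'll need to define a custom function that operates on a single row's value. You then pass that function to the dataframe and tell it to <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html" rel="nofollow noreferrer">apply</a> it to a column, which produces a new column.</p>
<p>So here's a possible function to get you started:</p>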
<pre><code>def peptide_score(peptide_string):
    '''Returns a numerical score given a sequence of peptide characters.'''
    # Replace the values in this dict (dictionary / map) with whatever values you need
    amino_acid_scores = {
        'A': 0.1,
        'C': 1.4,
        'G': 0.32342,
        'T': -0.23,
        'U': 74.22
    }
    # This is called a "list comprehension." It's great for transforming sequences.
    score_list = [amino_acid_scores[character] for character in peptide_string]
    return sum(score_list)


# I'm assuming your pre-existing dataframe is called "gluc_dataframe" and that the
# column with your strings is called "Peptide". Output scores will be stored in a new
# column, "score". Replace those names with whatever fits.
gluc_dataframe['score'] = gluc_dataframe['Peptide'].apply(peptide_score)
</code></pre>
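<p>If you have a lot of characters you want to ignore (spaces, punctuation, etc.), you can replace <code>amino_acid_scores[character]</code> in the list comprehension with <code>amino_acid_scores.get(character, 0.0)</code>. With the plain lookup, any character missing from the dict raises a <code>KeyError</code>; with <code>.get</code>, it simply contributes 0.0 to the score.</p>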
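<p>Here is a minimal, self-contained sketch of that lenient variant. The function name <code>peptide_score_lenient</code> and the toy dataframe are hypothetical stand-ins, and the scores are the same placeholder values as above, not real data.</p>

```python
import pandas as pd

# Placeholder scores, copied from the example above; replace with real values.
AMINO_ACID_SCORES = {"A": 0.1, "C": 1.4, "G": 0.32342, "T": -0.23, "U": 74.22}


def peptide_score_lenient(peptide_string):
    """Sum per-character scores, counting any unknown character as 0.0."""
    return sum(AMINO_ACID_SCORES.get(character, 0.0) for character in peptide_string)


# Hypothetical toy data standing in for gluc_dataframe.
toy_dataframe = pd.DataFrame({"Peptide": ["AC", "A C-G", "??"]})
toy_dataframe["score"] = toy_dataframe["Peptide"].apply(peptide_score_lenient)
print(toy_dataframe)
```

<p>The space, the hyphen, and the question marks are not in the dict, so they add nothing to the totals. If silently ignoring typos in your data worries you, keep the strict lookup so unexpected characters fail loudly instead.</p>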