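<p>Two improvements can be made to the posted code:</p>
<ul>
<li>Use DataFrame <code>apply</code> rather than a Python <code>for</code> or <code>while</code> loop to process each title (such loops are very slow).</li>
<li>Use a regular expression to check whether a letter follows a comma, rather than looping over every letter of the alphabet (also slow).</li>
</ul>
<p><strong>Code</strong></p>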
<pre><code>import re

def clean_title(title):
    """Remove any comma that is immediately followed by a word character."""
    # A comma followed by a space (e.g. "I, Tonya") is left untouched
    return re.sub(r',(\w)', r'\1', title)

# Clean every title in the column
df['title'] = df['title'].apply(clean_title)
</code></pre>
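<p>One caveat worth sketching: <code>\w</code> also matches digits and underscores, so a title such as "10,000 BC" would lose its comma. The variant below is only an illustration under the assumption that letters alone should trigger removal; the name <code>clean_title_letters_only</code> and the extra sample title are hypothetical and not part of the original code.</p>

```python
import re

# Assumption: only letters (not digits or underscores) should trigger removal.
# [^\W\d_] matches a word character that is neither a digit nor an underscore.
LETTER_AFTER_COMMA = re.compile(r',([^\W\d_])')

def clean_title_letters_only(title):
    """Remove a comma only when a letter immediately follows it."""
    return LETTER_AFTER_COMMA.sub(r'\1', title)

# Illustrative samples: a stray comma, a wanted comma, and a numeric comma
samples = ['The S,even Samurai', 'I, Tonya', '10,000 BC']
for s in samples:
    print(repr(s), '->', repr(clean_title_letters_only(s)))
```

<p>If the titles never contain digits next to commas, the simpler <code>\w</code> pattern behaves the same way.</p>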
<p><strong>Test</strong></p>
<ul>
<li>Build a dataset of movie titles and release years.</li>
<li>The titles contain both wanted and unwanted commas.</li>
</ul>
<p>Example of an unwanted comma:</p>
<ul>
<li>"The S,even Samurai"</li>
</ul>
<p>Example of a wanted comma:</p>
<ul>
<li>"I, Tonya"</li>
</ul>
<p>Create the dataset:</p>
<pre><code>import pandas as pd

df = pd.DataFrame({'title': ['Lock, Stock and Two Smoking Barrels', 'The S,even Samurai', 'B,onnie and C,lyde', 'Reser,voir Dogs', 'A,irplane!', 'Doct,or Zhiva,go', 'I, Tonya'],
                   'Year': ['1998', '1954', '1967', '1992', '1980', '1965', '2017']})
print(df)
</code></pre>
<p>Dataset before cleaning:</p>
<pre><code>                                 title  Year
0  Lock, Stock and Two Smoking Barrels  1998
1                   The S,even Samurai  1954
2                   B,onnie and C,lyde  1967
3                      Reser,voir Dogs  1992
4                           A,irplane!  1980
5                     Doct,or Zhiva,go  1965
6                             I, Tonya  2017
</code></pre>
<p>Dataset after cleaning:</p>
<pre><code>                                 title  Year
0  Lock, Stock and Two Smoking Barrels  1998
1                    The Seven Samurai  1954
2                     Bonnie and Clyde  1967
3                       Reservoir Dogs  1992
4                            Airplane!  1980
5                       Doctor Zhivago  1965
6                             I, Tonya  2017
</code></pre>
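<p>The before/after comparison can also be checked programmatically rather than by eye. The following is a minimal sketch that assumes pandas is installed; the <code>expected</code> list is simply the cleaned titles from the output shown, and the variable names are illustrative.</p>

```python
import re
import pandas as pd

def clean_title(title):
    """Remove any comma that is immediately followed by a word character."""
    return re.sub(r',(\w)', r'\1', title)

df = pd.DataFrame({
    'title': ['Lock, Stock and Two Smoking Barrels', 'The S,even Samurai',
              'B,onnie and C,lyde', 'Reser,voir Dogs', 'A,irplane!',
              'Doct,or Zhiva,go', 'I, Tonya'],
    'Year': ['1998', '1954', '1967', '1992', '1980', '1965', '2017'],
})

# Titles as they should look after cleaning (hypothetical helper list)
expected = ['Lock, Stock and Two Smoking Barrels', 'The Seven Samurai',
            'Bonnie and Clyde', 'Reservoir Dogs', 'Airplane!',
            'Doctor Zhivago', 'I, Tonya']

cleaned = df['title'].apply(clean_title)

# Fails loudly if any title was cleaned incorrectly
assert cleaned.tolist() == expected
print('All titles cleaned as expected')
```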