Answers to: Remove unwanted parts from strings in a column

1 answer

Anonymous, 1 day ago

<blockquote>
<h2>How do I remove unwanted parts from strings in a column?</h2>
</blockquote>
<p>Six years after the original question was posted, pandas now has a good number of "vectorised" string functions that can perform these string manipulation operations succinctly.</p>
<p>This answer explores some of these string functions, suggests faster alternatives, and closes with a timings comparison.</p>
<hr/>
<h3><strong><a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html" rel="noreferrer"><code>Series.str.replace</code></a></strong></h3>
<p>Specify the substring/pattern to match, and the substring to replace it with.</p>
<pre><code>pd.__version__
# '0.24.1'

df
    time result
1  09:00   +52A
2  10:00   +62B
3  11:00   +44a
4  12:00   +30b
5  13:00  -110a
</code></pre>
<p>The examples below pass <code>regex=True</code> explicitly. It was the default in pandas 0.24, but since pandas 2.0 <code>str.replace</code> defaults to <code>regex=False</code>, which would treat <code>\D</code> as a literal string.</p>
<pre><code>df['result'] = df['result'].str.replace(r'\D', '', regex=True)
df
    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110
</code></pre>
<p>If you need the result converted to an integer, you can use <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html" rel="noreferrer"><code>Series.astype</code></a>:</p>
<pre><code>df['result'] = df['result'].str.replace(r'\D', '', regex=True).astype(int)

df.dtypes
time      object
result     int64
dtype: object
</code></pre>
<p>If you don't want to modify <code>df</code> in-place, use <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html" rel="noreferrer"><code>DataFrame.assign</code></a>:</p>
<pre><code>df2 = df.assign(result=df['result'].str.replace(r'\D', '', regex=True))
df
# Unchanged
</code></pre>
<hr/>
<h3><strong><a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html" rel="noreferrer"><code>Series.str.extract</code></a></strong></h3>
<p>Useful for extracting the substring(s) you want to keep.</p>
<pre><code>df['result'] = df['result'].str.extract(r'(\d+)', expand=False)
df
    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110
</code></pre>
<p>With <code>extract</code>, you must specify at least one capture group. <code>expand=False</code> returns a Series containing the items captured by the first capture group.</p>
<hr/>
<h3><strong><a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html" rel="noreferrer"><code>Series.str.split</code></a></strong> and <strong><a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.get.html" rel="noreferrer"><code>Series.str.get</code></a></strong></h3>
<p>Splitting works as long as all your strings follow this consistent structure (one non-digit character, the digits, then one non-digit character).</p>
<pre><code># Equivalent: df['result'] = df['result'].str.split(r'\D').str[1]
df['result'] = df['result'].str.split(r'\D').str.get(1)
df
    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110
</code></pre>
<p>Not recommended if you are looking for a general solution.</p>
<hr/>
<blockquote>
<p>If you are satisfied with the succinct and readable <code>str</code> accessor-based solutions above, you can stop here. However, if you are interested in faster, more performant alternatives, keep reading.</p>
</blockquote>
<hr/>
<h2><strong>Optimizing: List Comprehensions</strong></h2>
<p>In some circumstances, list comprehensions should be favoured over pandas string functions. The reason is that string functions are inherently hard to vectorize (in the true sense of the word), so most string and regex functions are only wrappers around loops, with extra overhead.</p>
<p>My write-up, <a href="https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care">For loops with pandas - When should I care?</a>, goes into greater detail.</p>
<p>The <code>str.replace</code> option can be re-written using <code>re.sub</code>:</p>
<pre><code>import re

# Pre-compile your regex pattern for more performance.
p = re.compile(r'\D')
df['result'] = [p.sub('', x) for x in df['result']]
df
    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110
</code></pre>
<p>The <code>str.extract</code> example can be re-written using a list comprehension with <code>re.search</code>:</p>
<pre><code>p = re.compile(r'\d+')
df['result'] = [p.search(x)[0] for x in df['result']]
df
    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110
</code></pre>
<p>If NaNs or no-matches are a possibility, you will need to re-write the above to include some error checking. I do this using a function.</p>
<pre><code>import numpy as np

def try_extract(pattern, string):
    try:
        m = pattern.search(string)
        return m.group(0)
    except (TypeError, ValueError, AttributeError):
        # TypeError: `string` is not a str (e.g. NaN);
        # AttributeError: no match, so `m` is None.
        return np.nan

p = re.compile(r'\d+')
df['result'] = [try_extract(p, x) for x in df['result']]
df
    time result
1  09:00     52
2  10:00     62
3  11:00     44
4  12:00     30
5  13:00    110
</code></pre>
<p>We can also re-write @eumiro's and @MonkeyButter's answers using list comprehensions:</p>
<pre><code>df['result'] = [x.lstrip('+-').rstrip('aAbBcC') for x in df['result']]
</code></pre>
<p>And,</p>
<pre><code>df['result'] = [x[1:-1] for x in df['result']]
</code></pre>
<p>The same rules for handling NaNs and the like apply.</p>
<hr/>
<h2>Performance Comparison</h2>
<p><a href="https://i.stack.imgur.com/fjl53.png" rel="noreferrer"><img src="https://i.stack.imgur.com/fjl53.png" alt="enter image description here"/></a></p>
<p>Graphs generated using <a href="https://github.com/nschloe/perfplot" rel="noreferrer">perfplot</a>. <a href="https://gist.github.com/Coldsp33d/c4a329fd00604e47d513b32e8a25f298" rel="noreferrer">Full code listing, for your reference.</a> The relevant functions are listed below.</p>
<p>Some of these comparisons are unfair because they take advantage of the structure of OP's data, but take from it what you will. One thing to note is that every list comprehension function is either faster than or comparable to its equivalent pandas variant.</p>
<p><strong>Functions</strong></p>
<blockquote>
<pre><code>import re

# Pre-compiled patterns used by the list comprehension variants.
p1 = re.compile(r'\D')
p2 = re.compile(r'\d+')

def eumiro(df):
    return df.assign(
        result=df['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC')))

def coder375(df):
    return df.assign(
        result=df['result'].replace(r'\D', r'', regex=True))

def monkeybutter(df):
    return df.assign(result=df['result'].map(lambda x: x[1:-1]))

def wes(df):
    return df.assign(result=df['result'].str.lstrip('+-').str.rstrip('aAbBcC'))

def cs1(df):
    return df.assign(result=df['result'].str.replace(r'\D', '', regex=True))

def cs2_ted(df):
    # `str.extract` based solution, similar to @Ted Petrou's, so timing together.
    return df.assign(result=df['result'].str.extract(r'(\d+)', expand=False))

def cs1_listcomp(df):
    return df.assign(result=[p1.sub('', x) for x in df['result']])

def cs2_listcomp(df):
    return df.assign(result=[p2.search(x)[0] for x in df['result']])

def cs_eumiro_listcomp(df):
    return df.assign(
        result=[x.lstrip('+-').rstrip('aAbBcC') for x in df['result']])

def cs_mb_listcomp(df):
    return df.assign(result=[x[1:-1] for x in df['result']])
</code></pre>
</blockquote>
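<p>The answer notes that the list comprehension variants need the same care around NaNs and non-matching values, but only shows the <code>try</code>/<code>except</code> form for the <code>re.search</code> case. What follows is a minimal sketch of one way to do this with explicit checks instead of exception handling, using only the standard library. The helper names <code>extract_digits</code> and <code>strip_affixes</code>, and the sample <code>values</code> list, are hypothetical and not part of the original answer.</p>

```python
import math
import re

# Pre-compiled pattern, as in the list comprehension examples.
DIGITS = re.compile(r'\d+')

def extract_digits(value):
    """Return the first run of digits in `value`, or NaN if there is none."""
    if not isinstance(value, str):
        # Missing values (None, float NaN) pass through as NaN.
        return math.nan
    match = DIGITS.search(value)
    return match.group(0) if match else math.nan

def strip_affixes(value):
    """Strip leading '+'/'-' and trailing a/b/c letters; NaN for non-strings."""
    if not isinstance(value, str):
        return math.nan
    return value.lstrip('+-').rstrip('aAbBcC')

# Hypothetical column contents, including missing and non-matching entries.
values = ['+52A', '-110a', None, math.nan, 'abc']

extracted = [extract_digits(v) for v in values]
stripped = [strip_affixes(v) for v in values]
```

<p>Either list could then be assigned back with something like <code>df['result'] = [extract_digits(v) for v in df['result']]</code>. Note that the two helpers differ on a string with no digits: the extraction variant yields NaN, while the strip variant returns whatever is left after stripping, which may be an empty string.</p>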
