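<p>This solution keeps all ASCII and Latin-1 characters, i.e. the characters between U+0000 and U+00FF in <a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters" rel="nofollow noreferrer">this list</a>, and drops everything else. To also keep extended Latin and Greek, use <code>< 1024</code> instead:</p>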
<pre class="lang-py prettyprint-override"><code>import pandas as pd
df = pd.DataFrame({'messages': ['Länder 🇩🇪❤️', 'Hello! 👋']})
# Keep only characters whose code point is below 256 (ASCII + Latin-1)
filter_char = lambda c: ord(c) < 256
df['messages'] = df['messages'].apply(lambda s: ''.join(filter(filter_char, s)))
</code></pre>
<p>Result:</p>
<pre class="lang-none prettyprint-override"><code> messages
0 Länder
1 Hello!
</code></pre>
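<p>To make the effect of the cutoff concrete, here is a minimal sketch in plain Python. The helper name <code>keep_below</code> and the sample strings are made up for illustration and are not part of the original snippet; it only compares a limit of 256 with a limit of 1024:</p>

```python
def keep_below(text, limit=256):
    """Keep only the characters whose code point is below `limit`."""
    return ''.join(c for c in text if ord(c) < limit)

# Hypothetical sample strings: Latin-1, Greek, and Japanese text, each with an emoji
samples = ['Länder 👋', 'αβγ 👋', 'こんにちは 👋']

for s in samples:
    # repr() makes the leftover trailing space visible
    print(repr(keep_below(s)), repr(keep_below(s, 1024)))
```

<p>With a DataFrame, the same helper could be applied as <code>df['messages'].apply(keep_below, limit=1024)</code>. Note that the space before each removed emoji survives, so you may want to follow up with <code>.str.strip()</code>.</p>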
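<p>Note: this does not work for Japanese text, for example, because those characters lie well above the cutoff and would be removed too. Another problem is that the heart "emoji" is actually a <a href="https://en.wikipedia.org/wiki/Dingbat#Dingbats_Unicode_block" rel="nofollow noreferrer">Dingbat</a>, which sits inside the <a href="https://en.wikipedia.org/wiki/Unicode_block#List_of_blocks" rel="nofollow noreferrer">Basic Multilingual Plane</a>, so I can't simply keep everything in the Basic Multilingual Plane and drop the rest. Oh well.</p>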
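<p>The point about the heart can be checked directly. The following is only a sketch: <code>keep_bmp</code> is a hypothetical helper that keeps every code point below U+10000, i.e. the whole Basic Multilingual Plane, and the sample characters are written as escapes so their code points are unambiguous:</p>

```python
heart = '\u2764\ufe0f'  # heavy black heart + variation selector-16
wave = '\U0001F44B'     # waving hand, outside the BMP

def keep_bmp(text):
    """Hypothetical filter: keep only Basic Multilingual Plane characters."""
    return ''.join(c for c in text if ord(c) < 0x10000)

for c in heart + wave:
    plane = 'BMP' if ord(c) < 0x10000 else 'outside BMP'
    print(f'U+{ord(c):04X}', plane)

# The waving hand is dropped, but the heart survives the filter
print(keep_bmp('Hello! ' + wave + heart) == 'Hello! ' + heart)
```

<p>Such a filter would leave Japanese text intact, but as the last line shows, the heart stays in the output as well.</p>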