<p>You can iterate over the characters, get each one's code point, and check it against the allowed values:</p>
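<pre><code>import re

def sanitize(unsafe_str):
    # Printable ASCII: code points 32 (space) through 126 (~)
    allowed_range = set(range(32, 127))
    safe_str = ''
    for char in unsafe_str:
        cp = ord(char)
        if cp in allowed_range:
            safe_str += char
        elif cp == 9:  # tab: expand to four spaces
            safe_str += ' ' * 4
    # Collapse runs of whitespace into a single space
    return re.sub(r'\s+', ' ', safe_str)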
</code></pre>
<p><strong>Example:</strong></p>
<pre><code>In [1042]: unsafe_string = "\u2502\u251cAPPLES\n\t\t\t\t\t\r\r AND \n\nBANANAS"
In [1043]: def sanitize(unsafe_str):
      ...:     allowed_range = set(range(32, 127))
      ...:     safe_str = ''
      ...:     for char in unsafe_str:
      ...:         cp = ord(char)
      ...:         if cp in allowed_range:
      ...:             safe_str += char
      ...:         elif cp == 9:
      ...:             safe_str += ' ' * 4
      ...:     return re.sub(r'\s+', ' ', safe_str)
      ...:
      ...:
In [1044]: sanitize(unsafe_string)
Out[1044]: 'APPLES AND BANANAS'
</code></pre>
<p>The final <code>re.sub(r'\s+', ' ', safe_str)</code> call collapses each run of whitespace into a single space. If you don't want that, just <code>return safe_str</code> instead:</p>
<pre><code>In [1046]: def sanitize(unsafe_str):
      ...:     allowed_range = set(range(32, 127))
      ...:     safe_str = ''
      ...:     for char in unsafe_str:
      ...:         cp = ord(char)
      ...:         if cp in allowed_range:
      ...:             safe_str += char
      ...:         elif cp == 9:
      ...:             safe_str += ' ' * 4
      ...:     return safe_str
      ...:
In [1047]: sanitize(unsafe_string)
Out[1047]: 'APPLES                     AND BANANAS'
</code></pre>
<hr/>
<p>FWIW, this rebuilds the allowed set every time the function runs. Since it is a constant, you can put it at module level so it is built only once, e.g.:</p>
<pre><code>ALLOWED_RANGE = set(range(32, 127))
</code></pre>
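<p>Putting the pieces together, a minimal sketch of the function using the module-level constant might look like the following. The <code>collapse_whitespace</code> flag is a hypothetical parameter added here for illustration (it is not part of the function above), and building the result with <code>''.join</code> is just one way to avoid repeated string concatenation:</p>

```python
import re

# Built once at import time: printable ASCII, code points 32 (space) through 126 (~).
ALLOWED_RANGE = set(range(32, 127))


def sanitize(unsafe_str, collapse_whitespace=True):
    # collapse_whitespace is a hypothetical flag introduced for this sketch.
    parts = []
    for char in unsafe_str:
        cp = ord(char)
        if cp in ALLOWED_RANGE:
            parts.append(char)
        elif cp == 9:  # tab: expand to four spaces
            parts.append(' ' * 4)
        # every other character (newlines, box-drawing glyphs, ...) is dropped
    safe_str = ''.join(parts)
    if collapse_whitespace:
        # Collapse each run of whitespace into a single space
        return re.sub(r'\s+', ' ', safe_str)
    return safe_str
```

<p>Calling <code>sanitize(text, collapse_whitespace=False)</code> then corresponds to the <code>return safe_str</code> variant shown earlier.</p>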