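<p>The pattern matches any word character followed by at least two more repetitions of that same character, i.e. a run of three or more identical characters.</p>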
<pre><code>import pandas as pd

s = [' are they saddddd?',
     " I don't want to go",
     ' heyyyyy',
     ' 12333',
     ' 00unit',
     ' 00wolf',
     ' 01man',
     ' 20595',
     ' 2091996',
     ' 03dumbdumb']
df = pd.DataFrame(s, columns=['Sentence'])

In [25]: pattern = r'((\w)\2{2,})'

# Rows with at least one match get Lab = 1; all other rows stay NaN
In [26]: df.loc[df['Sentence'].str.findall(pattern).astype(bool), 'Lab'] = 1

In [27]: df
Out[27]:
              Sentence  Lab
0    are they saddddd?  1.0
1   I don't want to go  NaN
2              heyyyyy  1.0
3                12333  1.0
4               00unit  NaN
5               00wolf  NaN
6                01man  NaN
7                20595  NaN
8              2091996  NaN
9           03dumbdumb  NaN
</code></pre>
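<p>To see why <code>.astype(bool)</code> works as the row filter, here is a minimal sketch using only the standard <code>re</code> module. The helper name <code>repeated_runs</code> is made up for illustration. The pattern has two groups, so <code>findall</code> returns <code>(whole_run, single_char)</code> tuples; an empty result list is falsy, a non-empty one is truthy.</p>

```python
import re

pattern = re.compile(r'((\w)\2{2,})')

def repeated_runs(text):
    """Return every run of 3+ identical word characters in text (hypothetical helper)."""
    # Group 2 captures one character; \2{2,} requires it to repeat at least twice more.
    # findall yields (whole_run, single_char) tuples because there are two groups.
    return [run for run, _char in pattern.findall(text)]

for sample in [' are they saddddd?', ' 12333', ' 00unit', ' 2091996']:
    runs = repeated_runs(sample)
    # bool(runs) is the same truthiness test that .astype(bool) applies per row
    print(repr(sample), runs, bool(runs))
```

<p>Note that <code>' 2091996'</code> yields no match: <code>99</code> is only two characters long, and the pattern needs three.</p>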
<hr/>
<p>Or use <code>pattern = r'(([a-zA-Z0-9])\2{2,})'</code> if you don't want to match underscores.</p>
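<p>The difference between the two patterns can be checked with a small standalone sketch. The variable names and the sample string <code>'foo___bar'</code> are illustrative assumptions, not part of the original data.</p>

```python
import re

# \w includes the underscore; the explicit class does not
with_underscore = re.compile(r'((\w)\2{2,})')
alnum_only = re.compile(r'(([a-zA-Z0-9])\2{2,})')

text = 'foo___bar'  # illustrative input containing a run of three underscores

print(bool(with_underscore.search(text)))  # True: '___' counts as a run
print(bool(alnum_only.search(text)))       # False: underscores are excluded

# Both patterns still agree on ordinary letter runs
print(alnum_only.search('heyyyyy').group(1))  # yyyyy
```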
<hr/>
<pre><code># Reuses df from the example above
pattern = r'(([a-zA-Z0-9])\2{2,})'
S = df.Sentence.str.findall(pattern)
# An empty match list becomes False, then 0; a non-empty one becomes 1
df['Lab'] = S.astype(bool).astype(int)

In [13]: df
Out[13]:
              Sentence  Lab
0    are they saddddd?    1
1   I don't want to go    0
2              heyyyyy    1
3                12333    1
4               00unit    0
5               00wolf    0
6                01man    0
7                20595    0
8              2091996    0
9           03dumbdumb    0
</code></pre>