<p>Here is a working solution that uses a list of English words. It is not accurate for {<cd1>} and {<cd2>}, but as you said, 100% accuracy is hard to achieve.</p>
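<pre><code>import pandas as pd

# df is the DataFrame from the question, with a 'Company Name' column
url = 'https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt'
# keep_default_na=False keeps entries such as "nan" and "null" as strings instead of NaN
words = set(pd.read_csv(url, header=None, keep_default_na=False)[0])

w1 = df['Company Name'].str.split()
m1 = ~w1.str[0].str.lower().isin(words)  # first word is not an English word
m2 = w1.str[0].str.len().le(4)  # first word has at most 4 characters
df.loc[m1 & m2, 'Company Name'] = w1.str[0].str.upper() + ' ' + w1.str[1:].str.join(' ')
print(df)
</code></pre>
<p>Output:</p>
<pre><code>     Company Name
0        Visa Inc
1        MSCI Inc
2   Coca Cola Inc
3        PNC Bank
4        AIG Corp
5   Td Ameritrade
6        UBER Inc
7      Costco Inc
8  New York Times
</code></pre>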
<p><strong>Note</strong>: I also tried this with the <code>nltk</code> package, but apparently the <code>nltk.corpus.words</code> module is far from containing a complete list of English words.</p>
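<p>To try the same heuristic without downloading the full word list, here is a minimal sketch that wraps it in a function. The name <code>uppercase_acronyms</code>, the <code>max_len</code> parameter, and the tiny <code>sample_words</code> set are assumptions made for illustration; they are not part of the answer above, and the sample set only stands in for the real <code>words_alpha.txt</code> list.</p>

```python
import pandas as pd


def uppercase_acronyms(names: pd.Series, words: set, max_len: int = 4) -> pd.Series:
    """Upper-case the first word of a name when it is short and not in `words`.

    Hypothetical helper: same heuristic as the answer, but it returns a new
    Series instead of modifying a DataFrame in place.
    """
    parts = names.str.split()
    first = parts.str[0]
    not_a_word = ~first.str.lower().isin(words)  # first word is not in the word list
    is_short = first.str.len().le(max_len)       # first word has at most max_len characters
    mask = not_a_word & is_short
    # Rebuild the name with the first word upper-cased; strip handles one-word names.
    rebuilt = (first.str.upper() + ' ' + parts.str[1:].str.join(' ')).str.strip()
    return names.where(~mask, rebuilt)


# Tiny stand-in for the real word list (assumption for illustration only).
sample_words = {'visa', 'coca', 'cola', 'new', 'york', 'times', 'bank', 'td'}
sample = pd.Series(['Visa Inc', 'Msci Inc', 'Pnc Bank', 'Td Ameritrade', 'Costco Inc'])

result = uppercase_acronyms(sample, sample_words)
print(result.tolist())
```

<p>With this sample set, <code>Td Ameritrade</code> stays unchanged because <code>td</code> is treated as a dictionary word, which shows the kind of miss the heuristic cannot avoid.</p>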