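<p>I have a dataset from which I am trying to extract plain town names out of the longer, messier strings shown here. Most of them are followed by a parenthesized part, "(.*)", but some do not follow this pattern and end with ":" instead (see row 200). Finally, some have no parentheses at all and separate the parts with a comma "," (see rows 240 and 246).</p>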
<pre><code> 'Region'
196 Boston (Boston University, Boston College, Bos...
197 Bridgewater (Bridgewater State College)[2]
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill (Boston College)
200 The Colleges of Worcester Consortium:
201 Dudley (Nichols College)
240 Faribault, South Central College
241 Mankato (Minnesota State University, Mankato),...
242 Marshall (Southwest Minnesota State University...
243 Moorhead (Minnesota State University, Moorhead...
244 Morris (University of Minnesota Morris)[2]
245 Northfield (Carleton College, St. Olaf College...
246 North Mankato, South Central College
247 St. Cloud (St. Cloud State University, The Col...
248 St. Joseph (College of Saint Benedict)[2]
249 St. Peter (Gustavus Adolphus College)[2]
</code></pre>
<p>What I would ideally like to see is:</p>
<pre><code> 'RegionName'
196 Boston
197 Bridgewater
198 Cambridge
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato
242 Marshall
243 Moorhead
244 Morris
245 Northfield
246 North Mankato
247 St. Cloud
248 St. Joseph
249 St. Peter
</code></pre>
<p>My current code is:</p>
<pre><code>df['RegionName'] = df['Region'].str.extract('(.*)[:(,]', expand=False)
</code></pre>
<p>But this gives me a strange result, where the rows containing parentheses are not cut off at the right place:</p>
<pre><code>196 Boston (Boston University, Boston College, Bos...
197 Bridgewater
198 Cambridge (Harvard University, Massachusetts I...
199 Chestnut Hill
200 The Colleges of Worcester Consortium
201 Dudley
240 Faribault
241 Mankato (Minnesota State University, Mankato)
242 Marshall
243 Moorhead (Minnesota State University, Moorhead
244 Morris
245 Northfield (Carleton College
246 North Mankato
247 St. Cloud (St. Cloud State University
248 St. Joseph
249 St. Peter
</code></pre>
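<p>A likely explanation for this output is that <code>.*</code> is greedy: it runs as far as it can and then backtracks only to the <em>last</em> <code>:</code>, <code>(</code> or <code>,</code> in the string, not the first. The sketch below reproduces this with Python's <code>re</code> module and compares it with a lazy, anchored variant. The helper <code>first_group</code>, the lazy pattern, and the sample strings <code>"Town (College A, College B)"</code> and <code>"Plainville"</code> are made up for illustration and are not part of the dataset.</p>

```python
import re

# Pattern from the question: greedy `.*` extends to the LAST delimiter.
GREEDY = re.compile(r'(.*)[:(,]')
# Assumed alternative: lazy `.*?` anchored at the start stops at the FIRST
# delimiter; `\s*` keeps the space before "(" out of the captured group.
LAZY = re.compile(r'^(.*?)\s*[:(,]')


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


samples = [
    "Bridgewater (Bridgewater State College)[2]",
    "The Colleges of Worcester Consortium:",
    "Faribault, South Central College",
    "Town (College A, College B)",  # hypothetical row: comma inside parentheses
    "Plainville",                   # hypothetical row: no delimiter at all
]

for text in samples:
    print(repr(first_group(GREEDY, text)), "|", repr(first_group(LAZY, text)))
```

<p>If this reading is right, passing the lazy pattern to <code>str.extract</code> in place of the original one should give the shorter names. Note that a row containing none of the three delimiters does not match either pattern, so it would come back as missing rather than as the full string.</p>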
<p>I have also tried:</p>
<pre><code>df['RegionName'] = df['Region'].str.extract('(.*)[ (.*|:|,]', expand=False)
</code></pre>
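<p>I am not sure how to extract the string using all three patterns at once. A two-line solution would be fine as well.
Thanks (and apologies if the formatting is poor!)</p>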
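<p>One possible way to handle all three cases at once, offered as a sketch rather than a definitive answer, is to split on the first <code>:</code>, <code>(</code> or <code>,</code> and keep whatever comes before it. The function name <code>town_name</code> and the sample string <code>"Plainville"</code> are hypothetical; the other strings are taken from the rows shown above.</p>

```python
import re

# Any one of the three delimiters seen in the data: colon, "(" or comma.
DELIMITERS = re.compile(r'[:(,]')


def town_name(region):
    """Keep the text before the first delimiter and trim surrounding spaces.

    A string without any delimiter is returned unchanged (apart from trimming).
    """
    return DELIMITERS.split(region, maxsplit=1)[0].strip()


for region in [
    "St. Joseph (College of Saint Benedict)[2]",
    "North Mankato, South Central College",
    "The Colleges of Worcester Consortium:",
    "Plainville",  # hypothetical row with no delimiter
]:
    print(town_name(region))
```

<p>Assuming <code>df['Region']</code> holds only strings, this could be applied with <code>df['RegionName'] = df['Region'].map(town_name)</code>.</p>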