<p>Let's solve your problem step by step:</p>
<blockquote>
<p>First step:</p>
</blockquote>
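<p>Collect all the data. Here I put a tracking word, <code>'pos_flag'</code>, in front of every state name; it is used later to track the states and chunk the list:</p>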
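<pre><code>import re

# Lookahead: a run of word characters immediately followed by "[edit]" marks a state line.
pattern = r'\w+(?=\[edit\])'

track = []
with open('mon.txt', 'r') as f:
    for line in f:
        if re.search(pattern, line):
            # Marker first, then the state name without the "[edit]" suffix.
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:
            # Town line: keep only the text before the first "(".
            track.append(line.strip().split('(')[0])
</code></pre>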
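<p>To see what the lookahead does without needing the input file, here is a minimal sketch that runs the same logic on a few in-memory lines. The sample lines and the helper name <code>build_track</code> are assumptions made for illustration; the real file may differ.</p>

```python
import re

pattern = r'\w+(?=\[edit\])'

# Hypothetical sample lines in the format the code above expects.
sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'New Hampshire[edit]',
    'Durham (University of New Hampshire)',
]


def build_track(lines):
    """Return the flat list with a 'pos_flag' marker before each state name."""
    track = []
    for line in lines:
        if re.search(pattern, line):
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:
            track.append(line.strip().split('(')[0])
    return track


track_demo = build_track(sample)
print(track_demo)
```

<p>Note that <code>re.search</code> only has to find <code>Hampshire[edit]</code> somewhere in the line to flag it; the full name <code>New Hampshire</code> comes from the <code>split('[')</code>, so multi-word state names survive.</p>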
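<p>It produces output like this:</p>
<pre><code>['pos_flag', 'Alabama', 'Auburn ', 'Florence ', 'Jacksonville ', 'Livingston ', 'Montevallo ', 'Troy ', 'Tuscaloosa ', 'Tuskegee ', 'pos_flag', 'Alaska', 'Fairbanks ', 'pos_flag', 'Arizona', ...
</code></pre>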
<p>As you can see, <code>'pos_flag'</code> appears before every state name. Now let's put that marker to use:</p>
<blockquote>
<p>Second step:</p>
</blockquote>
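<p>Record the index of every <code>'pos_flag'</code> entry in the list:</p>
<pre><code>index_no = []
for index, value in enumerate(track):
    if value == 'pos_flag':
        index_no.append(index)
print(index_no)
</code></pre>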
<p>This produces output like this:</p>
<pre><code>[0, 10, 13, 18, 28, 55, 66, 75, 79, 93, 111, 114, 119, 131, 146, 161, 169, 182, 192, 203, 215, 236, 258, 274, 281, 292, 297, 306, 310, 319, 331, 338, 371, 391, 395, 419, 432, 444, 489, 493, 506, 512, 527, 551, 559, 567, 581, 588, 599, 614]
</code></pre>
<p>Now that we have the index numbers, we can use them to slice the list into one chunk per state:</p>
<blockquote>
<p>Last step:</p>
</blockquote>
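<p>Use the index numbers to chunk the list, then set the first word of each chunk (after the marker) as the dict key and the remaining words as the dict value:</p>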
<pre><code>city_dict = {}
# Pair each marker with the next one; None makes the last slice run to the end.
for start, end in zip(index_no, index_no[1:] + [None]):
    chunk = track[start:end]
    # chunk[0] is 'pos_flag', chunk[1] the state name, the rest its towns.
    city_dict[chunk[1]] = chunk[2:]
print(city_dict)
</code></pre>
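<p>Output:</p>
<p>Because dicts are unordered in Python 3.5 (insertion order is only guaranteed from Python 3.7 onward), the output order differs from the input file:</p>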
<pre><code>{'Kentucky': ['Bowling Green ', 'Columbia ', 'Georgetown ', 'Highland Heights ', 'Lexington ', 'Louisville ', 'Morehead ', 'Murray ', 'Richmond ', 'Williamsburg ', 'Wilmore '], 'Mississippi': ['Cleveland ', 'Hattiesburg ', 'Itta Bena ', 'Oxford ', 'Starkville '], 'Wisconsin': ['Appleton ', 'Eau Claire ', 'Green Bay ', 'La Crosse ', 'Madison ', 'Menomonie ', 'Milwaukee ',
</code></pre>
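<p>If the original state order matters on Python 3.5, <code>collections.OrderedDict</code> keeps keys in insertion order. A minimal sketch, assuming a shortened, hypothetical <code>track</code> list in the format built in the first step:</p>

```python
from collections import OrderedDict

# Hypothetical, shortened track list in the format produced by the first step.
track = ['pos_flag', 'Alabama', 'Auburn ', 'Florence ',
         'pos_flag', 'Alaska', 'Fairbanks ']
index_no = [i for i, v in enumerate(track) if v == 'pos_flag']

ordered = OrderedDict()
# Pair each marker index with the next one; None slices to the end of the list.
for start, end in zip(index_no, index_no[1:] + [None]):
    chunk = track[start:end]
    ordered[chunk[1]] = chunk[2:]

print(list(ordered))
```

<p>The keys come back in file order regardless of the Python version.</p>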
<p>Full code:</p>
<pre><code>import re

# Lookahead: a run of word characters immediately followed by "[edit]" marks a state line.
pattern = r'\w+(?=\[edit\])'

track = []
with open('mon.txt', 'r') as f:
    for line in f:
        if re.search(pattern, line):
            # Marker first, then the state name without the "[edit]" suffix.
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:
            # Town line: keep only the text before the first "(".
            track.append(line.strip().split('(')[0])

# Positions of every marker in the flat list.
index_no = []
for index, value in enumerate(track):
    if value == 'pos_flag':
        index_no.append(index)

city_dict = {}
# Pair each marker with the next one; None makes the last slice run to the end.
for start, end in zip(index_no, index_no[1:] + [None]):
    chunk = track[start:end]
    # chunk[0] is 'pos_flag', chunk[1] the state name, the rest its towns.
    city_dict[chunk[1]] = chunk[2:]
print(city_dict)
</code></pre>
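<p>The three steps can also be folded into a single pass that remembers the current state instead of inserting markers. This is a sketch of an alternative, not the approach above; the function name <code>parse_towns</code> and the sample lines are hypothetical.</p>

```python
import re

# A line is treated as a state header when "[edit]" follows its leading text.
HEADER = re.compile(r'^(.+?)\[edit\]')


def parse_towns(lines):
    """Map each state header to the list of town names that follow it."""
    result = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            result[current] = []
        elif current is not None:
            # Keep only the text before the first "(" and trim the space.
            result[current].append(line.split('(')[0].strip())
    return result


# Hypothetical sample lines; pass an open file object to parse a real file.
sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Troy (Troy University)[2]',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]
towns = parse_towns(sample)
print(towns)
```

<p>Unlike the code above, this variant also strips the trailing space from each town name.</p>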
<blockquote>
<p>Second solution:</p>
</blockquote>
<p>If you would rather do it with a regex, try this shorter solution:</p>
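<pre><code>import re

# A state header, then every character up to (not including) the next header line.
# [\w ]+ allows spaces, so multi-word names such as "New Hampshire" count as headers.
pattern = r'(([\w ]+\[edit\])(?:(?!^[\w ]+\[edit\]).)*)'

with open('file.txt', 'r') as f:
    prt = re.finditer(pattern, f.read(), re.DOTALL | re.MULTILINE)

for m in prt:
    lines = m.group(1).split('\n')
    # First line is the state header; the remaining non-empty lines are its towns.
    state = lines[0].strip().split('[')[0]
    dict_p = {state: [i.split('(')[0].strip() for i in lines[1:] if i.strip()]}
    print(dict_p)
</code></pre>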
<p>It gives:</p>
<pre><code>{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee']}
{'Alaska': ['Fairbanks']}
{'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}
{'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville', 'Jonesboro', 'Magnolia', 'Monticello', 'Russellville', 'Searcy']}
{'California': ['Angwin', 'Arcata', 'Berkeley', 'Chico', 'Claremont', 'Cotati', 'Davis', 'Irvine', 'Isla Vista', 'University Park, Los Angeles', 'Merced', 'Orange', 'Palo Alto', 'Pomona', 'Redlands', 'Riverside', 'Sacramento', 'University District, San Bernardino', 'San Diego', 'San Luis Obispo', 'Santa Barbara', 'Santa Cruz', 'Turlock', 'Westwood, Los Angeles', 'Whittier']}
{'Colorado': ['Alamosa', 'Boulder', 'Durango', 'Fort Collins', 'Golden', 'Grand Junction', 'Greeley', 'Gunnison', 'Pueblo, Colorado']}
</code></pre>
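<p>The regex version prints one dict per state. To get a single dict instead, the matches can be collected into one mapping. A minimal sketch on a hypothetical in-memory string; note that the pattern here uses <code>[\w ]+</code> for the state name, which is an assumption so that names containing spaces are handled:</p>

```python
import re

# Assumed variant of the pattern: [\w ]+ lets the state name contain spaces.
pattern = r'(([\w ]+\[edit\])(?:(?!^[\w ]+\[edit\]).)*)'

# Hypothetical sample text standing in for the contents of the input file.
text = (
    'Alabama[edit]\n'
    'Auburn (Auburn University)[1]\n'
    'Troy (Troy University)[2]\n'
    'New Hampshire[edit]\n'
    'Durham (University of New Hampshire)\n'
)

all_states = {}
for m in re.finditer(pattern, text, re.DOTALL | re.MULTILINE):
    lines = m.group(1).split('\n')
    state = lines[0].strip().split('[')[0]
    # Skip empty strings left over from the trailing newline of each chunk.
    all_states[state] = [l.split('(')[0].strip() for l in lines[1:] if l.strip()]

print(all_states)
```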
<p><a href="https://regex101.com/r/V0H5vz/7" rel="nofollow noreferrer">Demo on regex101</a></p>