Matching states and cities with multiple words

['Alabama[edit]', 'Auburn (Auburn University)[1]', 'Florence (University of North Alabama)', 'Jacksonville (Jacksonville State University)[2]', 'Livingston (University of West Alabama)[2]', 'Montevallo (University of Montevallo)[2]', 'Troy (Troy University)[2]', 'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]', 'Tuskegee (Tuskegee University)[5]', 'Alaska[edit]', 'Fairbanks (University of Alaska Fairbanks)[2]', 'Arizona[edit]', 'Flagstaff (Northern Arizona University)[6]', 'Tempe (Arizona State University)', 'Tucson (University of Arizona)', 'Arkansas[edit]', 'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]', 'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]', 'Fayetteville (University of Arkansas)[7]']

import numpy as np
import pandas as pd

def get_list_of_university_towns():
    '''Returns a DataFrame of towns and the states they are in from the
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ],
    columns=["State", "RegionName"] )

    The following cleaning needs to be done:

    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'. '''

    fhandle = open("university_towns.txt")
    ftext = fhandle.read().split("\n")

    reftext = list()
    for item in ftext:
        reftext.append(item.split(" ")[0])

    #pos = reftext[0].find("[")
    #reftext[0] = reftext[0][:pos]

    towns = list()
    dic = dict()

    for item in reftext:
        if item == "Alabama[edit]":
            state = "Alabama"
        elif item.endswith("[edit]"):
            dic[state] = towns
            towns = list()
            pos = item.find("[")
            item = item[:pos]
            state = item
        else:
            towns.append(item)
    return ftext

get_list_of_university_towns()

{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'], 'Alaska': ['Fairbanks'], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'], 'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville', 'Jonesboro', 'Magnolia', 'Monticello', 'Russellville', 'Searcy'], 'California': ['Angwin', 'Arcata', 'Berkeley', 'Chico', 'Claremont', 'Cotati', 'Davis', 'Irvine', 'Isla', 'University', 'Merced', 'Orange', 'Palo', 'Pomona', 'Redlands', 'Riverside', 'Sacramento', 'University', 'San', 'San', 'Santa', 'Santa', 'Turlock', 'Westwood,', 'Whittier'], 'Colorado': ['Alamosa', 'Boulder', 'Durango', 'Fort', 'Golden', 'Grand', 'Greeley', 'Gunnison', 'Pueblo,'], 'Connecticut': ['Fairfield', 'Middletown', 'New', 'New', 'New', 'Storrs', 'Willimantic'], 'Delaware': ['Dover', 'Newark'], 'Florida': ['Ave', 'Boca', 'Coral', 'DeLand', 'Estero', 'Gainesville', 'Orlando', 'Sarasota', 'St.', 'St.', 'Tallahassee', 'Tampa'], 'Georgia': ['Albany', 'Athens', 'Atlanta', 'Carrollton', 'Demorest', 'Fort', 'Kennesaw', 'Milledgeville', 'Mount', 'Oxford', 'Rome', 'Savannah', 'Statesboro', 'Valdosta', 'Waleska', 'Young'], 'Hawaii': ['Manoa'],

3 Answers

User

Answer 1 · edited 2024-10-01 02:36:35

You should change

fhandle = open("university_towns.txt")
ftext = fhandle.read().split("\n") 

# to

with open("university_towns.txt","r") as f:
    d = f.readlines()

# file is autoclosed here, lines are autosplit by readlines()

A solution without regular expressions:

[code block not preserved]

Output (reformatted):

[output not preserved]
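The regex-free code of this answer is not preserved above. As a rough sketch of what such an approach could look like (not the answerer's code), the function below walks the lines once and keeps everything before " (" as the town name, so multi-word names stay intact. The name `parse_university_towns` and the Colorado lines in `sample` are made up for illustration; the other lines come from the list in the question.

```python
def parse_university_towns(lines):
    """Map each state to its list of towns, keeping multi-word names intact."""
    result = {}
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith("[edit]"):
            # State header: drop everything from "[" onwards.
            state = line.split("[", 1)[0].strip()
            result[state] = []
        elif state is not None:
            # Town line: drop everything from " (" onwards.
            result[state].append(line.split(" (", 1)[0].strip())
    return result


sample = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
    "Colorado[edit]",                            # illustrative line
    "Fort Collins (Colorado State University)",  # illustrative line
]
print(parse_university_towns(sample))
```

The key difference from the code in the question is that the line is split on " (" rather than on " ", which is what cut "Fort Collins" down to "Fort".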

User

Answer 2 · edited 2024-10-01 02:36:35

Behold the power of regular expressions:

import re

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)

transformed = '\n'.join(lst_)

result = {state.group('state'): [city.group(0).rstrip() 
        for city in cities_rx.finditer(state.group('cities'))] 
        for state in states_rx.finditer(transformed)}
print(result)

This yields

[output not preserved]
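To try the snippet above end to end, here is a self-contained sketch that uses the same two patterns. It assumes `lst_` holds the raw lines; only the first few entries from the question's list are used here.

```python
import re

# A short excerpt of the raw lines from the question (assumed to be `lst_`).
lst_ = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

transformed = '\n'.join(lst_)

result = {
    state.group('state'): [
        city.group(0).rstrip()
        for city in cities_rx.finditer(state.group('cities'))
    ]
    for state in states_rx.finditer(transformed)
}
print(result)
```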

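Explanation:

The idea is to split the task into several smaller tasks:

1. Join the complete list
2. Separate the states
3. Separate the towns
4. Use a dict comprehension over all the items found

First subtask

transformed = '\n'.join(lst_)

Second subtask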

^                      # match start of the line
(?P<state>.+?)\[edit\] # capture anything in that line up to [edit]
(?P<cities>[\s\S]+?)   # afterwards match anything up to
(?=^.*\[edit\]$|\Z)    # ... either another state or the very end of the string

See the demo on regex101.com.

Third subtask

^[^()\n]+              # match start of the line, anything not a newline character or ( or )

See another demo on regex101.com.
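As a small illustration of the third subtask, the sketch below runs the town pattern on a block of text shaped like what the `cities` group would capture. The two lines in `cities_block` are illustrative and not copied from the file.

```python
import re

cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

# Text shaped like one state's `cities` group (illustrative lines).
cities_block = (
    "\nFort Collins (Colorado State University)"
    "\nGrand Junction (Colorado Mesa University)[2]\n"
)

# Each match starts at the beginning of a line and stops at the first "(",
# so spaces inside the name are kept and only the trailing one is stripped.
towns = [m.group(0).rstrip() for m in cities_rx.finditer(cities_block)]
print(towns)
```

Because the character class excludes only parentheses and newlines, a line without parentheses is matched in full, including any footnote marker such as "[2]"; such lines would need extra cleaning.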

Fourth subtask

result = {state.group('state'): [city.group(0).rstrip() for city in cities_rx.finditer(state.group('cities'))] for state in states_rx.finditer(transformed)}

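This is roughly equivalent to:

result = {}
for state in states_rx.finditer(transformed):
    # the state name is in state.group('state')
    cities = []
    for city in cities_rx.finditer(state.group('cities')):
        # the city is in city.group(0), possibly with trailing whitespace,
        # hence the rstrip
        cities.append(city.group(0).rstrip())
    result[state.group('state')] = cities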
Finally, some timings:

import timeit
print(timeit.timeit(findstatesandcities, number=10**5))
# 12.234304904000965

So running the above 100,000 times takes about 12 seconds on my machine, which should be reasonably fast.
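The answer does not show how `findstatesandcities` is defined. One plausible shape, assuming it simply wraps the join and the dict comprehension, is sketched below with a short excerpt of the data and a smaller `number` so it finishes quickly; timings will differ by machine and input size.

```python
import re
import timeit

# Short excerpt of the question's list, standing in for the full data.
lst_ = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)
cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)


def findstatesandcities():
    # Assumed body: the answer only shows the function's name.
    transformed = '\n'.join(lst_)
    return {
        state.group('state'): [
            city.group(0).rstrip()
            for city in cities_rx.finditer(state.group('cities'))
        ]
        for state in states_rx.finditer(transformed)
    }


elapsed = timeit.timeit(findstatesandcities, number=1000)
print(f"{elapsed:.4f} s for 1000 runs")
```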

User

Answer 3 · edited 2024-10-01 02:36:35

Let's solve your problem step by step:

First step:

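Collect all the data. Here I insert a marker word, 'pos_flag', whenever a state name appears; it is used later to track positions and chunk the list:

import re
pattern=r'\w+(?=\[edit\])'

track=[]
with open('mon.txt','r') as f:
    for line in f:
        match=re.search(pattern,line)
        if match:
            # state line: add the marker, then the state name without "[edit]"
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:
            # town line: keep everything before the first "("
            track.append(line.strip().split('(')[0])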

This produces output like:

[output not preserved]

The marker 'pos_flag' now appears before every state name; let's put it to use:

Second step:

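Track the indices of all 'pos_flag' entries in the list:

index_no=[]
for index,value in enumerate(track):
    if value=='pos_flag':
        index_no.append(index)

This produces output like: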

[0, 10, 13, 18, 28, 55, 66, 75, 79, 93, 111, 114, 119, 131, 146, 161, 169, 182, 192, 203, 215, 236, 258, 274, 281, 292, 297, 306, 310, 319, 331, 338, 371, 391, 395, 419, 432, 444, 489, 493, 506, 512, 527, 551, 559, 567, 581, 588, 599, 614]
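The same index collection can be written as a list comprehension. A minimal sketch on a shortened `track` list (the values are an excerpt chosen for illustration):

```python
# Shortened example of the `track` list built in the first step.
track = ['pos_flag', 'Alabama', 'Auburn ', 'Florence ',
         'pos_flag', 'Alaska', 'Fairbanks ']

# Positions of every marker; each one starts a new state's chunk.
index_no = [i for i, value in enumerate(track) if value == 'pos_flag']
print(index_no)  # [0, 4]
```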

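Now that we have the index numbers, we can use them to chunk the list:

Last step:

Use the index numbers to chunk the list, setting the first word of each chunk as the dict key and the remaining words as the dict value: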

city_dict={}
for i in range(0,len(index_no),1):
    try:
        value_1=track[index_no[i:i + 2][0]:index_no[i:i + 2][1]]
        city_dict[value_1[1]]=value_1[2:]
    except IndexError:
        # last chunk has no following marker: slice to the end, skipping marker and state name
        city_dict[track[index_no[i:i + 2][0]:][1]]=track[index_no[i:i + 2][0]:][2:]

print(city_dict)

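Output:

Because dicts are unordered in Python 3.5, the output order differs from the input file (from Python 3.7 on, dicts preserve insertion order):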

{'Kentucky': ['Bowling Green ', 'Columbia ', 'Georgetown ', 'Highland Heights ', 'Lexington ', 'Louisville ', 'Morehead ', 'Murray ', 'Richmond ', 'Williamsburg ', 'Wilmore '], 'Mississippi': ['Cleveland ', 'Hattiesburg ', 'Itta Bena ', 'Oxford ', 'Starkville '], 'Wisconsin': ['Appleton ', 'Eau Claire ', 'Green Bay ', 'La Crosse ', 'Madison ', 'Menomonie ', 'Milwaukee ',
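The chunking step can also be written without the try/except by pairing each marker index with the next one. This is an alternative sketch, not the answerer's code; it also strips the trailing spaces visible in the output above.

```python
# Shortened example inputs (an excerpt chosen for illustration).
track = ['pos_flag', 'Alabama', 'Auburn ', 'Florence ',
         'pos_flag', 'Alaska', 'Fairbanks ']
index_no = [0, 4]

city_dict = {}
# Pair each marker index with the following one; the last chunk ends at len(track).
for start, end in zip(index_no, index_no[1:] + [len(track)]):
    chunk = track[start:end]  # ['pos_flag', state, town, town, ...]
    city_dict[chunk[1]] = [town.strip() for town in chunk[2:]]

print(city_dict)
```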

Full code:

import re
pattern=r'\w+(?=\[edit\])'

track=[]
with open('mon.txt','r') as f:
    for line in f:
        match=re.search(pattern,line)
        if match:
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:
            track.append(line.strip().split('(')[0])


index_no=[]
for index,value in enumerate(track):
    if value=='pos_flag':
        index_no.append(index)


city_dict={}
for i in range(0,len(index_no),1):
    try:
        value_1=track[index_no[i:i + 2][0]:index_no[i:i + 2][1]]
        city_dict[value_1[1]]=value_1[2:]
    except IndexError:
        # last chunk has no following marker: slice to the end, skipping marker and state name
        city_dict[track[index_no[i:i + 2][0]:][1]]=track[index_no[i:i + 2][0]:][2:]

print(city_dict)
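The docstring in the question asks for State/RegionName rows rather than a dict. A minimal sketch of that last conversion, assuming a `city_dict` shaped like the one printed above (the values here are a short excerpt):

```python
# Excerpt of a state -> towns mapping, as produced by the code above.
city_dict = {'Alabama': ['Auburn', 'Florence'], 'Alaska': ['Fairbanks']}

# Flatten to one [state, town] row per town, stripping stray whitespace.
rows = [[state, town.strip()]
        for state, towns in city_dict.items()
        for town in towns]
print(rows)

# With pandas installed, the frame could then be built as:
# df = pd.DataFrame(rows, columns=["State", "RegionName"])
```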

Second solution:

If you want to use regex, try this small solution:

import re
pattern=r'((\w+\[edit\])(?:(?!^\w+\[edit\]).)*)'
with open('file.txt','r') as f:
    prt=re.finditer(pattern,f.read(),re.DOTALL | re.MULTILINE)

    for line in prt:
        dict_p={}
        match = []
        match.append(line.group(1))
        dict_p[match[0].split('\n')[0].strip().split('[')[0]]= [i.split('(')[0].strip() for i in match[0].split('\n')[1:][:-1]]

        print(dict_p)

It will give:

{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee']}
{'Alaska': ['Fairbanks']}
{'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}
{'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville', 'Jonesboro', 'Magnolia', 'Monticello', 'Russellville', 'Searcy']}
{'California': ['Angwin', 'Arcata', 'Berkeley', 'Chico', 'Claremont', 'Cotati', 'Davis', 'Irvine', 'Isla Vista', 'University Park, Los Angeles', 'Merced', 'Orange', 'Palo Alto', 'Pomona', 'Redlands', 'Riverside', 'Sacramento', 'University District, San Bernardino', 'San Diego', 'San Luis Obispo', 'Santa Barbara', 'Santa Cruz', 'Turlock', 'Westwood, Los Angeles', 'Whittier']}
{'Colorado': ['Alamosa', 'Boulder', 'Durango', 'Fort Collins', 'Golden', 'Grand Junction', 'Greeley', 'Gunnison', 'Pueblo, Colorado']}
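One caveat worth checking against the full file: `\w+` does not match spaces, so a header such as "New Hampshire[edit]" would not be recognized as the start of a new state by this pattern. A possible variant, sketched here under that assumption, uses `[^\[\n]+` for the state name and collects everything into one dict. The New Hampshire lines in `text` are illustrative.

```python
import re

# Small sample; the New Hampshire lines are illustrative.
text = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "New Hampshire[edit]\n"
    "Durham (University of New Hampshire)\n"
)

# [^\[\n]+ instead of \w+ lets the state name contain spaces.
pattern = r'^([^\[\n]+)\[edit\]\n((?:(?!^[^\[\n]+\[edit\]).)*)'

result = {}
for m in re.finditer(pattern, text, re.DOTALL | re.MULTILINE):
    # group(1) is the state name, group(2) the block of town lines below it.
    towns = [line.split('(')[0].strip()
             for line in m.group(2).splitlines() if line.strip()]
    result[m.group(1)] = towns

print(result)
```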

