Matching states and cities with multi-word names

Posted 2024-10-01 02:36:35


I have a Python list with elements like the following:

['Alabama[edit]',
 'Auburn (Auburn University)[1]',
 'Florence (University of North Alabama)',
 'Jacksonville (Jacksonville State University)[2]',
 'Livingston (University of West Alabama)[2]',
 'Montevallo (University of Montevallo)[2]',
 'Troy (Troy University)[2]',
 'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]',
 'Tuskegee (Tuskegee University)[5]',
 'Alaska[edit]',
 'Fairbanks (University of Alaska Fairbanks)[2]',
 'Arizona[edit]',
 'Flagstaff (Northern Arizona University)[6]',
 'Tempe (Arizona State University)',
 'Tucson (University of Arizona)',
 'Arkansas[edit]',
 'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]',
 'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]',
 'Fayetteville (University of Arkansas)[7]']

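The list is not complete, but it is enough to give you an idea of its contents.

The data is structured as follows:

There is the name of a US state, followed by the names of some cities in that state. As you can see, state names end with "[edit]", and city names end either with a number in brackets (e.g. "[1]" or "[2]") or with university names in parentheses (e.g. "(University of North Alabama)").

(The complete reference file for this question can be found here.)

Ideally, I want a Python dictionary with the state names as keys and, for each key, a nested list of all the city names in that state as the value. For example, the dictionary should look like this: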

^{pr2}$

I tried the following solution to strip out the unnecessary parts:

import numpy as np
import pandas as pd

def get_list_of_university_towns():
    '''
    Returns a DataFrame of towns and the states they are in from the
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ],
    columns=["State", "RegionName"]  )

    The following cleaning needs to be done:

    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'.

    '''

    fhandle = open("university_towns.txt")
    ftext = fhandle.read().split("\n")

    reftext = list()
    for item in ftext:
        reftext.append(item.split(" ")[0])

    #pos = reftext[0].find("[")
    #reftext[0] = reftext[0][:pos]

    towns = list()
    dic = dict()

    for item in reftext:
        if item == "Alabama[edit]":
            state = "Alabama"

        elif item.endswith("[edit]"):
            dic[state] = towns
            towns = list()
            pos = item.find("[")
            item = item[:pos]
            state = item

        else:
            towns.append(item)

    return ftext

get_list_of_university_towns()

A snippet of the output the code produces is shown below:

{'Alabama': ['Auburn',
  'Florence',
  'Jacksonville',
  'Livingston',
  'Montevallo',
  'Troy',
  'Tuscaloosa',
  'Tuskegee'],
 'Alaska': ['Fairbanks'],
 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
 'Arkansas': ['Arkadelphia',
  'Conway',
  'Fayetteville',
  'Jonesboro',
  'Magnolia',
  'Monticello',
  'Russellville',
  'Searcy'],
 'California': ['Angwin',
  'Arcata',
  'Berkeley',
  'Chico',
  'Claremont',
  'Cotati',
  'Davis',
  'Irvine',
  'Isla',
  'University',
  'Merced',
  'Orange',
  'Palo',
  'Pomona',
  'Redlands',
  'Riverside',
  'Sacramento',
  'University',
  'San',
  'San',
  'Santa',
  'Santa',
  'Turlock',
  'Westwood,',
  'Whittier'],
 'Colorado': ['Alamosa',
  'Boulder',
  'Durango',
  'Fort',
  'Golden',
  'Grand',
  'Greeley',
  'Gunnison',
  'Pueblo,'],
 'Connecticut': ['Fairfield',
  'Middletown',
  'New',
  'New',
  'New',
  'Storrs',
  'Willimantic'],
 'Delaware': ['Dover', 'Newark'],
 'Florida': ['Ave',
  'Boca',
  'Coral',
  'DeLand',
  'Estero',
  'Gainesville',
  'Orlando',
  'Sarasota',
  'St.',
  'St.',
  'Tallahassee',
  'Tampa'],
 'Georgia': ['Albany',
  'Athens',
  'Atlanta',
  'Carrollton',
  'Demorest',
  'Fort',
  'Kennesaw',
  'Milledgeville',
  'Mount',
  'Oxford',
  'Rome',
  'Savannah',
  'Statesboro',
  'Valdosta',
  'Waleska',
  'Young'],
 'Hawaii': ['Manoa'],

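However, there is an error in the output: states with a space in their name (e.g. "North Carolina") are not included. I can tell the reason behind it.

I thought about using regular expressions, but since I have not studied them yet, I do not know how to write one. Any ideas on how this can be done, with or without regex?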
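To illustrate the cause with a minimal sketch: splitting each line on a space keeps only its first word, so a multi-word state no longer ends in "[edit]". The two sample lines below are made up in the file's format and are not taken from the real data.

```python
# Hypothetical sample lines in the same format as university_towns.txt
lines = ["North Carolina[edit]", "Chapel Hill (University of North Carolina)[2]"]

# Splitting on a space keeps only the first word of each line
first_words = [line.split(" ")[0] for line in lines]
print(first_words)  # ['North', 'Chapel']

# 'North' does not end with "[edit]", so the loop treats it as a town
print(first_words[0].endswith("[edit]"))  # False
```

The same truncation is what turns "Palo Alto" into "Palo" in the output above.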


3 Answers

You should change

fhandle = open("university_towns.txt")
ftext = fhandle.read().split("\n") 

# to

with open("university_towns.txt","r") as f:
    d = f.readlines()

# file is autoclosed here, lines are autosplit by readlines()
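One detail to keep in mind: `readlines()` keeps the trailing newline on every line, so it usually needs to be stripped before further processing. A minimal sketch, using a small temporary file as a stand-in for university_towns.txt (its contents are illustrative only):

```python
import os
import tempfile

# Write a tiny stand-in for university_towns.txt (illustrative contents)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Alabama[edit]\nAuburn (Auburn University)[1]\n")
    path = tmp.name

with open(path, "r") as f:
    raw = f.readlines()

os.remove(path)

# Each entry still ends with "\n"; strip it before parsing
lines = [line.rstrip("\n") for line in raw]
print(raw)
print(lines)
```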

A solution without regular expressions:

^{pr2}$

Yields (reformatted):

^{3}$
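A no-regex approach along these lines can be sketched as follows. The function name `group_towns_by_state` and the sample lines are assumptions made for illustration (the "North Carolina" lines are hypothetical); this is one way to do it, not necessarily the code originally posted here.

```python
def group_towns_by_state(lines):
    """Map each state to its list of towns, using only string methods."""
    result = {}
    state = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith("[edit]"):
            # State line: drop the trailing "[edit]" marker, keep spaces
            state = line[:-len("[edit]")]
            result[state] = []
        elif state is not None:
            # Town line: keep everything before " (" or "["
            town = line.split(" (")[0].split("[")[0].strip()
            result[state].append(town)
    return result


# Sample input; the North Carolina lines are hypothetical
sample = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "North Carolina[edit]",
    "Chapel Hill (University of North Carolina)[2]",
]
towns_by_state = group_towns_by_state(sample)
print(towns_by_state)
# {'Alabama': ['Auburn', 'Florence'], 'North Carolina': ['Chapel Hill']}
```

Because the whole line is tested with `endswith("[edit]")` rather than only its first word, multi-word state and town names are kept intact.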

Behold the power of regular expressions:

import re

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)

# lst_ is the list of lines shown in the question
transformed = '\n'.join(lst_)

result = {state.group('state'): [city.group(0).rstrip() 
        for city in cities_rx.finditer(state.group('cities'))] 
        for state in states_rx.finditer(transformed)}
print(result)

This yields

^{pr2}$


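Explanation:

The idea is to split the task into several smaller ones:

  1. Join the complete list
  2. Separate the states
  3. Separate the towns
  4. Use a dict comprehension over all found items


First subtask ^{3}$

Second subtask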

^                      # match start of the line
(?P<state>.+?)\[edit\] # capture anything in that line up to [edit]
(?P<cities>[\s\S]+?)   # afterwards match anything up to
(?=^.*\[edit\]$|\Z)    # ... either another state or the very end of the string

See the demo on regex101.com.

Third subtask
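To see what the state expression captures, here is a small sketch run on a few lines from the question joined with newlines. The names `sample` and `blocks` are introduced only for this illustration.

```python
import re

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

sample = "\n".join([
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
])

# One (state, raw city block) pair per state
blocks = [(m.group('state'), m.group('cities'))
          for m in states_rx.finditer(sample)]
for state, cities in blocks:
    print(repr(state), repr(cities))
```

Each city block still contains the raw lines, including parentheses and citation markers; those are cleaned up in the next subtask.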

^[^()\n]+              # match start of the line, anything not a newline character or ( or )

See another demo on regex101.com.

Fourth subtask
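As a quick check of the town expression, the sketch below applies it to one city block of the kind captured in the second subtask. The variable names are illustrative only.

```python
import re

cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)

# A raw city block as captured by the 'cities' group
cities_block = (
    "\nAuburn (Auburn University)[1]"
    "\nFlorence (University of North Alabama)\n"
)

# Each match stops at the first "(", leaving a trailing space to strip
towns = [m.group(0).rstrip() for m in cities_rx.finditer(cities_block)]
print(towns)  # ['Auburn', 'Florence']
```

Note that a town line without parentheses but with a citation marker (a hypothetical "Sometown[5]") would be matched whole, so the marker would stay in the result and need separate handling.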

result = {state.group('state'): [city.group(0).rstrip() for city in cities_rx.finditer(state.group('cities'))] for state in states_rx.finditer(transformed)}

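This is roughly equivalent to:

result = {}
for state in states_rx.finditer(transformed):
    # state is in state.group('state')
    cities = []
    for city in cities_rx.finditer(state.group('cities')):
        # city is in city.group(0), possibly with trailing whitespace,
        # hence the rstrip
        cities.append(city.group(0).rstrip())
    result[state.group('state')] = cities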


Finally, some timing:
import timeit
print(timeit.timeit(findstatesandcities, number=10**5))
# 12.234304904000965

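So running the above 100,000 times takes about 12 seconds on my machine, so it should be reasonably fast.

Let's solve your problem step by step:

First step:

Collect all the data. Here I insert a marker word, "pos_flag", whenever a state name appears; it is used later to track and chunk the list: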
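For anyone who wants to reproduce the measurement, a minimal sketch follows. It assumes `findstatesandcities` simply wraps the join and the dict comprehension from above; the short `lst_` and the lower repetition count are chosen here only to keep the run brief, so the timing will not match the figure quoted above.

```python
import re
import timeit

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)
cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)

# Shortened input for illustration
lst_ = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]


def findstatesandcities():
    # Assumed wrapper: join the lines, then build the dict
    transformed = '\n'.join(lst_)
    return {state.group('state'): [city.group(0).rstrip()
            for city in cities_rx.finditer(state.group('cities'))]
            for state in states_rx.finditer(transformed)}


elapsed = timeit.timeit(findstatesandcities, number=1000)
print(f"{elapsed:.4f} s for 1000 runs")
```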

import re
pattern=r'\w+(?=\[edit\])'

track=[]
with open('mon.txt','r') as f:
    for line in f:
        match=re.search(pattern,line)
        if match:
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:

            track.append(line.strip().split('(')[0])

This produces output like the following:

^{pr2}$

As you can see, the word "pos_flag" now appears before every state name. Let's make use of it:

Second step:

Track the index of every "pos_flag" word in the list:

^{3}$

This produces output like the following:

[0, 10, 13, 18, 28, 55, 66, 75, 79, 93, 111, 114, 119, 131, 146, 161, 169, 182, 192, 203, 215, 236, 258, 274, 281, 292, 297, 306, 310, 319, 331, 338, 371, 391, 395, 419, 432, 444, 489, 493, 506, 512, 527, 551, 559, 567, 581, 588, 599, 614]
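On a small hand-made `track` list (shortened here for illustration), the same index collection can be written as a list comprehension:

```python
# Shortened track list in the shape produced by the first step
track = ['pos_flag', 'Alabama', 'Auburn ', 'Florence ',
         'pos_flag', 'Alaska', 'Fairbanks ']

# Positions of every marker word
index_no = [index for index, value in enumerate(track) if value == 'pos_flag']
print(index_no)  # [0, 4]
```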

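Now that we have the index numbers, we can use them to chunk the list:

Last step:

Chunk the list using the index numbers, setting the first word of each chunk as the dict key and the remaining words as the dict value: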

city_dict={}
for i in range(0,len(index_no),1):
    try:
        value_1=track[index_no[i:i + 2][0]:index_no[i:i + 2][1]]
        city_dict[value_1[1]]=value_1[2:]
    except IndexError:
        city_dict[track[index_no[i:i + 2][0]:][1]]=track[index_no[i:i + 2][0]:][2:]

print(city_dict)

Output:

Since dicts are not ordered in Python 3.5, the output order differs from the input file:

{'Kentucky': ['Bowling Green ', 'Columbia ', 'Georgetown ', 'Highland Heights ', 'Lexington ', 'Louisville ', 'Morehead ', 'Murray ', 'Richmond ', 'Williamsburg ', 'Wilmore '], 'Mississippi': ['Cleveland ', 'Hattiesburg ', 'Itta Bena ', 'Oxford ', 'Starkville '], 'Wisconsin': ['Appleton ', 'Eau Claire ', 'Green Bay ', 'La Crosse ', 'Madison ', 'Menomonie ', 'Milwaukee ', 
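As an alternative sketch of the chunking step, each marker index can be paired with the next one using `zip`, which avoids the `try`/`except` for the last chunk. Stripping the town names also removes the trailing spaces visible in the output above. The shortened `track` list is for illustration only.

```python
# Shortened track list and marker positions, for illustration
track = ['pos_flag', 'Alabama', 'Auburn ', 'Florence ',
         'pos_flag', 'Alaska', 'Fairbanks ']
index_no = [0, 4]

city_dict = {}
# Pair each marker with the next one; None makes the last slice run to the end
for start, end in zip(index_no, index_no[1:] + [None]):
    chunk = track[start:end]
    # chunk[0] is 'pos_flag', chunk[1] the state, the rest are towns
    city_dict[chunk[1]] = [town.strip() for town in chunk[2:]]

print(city_dict)
# {'Alabama': ['Auburn', 'Florence'], 'Alaska': ['Fairbanks']}
```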

Full code:

import re
pattern=r'\w+(?=\[edit\])'

track=[]
with open('mon.txt','r') as f:
    for line in f:
        match=re.search(pattern,line)
        if match:
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:

            track.append(line.strip().split('(')[0])


index_no=[]
for index,value in enumerate(track):
    if value=='pos_flag':
        index_no.append(index)


city_dict={}
for i in range(0,len(index_no),1):
    try:
        value_1=track[index_no[i:i + 2][0]:index_no[i:i + 2][1]]
        city_dict[value_1[1]]=value_1[2:]
    except IndexError:
        city_dict[track[index_no[i:i + 2][0]:][1]]=track[index_no[i:i + 2][0]:][2:]

print(city_dict)

Second solution:

If you want to use regex, try this short solution:

import re
pattern=r'((\w+\[edit\])(?:(?!^\w+\[edit\]).)*)'
with open('file.txt','r') as f:
    prt=re.finditer(pattern,f.read(),re.DOTALL | re.MULTILINE)

    for line in prt:
        dict_p={}
        match = []
        match.append(line.group(1))
        dict_p[match[0].split('\n')[0].strip().split('[')[0]]= [i.split('(')[0].strip() for i in match[0].split('\n')[1:][:-1]]

        print(dict_p)

This gives:

{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee']}
{'Alaska': ['Fairbanks']}
{'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}
{'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville', 'Jonesboro', 'Magnolia', 'Monticello', 'Russellville', 'Searcy']}
{'California': ['Angwin', 'Arcata', 'Berkeley', 'Chico', 'Claremont', 'Cotati', 'Davis', 'Irvine', 'Isla Vista', 'University Park, Los Angeles', 'Merced', 'Orange', 'Palo Alto', 'Pomona', 'Redlands', 'Riverside', 'Sacramento', 'University District, San Bernardino', 'San Diego', 'San Luis Obispo', 'Santa Barbara', 'Santa Cruz', 'Turlock', 'Westwood, Los Angeles', 'Whittier']}
{'Colorado': ['Alamosa', 'Boulder', 'Durango', 'Fort Collins', 'Golden', 'Grand Junction', 'Greeley', 'Gunnison', 'Pueblo, Colorado']}
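One caveat worth checking: `\w+` does not match spaces, so a state line such as "North Carolina[edit]" is not recognized as a block boundary by the lookahead in the pattern above. The sketch below is a variant, not the code above: it treats any whole line ending in "[edit]" as a state and collects everything into a single dict. The name `block_rx` and the sample text (including the North Carolina and North Dakota lines) are assumptions made for illustration.

```python
import re

# Group 1: state name; group 2: the following lines that do not end in "[edit]"
block_rx = re.compile(r'^(.+)\[edit\]$((?:\n(?!.*\[edit\]$).*)*)', re.MULTILINE)

# Hypothetical sample text in the file's format
text = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "North Carolina[edit]\n"
    "Chapel Hill (University of North Carolina)[2]\n"
    "North Dakota[edit]\n"
    "Fargo (North Dakota State University)\n"
)

all_states = {}
for m in block_rx.finditer(text):
    state, body = m.group(1), m.group(2)
    # Keep the part of each non-empty line before the first "("
    all_states[state] = [line.split('(')[0].strip()
                         for line in body.split('\n') if line.strip()]

print(all_states)
# {'Alabama': ['Auburn'], 'North Carolina': ['Chapel Hill'], 'North Dakota': ['Fargo']}
```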
