Answers to: Matching states and cities with multiple words



I have a Python list with elements like the following:
<pre><code>['Alabama[edit]', 'Auburn (Auburn University)[1]', 'Florence (University of North Alabama)',
 'Jacksonville (Jacksonville State University)[2]', 'Livingston (University of West Alabama)[2]',
 'Montevallo (University of Montevallo)[2]', 'Troy (Troy University)[2]',
 'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]',
 'Tuskegee (Tuskegee University)[5]', 'Alaska[edit]', 'Fairbanks (University of Alaska Fairbanks)[2]',
 'Arizona[edit]', 'Flagstaff (Northern Arizona University)[6]', 'Tempe (Arizona State University)',
 'Tucson (University of Arizona)', 'Arkansas[edit]',
 'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]',
 'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]',
 'Fayetteville (University of Arkansas)[7]']
</code></pre>
The list is incomplete, but it is enough to show what it contains.

The data is structured as follows: each US state name is followed by the names of some cities in that state. As you can see, state names end with "[edit]", and city names either end with a bracketed number (for example "[1]" or "[2]") or with a university name in parentheses (for example "(University of North Alabama)").

(The complete file referenced by this question can be found <a href="https://drive.google.com/open?id=1fun9wuneVNjKZLUXtQIDWmZFAsMoTh-8" rel="nofollow noreferrer">here</a>.)

Ideally, I want a Python dictionary with the state names as keys and, as the value for each key, a list of all the city names in that state. For example, the entry for Alaska should be <code>'Alaska': ['Fairbanks']</code>.

I tried the following solution, with the unnecessary parts stripped out:
<pre><code>import numpy as np
import pandas as pd

def get_list_of_university_towns():
    '''
    Returns a DataFrame of towns and the states they are in from the
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ],
    columns=["State", "RegionName"] )

    The following cleaning needs to be done:

    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'.
    '''
    fhandle = open("university_towns.txt")
    ftext = fhandle.read().split("\n")

    reftext = list()
    for item in ftext:
        reftext.append(item.split(" ")[0])

    #pos = reftext[0].find("[")
    #reftext[0] = reftext[0][:pos]

    towns = list()
    dic = dict()

    for item in reftext:
        if item == "Alabama[edit]":
            state = "Alabama"
        elif item.endswith("[edit]"):
            dic[state] = towns
            towns = list()
            pos = item.find("[")
            item = item[:pos]
            state = item
        else:
            towns.append(item)
    return ftext

get_list_of_university_towns()
</code></pre>
A fragment of the output produced by my code looks like this:
<pre><code>{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'],
 'Alaska': ['Fairbanks'],
 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
 'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville', 'Jonesboro', 'Magnolia', 'Monticello', 'Russellville', 'Searcy'],
 'California': ['Angwin', 'Arcata', 'Berkeley', 'Chico', 'Claremont', 'Cotati', 'Davis', 'Irvine', 'Isla', 'University', 'Merced', 'Orange', 'Palo', 'Pomona', 'Redlands', 'Riverside', 'Sacramento', 'University', 'San', 'San', 'Santa', 'Santa', 'Turlock', 'Westwood,', 'Whittier'],
 'Colorado': ['Alamosa', 'Boulder', 'Durango', 'Fort', 'Golden', 'Grand', 'Greeley', 'Gunnison', 'Pueblo,'],
 'Connecticut': ['Fairfield', 'Middletown', 'New', 'New', 'New', 'Storrs', 'Willimantic'],
 'Delaware': ['Dover', 'Newark'],
 'Florida': ['Ave', 'Boca', 'Coral', 'DeLand', 'Estero', 'Gainesville', 'Orlando', 'Sarasota', 'St.', 'St.', 'Tallahassee', 'Tampa'],
 'Georgia': ['Albany', 'Athens', 'Atlanta', 'Carrollton', 'Demorest', 'Fort', 'Kennesaw', 'Milledgeville', 'Mount', 'Oxford', 'Rome', 'Savannah', 'Statesboro', 'Valdosta', 'Waleska', 'Young'],
 'Hawaii': ['Manoa'],
</code></pre>
However, the output has one error: states with a space in their name (for example "North Carolina") are not included. I can tell the reason behind it.

I thought about using regular expressions, but since I have not studied them yet, I do not know how to write one. Any ideas on how this could be done, with or without regex?
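The truncation can be reproduced in isolation. The following is a minimal sketch, not part of the original code; the two entries are assumed examples in the format described above rather than lines copied from the file. Splitting on a single space and keeping the first piece discards everything after the first word, whereas cutting at the marker keeps multi-word names whole.

```python
# Illustrative entries in the format of the question's list
# (assumed for this sketch, not copied from university_towns.txt).
state_line = "North Carolina[edit]"
town_line = "Chapel Hill (University of North Carolina)[2]"

# What the question's code does: keep only the text before the first space.
# The "[edit]" suffix is lost too, so the state line is no longer recognised.
print(state_line.split(" ")[0])  # North
print(town_line.split(" ")[0])   # Chapel

# Cutting at the marker instead keeps multi-word names intact.
state = state_line.split("[")[0]
town = town_line.split(" (")[0]
print(state)  # North Carolina
print(town)   # Chapel Hill
```

Because "North" no longer ends with "[edit]", the loop in the question treats it as a town of the previous state, which is why such states never appear as keys.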


1 Answer

Anonymous, 1 day ago

Behold the power of regular expressions:
<pre><code>import re

states_rx = re.compile(r'''
    ^
    (?P<state>.+?)\[edit\]
    (?P<cities>[\s\S]+?)
    (?=^.*\[edit\]$|\Z)
    ''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)

# lst_ is assumed to hold the list from the question
transformed = '\n'.join(lst_)

result = {state.group('state'): [city.group(0).rstrip()
                                 for city in cities_rx.finditer(state.group('cities'))]
          for state in states_rx.finditer(transformed)}
print(result)
</code></pre>
This yields a dictionary that maps each state name to the list of its towns.
<hr/>
<h3>Explanation:</h3>
The idea is to split the task into several smaller tasks:
<ol>
<li>Join the complete list with <code>\n</code></li>
<li>Separate the states</li>
<li>Separate the towns</li>
<li>Use a dict comprehension over all found items</li>
</ol>
<hr/>
First subtask: <code>transformed = '\n'.join(lst_)</code>

Second subtask
<pre><code>^                       # match start of the line
(?P<state>.+?)\[edit\]  # capture anything in that line up to [edit]
(?P<cities>[\s\S]+?)    # afterwards match anything up to
(?=^.*\[edit\]$|\Z)     # ... either another state or the very end of the string
</code></pre>
See <a href="https://regex101.com/r/ht9rTp/4" rel="nofollow noreferrer">the demo on regex101.com</a>.

Third subtask
<pre><code>^[^()\n]+  # match start of the line, anything not a newline character or ( or )
</code></pre>
See <a href="https://regex101.com/r/ht9rTp/2" rel="nofollow noreferrer">another demo on regex101.com</a>.

Fourth subtask
<pre><code>result = {state.group('state'): [city.group(0).rstrip()
                                 for city in cities_rx.finditer(state.group('cities'))]
          for state in states_rx.finditer(transformed)}
</code></pre>
This is roughly equivalent to:
<pre><code>result = {}
for state in states_rx.finditer(transformed):
    # the state name is in state.group('state')
    towns = []
    for city in cities_rx.finditer(state.group('cities')):
        # the city is in city.group(0), possibly with trailing whitespace,
        # hence the rstrip
        towns.append(city.group(0).rstrip())
    result[state.group('state')] = towns
</code></pre>
<hr/>
Finally, some timing:
<pre><code>import timeit

# findstatesandcities is assumed to be a function wrapping the code above
print(timeit.timeit(findstatesandcities, number=10**5))
# 12.234304904000965
</code></pre>
Running the above 100,000 times took about 12 seconds on my computer, so it should be reasonably fast.
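Since the question also asks about a way without regex, here is a minimal sketch of one possible alternative using only string methods. It is not the approach above; the function name <code>group_towns_by_state</code> is hypothetical, and the North Carolina entries in the sample are assumed for illustration. It relies only on the format stated in the question: state lines end with "[edit]", and town lines may carry a parenthesised university part and bracketed footnote numbers.

```python
def group_towns_by_state(lines):
    """Map each state name to the list of its towns.

    Assumes state lines end with "[edit]" and town lines may carry a
    " (university ...)" part and/or "[n]" footnote markers.
    """
    result = {}
    towns = None  # list belonging to the state currently being read
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline in the file
        if line.endswith("[edit]"):
            state = line[: -len("[edit]")]
            towns = result.setdefault(state, [])
        elif towns is not None:
            # Cut at the first "(" or "[", whichever comes first, so that
            # spaces inside the town name are preserved.
            name = line.split("(")[0].split("[")[0].strip()
            towns.append(name)
    return result


sample = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
    "North Carolina[edit]",                           # assumed entry
    "Chapel Hill (University of North Carolina)[2]",  # assumed entry
]
print(group_towns_by_state(sample))
# {'Alabama': ['Auburn', 'Florence'], 'Alaska': ['Fairbanks'], 'North Carolina': ['Chapel Hill']}
```

The same function should work on the lines read from the file, for example <code>open("university_towns.txt").read().split("\n")</code> as in the question.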
