Parse a specific region of a txt file, compare against a list of strings, and build a new list from the matches

Posted 2024-06-01 06:43:06

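I am trying to do the following:

  1. Read through a specific section of a text file (with a known start and end point)
  2. While reading those lines, check whether a word matches one of the words in a list I have
  3. If a match is detected, add that specific word to a new list

I have been able to read through the text and pull the other data I need from it, but so far I have not managed to do the steps above.

I tried to implement the following example: Python - Search Text File For Any String In a List, but I could not get it to read correctly.

I also tried to adapt the following: https://www.geeksforgeeks.org/python-finding-strings-with-given-substring-in-list/ but was equally unsuccessful.

Here is some of my code: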

import re
from itertools import islice
import os

# list of all countries
oneCountries = "Afghanistan, Albania, Algeria, Andorra, Angola, Antigua & Deps, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina, Burma, Burundi, Cambodia, Cameroon, Canada, Cape Verde, Central African Rep, Chad, Chile, China, Republic of China, Colombia, Comoros, Democratic Republic of the Congo, Republic of the Congo, Costa Rica,, Croatia, Cuba, Cyprus, Czech Republic, Danzig, Denmark, Djibouti, Dominica, Dominican Republic, East Timor, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland, France, Gabon, Gaza Strip, The Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Holy Roman Empire, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Republic of Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jonathanland, Jordan, Kazakhstan, Kenya, Kiribati, North Korea, South Korea, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macedonia, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mount Athos, Mozambique, Namibia, Nauru, Nepal, Newfoundland, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, Norway, Oman, Ottoman Empire, Pakistan, Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Prussia, Qatar, Romania, Rome, Russian Federation, Rwanda, St Kitts & Nevis, St Lucia, Saint Vincent & the Grenadines, Samoa, San Marino, Sao Tome & Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, Spain, Sri Lanka, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad & Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe"
countries = oneCountries.split(",")

path = "C:/Users/me/Desktop/read.txt"
thefile = open(path, errors='ignore')

countryParsing = False
for line in thefile:
    line = line.strip()
#    if line.startswith("Submitting Author:"):
#    if re.match(r"Submitting Author:", line):
#        print("blahblah1")
#        countryParsing = True
#        if countryParsing == True:
#            print("blahblah2")
#            
#            res = [x for x in line if re.search(countries, x)]
#            print("blah blah3: " + str(res))
#    elif re.match(r"Running Head:", line):
#        countryParsing = False
#    if countryParsing == True:
#        res = [x for x in line if re.search(countries, x)]
#        print("blah blah4: " + str(res))


#        for x in countries:
#            if x in thefile:
#                print("a country is: " + x)
#        if any(s in line for s in countries):
#            listOfAuthorCountries = listOfAuthorCountries + s + ", "
#    if re.match(f"Submitting Author:, line"):

# The commented-out lines are versions of the code that I tried but could not get to work properly

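As requested, here is a sample of the text file I am trying to pull data from. I modified it to remove sensitive information, but in this case the "new list" should end up with a number of "France" entries appended: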

    txt above....
Submitting Author:

    asdf, asdf  (proxy)
    France
    asdfasdf
    blah blah
    asdfasdf

    asdf, Provence-Alpes-Côte d'Azu 13354
    France

    blah blah
    France
    asdf
Running Head:
    ...more text below

2 Answers

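I think your main problem is that in oneCountries the country names are separated by a comma plus a space, but you split on the comma alone, so the second entry in countries is " Albania", with a leading space. You need to change:

oneCountries.split(",")

to:

oneCountries.split(", ")

After that, your commented-out code seems to contain enough useful pieces to achieve what you want.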
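The difference is easy to see on a small input. The following is a minimal sketch, assuming a shortened stand-in string (`sample` and the other variable names are illustrative, not from the question) that keeps the doubled comma after "Costa Rica" found in the original list:

```python
# Shortened stand-in for oneCountries; note the doubled comma after "Costa Rica".
sample = "Afghanistan, Albania, Costa Rica,, Croatia"

by_comma = sample.split(",")
by_comma_space = sample.split(", ")

# Splitting on "," leaves leading spaces and an empty entry.
print(by_comma)        # ['Afghanistan', ' Albania', ' Costa Rica', '', ' Croatia']
# Splitting on ", " fixes the spaces but keeps the stray comma.
print(by_comma_space)  # ['Afghanistan', 'Albania', 'Costa Rica,', 'Croatia']

# Stripping each piece and dropping empty ones handles both problems.
cleaned = [name.strip() for name in sample.split(",") if name.strip()]
print(cleaned)         # ['Afghanistan', 'Albania', 'Costa Rica', 'Croatia']
```

Splitting on "," and then stripping is the more forgiving choice if the list might contain inconsistent spacing or stray commas.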

Based on the three points you listed as your goal, and on what I understand from your code (which may not be what you want), I suggest:

# list of all countries
countries = "Afghanistan, Albania, Algeria, Andorra, Angola, Antigua & Deps, Argentina, Armenia, Australia, Austria, Azerbaijan, Bahamas, Bahrain, Bangladesh, Barbados, Belarus, Belgium, Belize, Benin, Bhutan, Bolivia, Bosnia Herzegovina, Botswana, Brazil, Brunei, Bulgaria, Burkina, Burma, Burundi, Cambodia, Cameroon, Canada, Cape Verde, Central African Rep, Chad, Chile, China, Republic of China, Colombia, Comoros, Democratic Republic of the Congo, Republic of the Congo, Costa Rica, Croatia, Cuba, Cyprus, Czech Republic, Danzig, Denmark, Djibouti, Dominica, Dominican Republic, East Timor, Ecuador, Egypt, El Salvador, Equatorial Guinea, Eritrea, Estonia, Ethiopia, Fiji, Finland, France, Gabon, Gaza Strip, The Gambia, Georgia, Germany, Ghana, Greece, Grenada, Guatemala, Guinea, Guinea-Bissau, Guyana, Haiti, Holy Roman Empire, Honduras, Hungary, Iceland, India, Indonesia, Iran, Iraq, Republic of Ireland, Israel, Italy, Ivory Coast, Jamaica, Japan, Jonathanland, Jordan, Kazakhstan, Kenya, Kiribati, North Korea, South Korea, Kosovo, Kuwait, Kyrgyzstan, Laos, Latvia, Lebanon, Lesotho, Liberia, Libya, Liechtenstein, Lithuania, Luxembourg, Macedonia, Madagascar, Malawi, Malaysia, Maldives, Mali, Malta, Marshall Islands, Mauritania, Mauritius, Mexico, Micronesia, Moldova, Monaco, Mongolia, Montenegro, Morocco, Mount Athos, Mozambique, Namibia, Nauru, Nepal, Newfoundland, Netherlands, New Zealand, Nicaragua, Niger, Nigeria, Norway, Oman, Ottoman Empire, Pakistan, Palau, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Poland, Portugal, Prussia, Qatar, Romania, Rome, Russian Federation, Rwanda, St Kitts & Nevis, St Lucia, Saint Vincent & the Grenadines, Samoa, San Marino, Sao Tome & Principe, Saudi Arabia, Senegal, Serbia, Seychelles, Sierra Leone, Singapore, Slovakia, Slovenia, Solomon Islands, Somalia, South Africa, Spain, Sri Lanka, Sudan, Suriname, Swaziland, Sweden, Switzerland, Syria, Tajikistan, Tanzania, Thailand, Togo, Tonga, Trinidad & Tobago, Tunisia, Turkey, Turkmenistan, Tuvalu, Uganda, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Uzbekistan, Vanuatu, Vatican City, Venezuela, Vietnam, Yemen, Zambia, Zimbabwe"
countries = countries.split(",")
countries = [c.strip() for c in countries]

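filename = "read.txt"
my_other_list = []
toParse = False
# Use a context manager so the file is closed once the loop finishes
with open(filename, errors='ignore') as filehandle:
    for line in filehandle:
        line = line.strip()
        # Start collecting after the "Submitting Author:" marker
        if line.startswith("Submitting Author:"):
            toParse = True
            continue
        # Stop collecting at the "Running Head:" marker
        elif line.startswith("Running Head:"):
            toParse = False
            continue
        elif toParse:
            # Record every country name that appears in this line
            for c in countries:
                if c in line:
                    my_other_list.append(c)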

Edit summary

  1. Adjusted the code to handle the provided text sample

  2. Fixed the country list (originally there were two commas after Costa Rica)
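
One caveat with the `c in line` test is that it matches substrings, so a line containing "Nigeria" would also add "Niger", and "Dominican Republic" would also add "Dominica". If that matters, whole-word matching with `re` is one option. The following is a minimal sketch and not part of the original answer; the function name `find_countries`, the shortened `countries` list, and the sample lines are all hypothetical:

```python
import re

# Shortened, illustrative list; the real one would hold every country name.
countries = ["France", "Niger", "Nigeria", "Dominica", "Dominican Republic"]

# Longest names first so "Dominican Republic" is tried before "Dominica".
alternatives = sorted(countries, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(c) for c in alternatives) + r")\b")

def find_countries(lines, start="Submitting Author:", end="Running Head:"):
    """Collect whole-word country matches between the start and end markers."""
    found = []
    parsing = False
    for line in lines:
        line = line.strip()
        if line.startswith(start):
            parsing = True
        elif line.startswith(end):
            parsing = False
        elif parsing:
            found.extend(pattern.findall(line))
    return found

sample = [
    "France",               # before the start marker: ignored
    "Submitting Author:",
    "asdf, asdf  (proxy)",
    "France",
    "Lagos, Nigeria",
    "Dominican Republic",
    "Running Head:",
    "France",               # after the end marker: ignored
]
print(find_countries(sample))  # ['France', 'Nigeria', 'Dominican Republic']
```

Unlike the loop above, `findall` records every occurrence on a line, so a line that names the same country twice contributes two entries.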