Python: removing duplicate names

Posted 2024-09-23 16:22:39


I have a plain text file with one entry per line:

3210    <DOCID>GH950102-000003<DOCID>/O
  3243  Australia/LOCATION
  3360  England/LOCATION
  3414  India/LOCATION
  3474  Melbourne/LOCATION
  3497  England/LOCATION
  3521  >India<TOPONYM>/O
  3526  >Zimbabwe<TOPONYM>/O
  3531  >England<TOPONYM>/O
  3536  >Melbourne<TOPONYM>/O
  3541  >England<TOPONYM>/O
  3546  >England<TOPONYM>/O
  3551  >Glasgow<TOPONYM>/O
  3556  >England<TOPONYM>/O
  3561  >England<TOPONYM>/O
  3566  >Australia<TOPONYM>/O
3568    <DOCID>GH950102-000004<DOCID>/O
  3739  Hampden/LOCATION
  3821  Hampden/LOCATION
  3838  Ibrox/LOCATION
  3861  Neerday/LOCATION
  4161  Fir Park/LOCATION
  4229  Park<TOPONYM>/O
  4234  >Hampden<TOPONYM>/O
  4239  >Hampden<TOPONYM>/O
  4244  >Midfield<TOPONYM>/O
  4249  >Glasgow<TOPONYM>/O
  4251  <DOCID>GH950102-000005<DOCID>/O
  4535  Edinburgh/LOCATION
  4840  Road<TOPONYM>/O
  4845  >Edinburgh<TOPONYM>/O
  4850  >Glasgow<TOPONYM>/O

I want to remove the repeated location names from this list, so that it looks like this:

3210    <DOCID>GH950102-000003<DOCID>/O
  3243  Australia/LOCATION
  3360  England/LOCATION
  3414  India/LOCATION
  3474  Melbourne/LOCATION
  3497  England/LOCATION
  3526  >Zimbabwe<TOPONYM>/O
  3551  >Glasgow<TOPONYM>/O
3568    <DOCID>GH950102-000004<DOCID>/O
  3739  Hampden/LOCATION
  3838  Ibrox/LOCATION
  3861  Neerday/LOCATION
  4161  Fir Park/LOCATION
  4229  Park<TOPONYM>/O
  4244  >Midfield<TOPONYM>/O
  4249  >Glasgow<TOPONYM>/O
  4251  <DOCID>GH950102-000005<DOCID>/O
  4535  Edinburgh/LOCATION
  4840  Road<TOPONYM>/O
  4850  >Glasgow<TOPONYM>/O

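I want to remove the duplicate location names, while the docid lines should stay in the file. I know this can be done on Linux with uniq, but if I run uniq it also removes locations that belong to different docids. Is there a way to split the file by docid and, within each docid, remove a location name if it is repeated?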


Tags: name, park, location, docid, india, australia, edinburgh, melbourne

2 answers

Here is one way to do it.

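import string

filename = 'testfile'
exclude = set(string.punctuation)  # punctuation is turned into spaces below

with open(filename, 'r') as f:  # close the file once it has been read
    lines = f.readlines()

final_list = []
seen = set()  # names seen so far; reset at every docid
for line in lines:
    if 'DOCID' in line:
        seen = set()  # new docid: forget the previous names
        final_list.append(line)
    else:
        # e.g. '  3531  >England<TOPONYM>/O' -> ['3531', 'England', 'TOPONYM', 'O']
        fields = ''.join(ch if ch not in exclude else ' ' for ch in line).split()
        if len(fields) < 2:
            continue  # blank or malformed line: nothing to compare
        city = fields[1]  # first word of the name only ('Fir' for 'Fir Park')
        if city not in seen:
            seen.add(city)
            final_list.append(line)

# Each line still ends with '\n', so print() leaves a blank line after it;
# use print(line, end='') to avoid that.
for line in final_list:
    print(line)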

Output:

3210    <DOCID>GH950102-000003<DOCID>/O

  3243  Australia/LOCATION

  3360  England/LOCATION

  3414  India/LOCATION

  3474  Melbourne/LOCATION

  3526  >Zimbabwe<TOPONYM>/O

  3551  >Glasgow<TOPONYM>/O

3568    <DOCID>GH950102-000004<DOCID>/O

  3739  Hampden/LOCATION

  3838  Ibrox/LOCATION

  3861  Neerday/LOCATION

  4161  Fir Park/LOCATION

  4229  Park<TOPONYM>/O

  4244  >Midfield<TOPONYM>/O

  4249  >Glasgow<TOPONYM>/O

  4251  <DOCID>GH950102-000005<DOCID>/O

  4535  Edinburgh/LOCATION

  4840  Road<TOPONYM>/O

  4850  >Glasgow<TOPONYM>/O

Note: testfile is a text file containing the input text. The code can be optimized if needed.
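As the note says, the loop can be tightened. Below is a minimal sketch of one way to do that, not part of the original answer. It assumes the layout shown in the question: a number, whitespace, an optional `>`, then a name that ends at the first `<` or `/`. The function name `dedupe_by_docid` and the pattern `NAME_RE` are illustrative. Unlike the code above, it compares the whole name, so `Fir Park` and `Park` stay distinct.

```python
import re

# Assumed line layout: number, whitespace, optional '>', then the name,
# which ends at the first '<' or '/'.
NAME_RE = re.compile(r"^\s*\d+\s+>?([^<>/]+)")


def dedupe_by_docid(lines):
    """Return the lines with repeated names removed inside each DOCID block."""
    kept = []
    seen = set()  # names already kept in the current DOCID block
    for line in lines:
        if "<DOCID>" in line:
            seen = set()  # a new block starts
            kept.append(line)
            continue
        match = NAME_RE.match(line)
        if match is None:
            kept.append(line)  # unrecognized lines pass through unchanged
            continue
        name = match.group(1).strip()
        if name not in seen:
            seen.add(name)
            kept.append(line)
    return kept


sample = [
    "3210    <DOCID>GH950102-000003<DOCID>/O\n",
    "  3360  England/LOCATION\n",
    "  3531  >England<TOPONYM>/O\n",
    "  3551  >Glasgow<TOPONYM>/O\n",
    "3568    <DOCID>GH950102-000004<DOCID>/O\n",
    "  4161  Fir Park/LOCATION\n",
    "  4229  Park<TOPONYM>/O\n",
    "  4249  >Glasgow<TOPONYM>/O\n",
]
result = dedupe_by_docid(sample)
print("".join(result), end="")
```

Because the function takes any iterable of lines, it can be fed an open file object directly, for example `dedupe_by_docid(open('testfile'))`.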

I wrote this on my phone, so it is not a complete solution, but the key idea is:

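import re

# Raw strings keep the backslashes intact; \s+ also matches tabs between columns.
Docid = re.compile(r"^\s*\d+\s+<DOCID>")
# Capture the name up to the first '<' or '/', skipping an optional leading '>'.
Location = re.compile(r"^\s*\d+\s+>?([^<>/]+)")

seen = set()  # names already printed for the current docid
with open("testfile") as f:  # assumed file name, as in the first answer
    for line in f:
        if Docid.match(line):
            seen = set()  # new docid: forget the previous names
            print(line, end="")
            continue
        m = Location.match(line)
        if m and m.group(1) in seen:
            continue  # duplicate name inside this docid
        print(line, end="")
        if m:
            seen.add(m.group(1))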

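Basically, it checks whether each line starts a new docid. If it does not, it tries to read the location and checks whether that location has already been seen. If it has not, it prints the line and adds the location to the collection of locations already read.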

When a new docid appears, it resets the collection of locations already read.
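The reset-per-docid idea can also be written as two separate steps: split the lines into blocks, one per docid, then deduplicate each block on its own. This is only a sketch under the same layout assumption (number, whitespace, optional `>`, name ending at `<` or `/`). The helper names `split_blocks` and `dedupe_block` are hypothetical.

```python
import re

# Assumed layout: number, whitespace, optional '>', name up to '<' or '/'.
NAME_RE = re.compile(r"^\s*\d+\s+>?([^<>/]+)")


def split_blocks(lines):
    """Yield lists of lines; a new list starts at every DOCID line."""
    block = []
    for line in lines:
        if "<DOCID>" in line and block:
            yield block
            block = []
        block.append(line)
    if block:
        yield block


def dedupe_block(block):
    """Keep the first line for each name; DOCID and unmatched lines are kept."""
    seen = set()
    kept = []
    for line in block:
        match = None if "<DOCID>" in line else NAME_RE.match(line)
        if match is not None:
            name = match.group(1).strip()
            if name in seen:
                continue  # repeated name inside this block
            seen.add(name)
        kept.append(line)
    return kept


sample = [
    "3568    <DOCID>GH950102-000004<DOCID>/O\n",
    "  3739  Hampden/LOCATION\n",
    "  3821  Hampden/LOCATION\n",
    "  4249  >Glasgow<TOPONYM>/O\n",
    "  4251  <DOCID>GH950102-000005<DOCID>/O\n",
    "  4535  Edinburgh/LOCATION\n",
    "  4840  Road<TOPONYM>/O\n",
    "  4845  >Edinburgh<TOPONYM>/O\n",
    "  4850  >Glasgow<TOPONYM>/O\n",
]
blocks = list(split_blocks(sample))
result = [line for block in blocks for line in dedupe_block(block)]
print("".join(result), end="")
```

Splitting first keeps each block independent, so `Glasgow` survives once in every docid even though it appears in several of them.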
