Removing duplicate names in Python

3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3521 >India<TOPONYM>/O
3526 >Zimbabwe<TOPONYM>/O
3531 >England<TOPONYM>/O
3536 >Melbourne<TOPONYM>/O
3541 >England<TOPONYM>/O
3546 >England<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3556 >England<TOPONYM>/O
3561 >England<TOPONYM>/O
3566 >Australia<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3821 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4234 >Hampden<TOPONYM>/O
4239 >Hampden<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4845 >Edinburgh<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O

3210 <DOCID>GH950102-000003<DOCID>/O
3243 Australia/LOCATION
3360 England/LOCATION
3414 India/LOCATION
3474 Melbourne/LOCATION
3497 England/LOCATION
3526 >Zimbabwe<TOPONYM>/O
3551 >Glasgow<TOPONYM>/O
3568 <DOCID>GH950102-000004<DOCID>/O
3739 Hampden/LOCATION
3838 Ibrox/LOCATION
3861 Neerday/LOCATION
4161 Fir Park/LOCATION
4229 Park<TOPONYM>/O
4244 >Midfield<TOPONYM>/O
4249 >Glasgow<TOPONYM>/O
4251 <DOCID>GH950102-000005<DOCID>/O
4535 Edinburgh/LOCATION
4840 Road<TOPONYM>/O
4850 >Glasgow<TOPONYM>/O

2 answers

User

Answer 1 · edited 2024-09-23 16:22:39

Here is one approach.

import string

filename = 'testfile'
exclude = set(string.punctuation)  # characters to treat as separators

final_list = []
unique_list = set()  # names already seen in the current document
with open(filename, 'r') as f:
    for line in f:
        if 'DOCID' in line:
            unique_list = set()  # reset at every new DOCID
            final_list.append(line)
            continue
        # Replace punctuation with spaces, then split into words.
        cleaned = ''.join(ch if ch not in exclude else ' ' for ch in line)
        tokens = cleaned.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        city = tokens[1]  # first word after the leading offset
        if city not in unique_list:
            unique_list.add(city)
            final_list.append(line)

# Each line keeps its trailing newline, so print() leaves a blank line between entries.
for line in final_list:
    print(line)

Output:

3210    <DOCID>GH950102-000003<DOCID>/O

  3243  Australia/LOCATION

  3360  England/LOCATION

  3414  India/LOCATION

  3474  Melbourne/LOCATION

  3526  >Zimbabwe<TOPONYM>/O

  3551  >Glasgow<TOPONYM>/O

3568    <DOCID>GH950102-000004<DOCID>/O

  3739  Hampden/LOCATION

  3838  Ibrox/LOCATION

  3861  Neerday/LOCATION

  4161  Fir Park/LOCATION

  4229  Park<TOPONYM>/O

  4244  >Midfield<TOPONYM>/O

  4249  >Glasgow<TOPONYM>/O

  4251  <DOCID>GH950102-000005<DOCID>/O

  4535  Edinburgh/LOCATION

  4840  Road<TOPONYM>/O

  4850  >Glasgow<TOPONYM>/O

Note: testfile is a text file containing the input text. The code can be optimized if needed.
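To see which key the loop above actually compares, here is a small sketch of the extraction step on its own. `name_key` is a hypothetical helper name used only for illustration; it is not part of the answer's code. It also shows a limitation: a multi-word name such as "Fir Park" is reduced to its first word.

```python
import string

def name_key(line):
    """Return the first word after the offset, treating punctuation as spaces."""
    cleaned = ''.join(ch if ch not in string.punctuation else ' ' for ch in line)
    tokens = cleaned.split()
    # Lines with fewer than two words (e.g. blank lines) have no name.
    return tokens[1] if len(tokens) > 1 else None

print(name_key("3243 Australia/LOCATION"))  # Australia
print(name_key("3521 >India<TOPONYM>/O"))   # India
print(name_key("4161 Fir Park/LOCATION"))   # Fir
```

Because LOCATION and TOPONYM lines for the same place yield the same key, a TOPONYM line is dropped when the same name has already appeared as a LOCATION in that document.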

User

Answer 2 · edited 2024-09-23 16:22:39

I'm writing this on my phone, so it isn't a complete solution, but the key idea is:

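import re

# Lines that start a new document, e.g. "3210 <DOCID>GH950102-000003<DOCID>/O"
Docid = re.compile(r"^\s*\d+\s+<DOCID>")
# Offset, optional ">", then the name up to the first "<" or "/"
Location = re.compile(r"^\s*\d+\s+>?([^</]+)")

Lines = set()  # locations already printed for the current document
with open('testfile') as file:  # 'testfile' is an assumed input file name
    for line in file:
        if Docid.match(line):
            Lines = set()  # new document: forget the locations seen so far
            print(line, end='')
        else:
            m = Location.match(line)
            if m is None:
                continue  # blank or malformed line
            loc = m.group(1).strip()
            if loc not in Lines:
                print(line, end='')
                Lines.add(loc)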

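Basically, it checks whether each line starts a new docid. If it does not, it tries to read the location and checks whether that location has already been read. If it has not, it prints the line and adds the location to the collection of locations read so far.

If there is a new docid, it resets the locations read so far.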
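The steps described above can also be wrapped in a function that takes any iterable of lines and returns the ones to keep, which makes the logic easy to try without a file. This is only a sketch: `dedupe_per_document`, `DOCID_RE` and `NAME_RE` are hypothetical names, and the patterns assume every line has the form "offset, optional `>`, name, then `<` or `/`".

```python
import re

# Assumed line formats, based on the sample data in the question.
DOCID_RE = re.compile(r"^\s*\d+\s+<DOCID>")
NAME_RE = re.compile(r"^\s*\d+\s+>?([^</]+)")

def dedupe_per_document(lines):
    """Keep DOCID lines and the first occurrence of each name per document."""
    kept = []
    seen = set()
    for line in lines:
        if DOCID_RE.match(line):
            seen = set()  # a new document starts: reset the seen names
            kept.append(line)
            continue
        m = NAME_RE.match(line)
        if m is None:
            continue  # skip lines that do not look like an entry
        name = m.group(1).strip()
        if name not in seen:
            seen.add(name)
            kept.append(line)
    return kept

sample = [
    "3568 <DOCID>GH950102-000004<DOCID>/O",
    "3739 Hampden/LOCATION",
    "3821 Hampden/LOCATION",
    "4161 Fir Park/LOCATION",
    "4229 Park<TOPONYM>/O",
    "4234 >Hampden<TOPONYM>/O",
    "4251 <DOCID>GH950102-000005<DOCID>/O",
    "4535 Edinburgh/LOCATION",
    "4845 >Edinburgh<TOPONYM>/O",
]
for line in dedupe_per_document(sample):
    print(line)
```

With this pattern the whole name before `<` or `/` is the key, so "Fir Park" and "Park" count as different names, and the second "Hampden" and the TOPONYM repeats of "Hampden" and "Edinburgh" are dropped.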
