Ignoring digits in strings

Posted 2024-10-03 09:07:59


I have an input file like this:

op.txt

          user id                        query
4d67373f-ca45-4137-efd0-0da69c78123d , bookmy show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
4d67373f-ca45-4137-efd0-0da69c78123d , book my show
7fda21a5-c432-4d95-f93d-6275b68bb396 , 8 gb pen drive
7fda21a5-c432-4d95-f93d-6275b68bb396 , 16 gb pen drive
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLATERS
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLAYERS
dba91160-dec4-454c-f34a-c29d6d95c459 , DVD PLAYERS
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPOD
dba91160-dec4-454c-f34a-c29d6d95c459 , IPAD
d900ec5f-bd71-4e2b-84d0-6a2105050923 , minoxidil
d900ec5f-bd71-4e2b-84d0-6a2105050923 , minoxidil 5
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia L
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia zr
775f1159-e310-42b6-d3b0-5ea3fb959568 , printed backcase for xperia zr
9b98a9be-bb63-4310-87d5-592a66ae602a , leggings
9b98a9be-bb63-4310-87d5-592a66ae602a , leggings
9b98a9be-bb63-4310-87d5-592a66ae602a , jeggings
83618338-70a0-4512-c763-0307fe5acef0 , woman jacket
83618338-70a0-4512-c763-0307fe5acef0 , woman jacket
83618338-70a0-4512-c763-0307fe5acef0 , man jacket
83618338-70a0-4512-c763-0307fe5acef0 , man jacket

From it I currently get the following output:

dvd platers >  dvd players
ipod >  ipad
bookmy show >  book my show
leggings >  jeggings
woman jacket >  man jacket
minoxidil >  minoxidil 5
printed backcase for xperia l >  printed backcase for xperia zr
8 gb pen drive >  16 gb pen drive

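The main goal is to collect all the queries issued by each user into a list. From that list I need to compute the edit distance between the queries, and print a pair when the distance is 2 or less. My code finds these pairs fine, but it should not react to changes in numbers; it should only compare the words. For example, if a user types "8 gb pen drive" and a little later changes their mind and types "16 gb pen drive", I do not want that pair printed.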

Here is my code:

def min_edit_dist(s1, s2):
    # Levenshtein distance via a dynamic-programming table keyed by (i, j)
    m = len(s1) + 1
    n = len(s2) + 1
    tbl = {}
    for i in range(m): tbl[i, 0] = i
    for j in range(n): tbl[0, j] = j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i, j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
    return tbl[m-1, n-1]

with open("op.txt") as text:
    d = {}
    for line in text:
        line = line.strip("\n")
        for lines in line.split("\n"):
            try:
                key, val = lines.split(",")
                d.setdefault(key, []).append(val.lower())
            except ValueError:
                # skip lines without exactly one comma, such as the header
                pass

values = d.values()
keys = d.keys()
for v in values:
    # compare each query with the next one from the same user
    for i in range(0, len(v)-1):
        if v[i] != v[i+1]:
            if min_edit_dist(v[i], v[i+1]) <= 2:
                print(v[i] + " > " + v[i+1])

I only need the following output:

dvd platers >  dvd players
ipod >  ipad
bookmy show >  book my show
leggings >  jeggings
woman jacket >  man jacket
printed backcase for xperia l >  printed backcase for xperia zr

1 answer
User
#1 · Posted 2024-10-03 09:07:59

You need to filter the value of val. Your code currently does:

key, val = lines.split(",")
d.setdefault(key,[]).append(val.lower())

To filter the digits out of the string, try:

key, val = lines.split(",")
val = ''.join(letter for letter in val if not letter.isdigit())  # filter out digit chars
d.setdefault(key,[]).append(val.lower())

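This runs a generator expression over each extracted val string and joins the characters that pass the filter. It is not a very efficient solution, but it should suit your needs.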
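One caveat with filtering digit characters alone: "minoxidil 5" becomes "minoxidil " with a trailing space, so it still differs from "minoxidil" and would still be printed. The following is a minimal sketch of one way around that, assuming Python 3. It normalizes a copy of each query for comparison only (digits removed, whitespace collapsed) and keeps the original text for printing. The helper names normalize and close_pairs are hypothetical, not part of the original code, and min_edit_dist is rewritten here to keep only one row of the table.

```python
import re

def normalize(query):
    """Comparison-only form: drop digits, collapse whitespace, lowercase."""
    without_digits = re.sub(r"\d+", "", query)
    return " ".join(without_digits.split()).lower()

def min_edit_dist(s1, s2):
    """Levenshtein distance, keeping only the previous row of the table."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            cur.append(min(cur[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def close_pairs(queries, max_dist=2):
    """Return (old, new) for consecutive queries whose words differ slightly."""
    pairs = []
    for old, new in zip(queries, queries[1:]):
        a, b = normalize(old), normalize(new)
        # a == b means the two queries differ only in digits or spacing
        if a != b and min_edit_dist(a, b) <= max_dist:
            pairs.append((old.strip().lower(), new.strip().lower()))
    return pairs

for old, new in close_pairs([" IPOD", " IPOD", " IPAD"]):
    print(old + " > " + new)
```

With this approach, pairs such as "8 gb pen drive" / "16 gb pen drive" and "minoxidil" / "minoxidil 5" normalize to the same string and are skipped, while the printed text keeps any digits the user typed. In the original script, close_pairs would be called once per list in d.values().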
