Algorithm for searching text in a corrupted file

Posted 2024-09-28 20:59:11


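I have to search for certain tags in a corrupted text file. Because the file is corrupted, the data has been altered (some characters were deleted, some were modified). For example, I have to search for the tag "Number of pages".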

Text file 1:

BHASKAR RAO MUKKU (57)Abstract In this system 2 pedal rods with pedals, one side balls based axle, hollowed secondary axle, counter axle, two splined gear wheels which has two clutch pin holes on circular pitch, two splined gear wheels which has ratchet gears on circular pitch, sprocket wheel, four clutch pins and a liver used to convert the ordinary bicycle into the gear bicycle. Number page : 10

Text file 2:

BHASKAR RAO MUKKU (57)Abstract In this system 2 pedal rods with pedals, one side balls based axle, hollowed secondary axle, counter axle, two splined gear wheels which has two clutch pin holes on circular pitch, two splined gear wheels which has ratchet gears on circular pitch, sprocket wheel, four clutch pins and a liver used to convert the ordinary bicycle into the gear bicycle. No. of pages: 10

Text file 3:

BHASKAR RAO MUKKU (57)Abstract In this system 2 pedal rods with pedals, one side balls based axle, hollowed secondary axle, counter axle, two splined gear wheels which has two clutch pin holes on circular pitch, two splined gear wheels which has ratchet gears on circular pitch, sprocket wheel, four clutch pins and a liver used to convert the ordinary bicycle into the gear bicycle. No of pages: 10

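Above are some sample text files. As you can see, in all of these files the word NUMBER has been altered, into three different forms. For all three files, my code must output the corresponding bolded words.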

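So far, I have been computing the longest common subsequence between the tag and each contiguous substring of the text file whose length is roughly equal to the tag's length. I then calculate the fraction of matching characters, and if that fraction is greater than 85%, my code outputs that substring.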

My code:

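def lcs(S, T):
    # Length of the longest common subsequence of S and T (dynamic programming).
    m = len(S)
    n = len(T)
    counter = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if S[i] == T[j]:
                counter[i + 1][j + 1] = counter[i][j] + 1
            else:
                counter[i + 1][j + 1] = max(counter[i + 1][j], counter[i][j + 1])
    return counter[m][n]


def modify(s):
    # Not shown in the original post; assumed here to normalize the text
    # (lowercase, whitespace removed) before comparison.
    return "".join(s.lower().split())


def match(word, tag):
    # Similarity in [0, 1]: LCS length divided by the longer normalized length.
    word = modify(word)
    tag = modify(tag)
    return float(lcs(word, tag)) / max(len(word), len(tag))


# string holds the contents of the text file; tag is the tag to search for.
start = end = 0  # position of the matched tag in string (inclusive indices)
p = 0.85  # minimum similarity; raised to the best score found so far
i = 0
while i < len(string):
    j = i
    # Try every candidate substring starting at i, up to len(tag) + 7 characters.
    while j < min(i + len(tag) + 7, len(string)):
        arr = match(string[i:j + 1], tag)
        if arr > p:
            p = arr
            start = i
            end = j
        elif arr == p and end - start >= j - i:
            # On a tie, keep the candidate that is no longer than the current one.
            start = i
            end = j
        j += 1
    i += 1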
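
To make the scoring concrete, here is a small self-contained sketch that applies the same LCS-based ratio to the three spellings from the sample files. It rests on assumptions that are not in the post: the tag is taken to be spelled "Number of pages", and `normalize` is a hypothetical stand-in for the undefined `modify` (lowercase, whitespace removed). The real numbers depend on the actual tag and on what `modify` does.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t (row-by-row DP)."""
    prev = [0] * (len(t) + 1)
    for ch in s:
        cur = [0]
        for j, other in enumerate(t):
            if ch == other:
                cur.append(prev[j] + 1)
            else:
                cur.append(max(cur[j], prev[j + 1]))
        prev = cur
    return prev[len(t)]


def normalize(s):
    # Hypothetical stand-in for the post's modify(): lowercase, drop whitespace.
    return "".join(s.lower().split())


def score(candidate, tag):
    """LCS length divided by the longer of the two normalized strings."""
    a, b = normalize(candidate), normalize(tag)
    return lcs_length(a, b) / max(len(a), len(b))


TAG = "Number of pages"  # assumed spelling of the tag
for variant in ("Number page", "No. of pages", "No of pages"):
    print(variant, round(score(variant, TAG), 3))
# Number page 0.769
# No. of pages 0.615
# No of pages 0.615
```

Under these assumptions, even the exact damaged spelling scores below 0.85, because the score divides by the length of the longer string: every deleted character lowers it, so a tag that lost three or more of its 13 normalized characters can never pass the threshold.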

However, this code fails in many cases, such as text file 1. Is there another way to search more accurately and efficiently?
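
One possible direction, offered only as a sketch and not as a definitive answer: instead of scanning every character offset, compare the tag against windows of whole words and score each window with `difflib.SequenceMatcher` from the Python standard library. The function name `find_tag`, the `min_ratio=0.7` threshold, and the tag spelling "Number of pages" are all assumptions made for illustration; the threshold in particular would need tuning on real data.

```python
import difflib
import re


def find_tag(text, tag, min_ratio=0.7):
    """Return (ratio, start, end) for the word window most similar to tag,
    or None if no window reaches min_ratio. Hypothetical helper."""
    # Word tokens with their character offsets in text.
    words = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    n_tag = len(tag.split())
    target = tag.lower()
    best = None
    # Allow one word fewer or one word more than the tag, since damage
    # may have deleted or split a word.
    for size in range(max(1, n_tag - 1), n_tag + 2):
        for k in range(len(words) - size + 1):
            start = words[k][0]
            end = words[k + size - 1][1]
            candidate = text[start:end].lower()
            ratio = difflib.SequenceMatcher(None, candidate, target).ratio()
            if best is None or ratio > best[0]:
                best = (ratio, start, end)
    if best is not None and best[0] >= min_ratio:
        return best
    return None


sample = "into the gear bicycle. Number page : 10"
found = find_tag(sample, "Number of pages")
if found is not None:
    ratio, start, end = found
    print(sample[start:end])
```

Working on word windows cuts the number of candidates from roughly one per character pair to a few per word, and `SequenceMatcher.ratio()` normalizes by the combined length of both strings, which penalizes deleted characters less harshly than dividing by the longer string.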

