最长公共序列群

2024-10-02 00:32:03 发布

您现在位置:Python中文网/ 问答频道 /正文

给出以下几行文字

TOKYO-BLING.1 H02-AVAILABLE
TOKYO-BLING.1 H02-MIDDLING
TOKYO-BLING.1 H02-TOP
TOKYO-BLING.2 H04-USED
TOKYO-BLING.2 H04-AVAILABLE
TOKYO-BLING.2 H04-CANCELLED
WAY-VERING.1 H03-TOP
WAY-VERING.2 H03-USED
WAY-VERING.2 H03-AVAILABLE
WAY-VERING.1 H03-CANCELLED

我想做一些解析来生成一些合理的分组。上面的列表可以按如下方式分组

^{pr2}$

有谁能提出一种算法(或某种方法),可以扫描给定数量的文本,并得出文本可以按上述方式分组吗。显然每个小组都可以走得更远。我想我正在寻找一个好的解决方案来查看一个短语列表,并找出如何最好地按一些常见的字符串序列对它们进行分组。在


Tags: 文本列表top方式wayusedavailable文字
3条回答

有一种方法:

  1. 对条目进行排序
  2. 确定每个条目之间公共前缀的长度
  3. 通过在公共前缀比前一个条目短的位置分隔列表,将条目分组

实施示例:

def common_count(t0, t1):
  "returns the length of the longest common prefix"
  for i, pair in enumerate(zip(t0, t1)):
    if pair[0] != pair[1]:
      return i
  return i

def group_by_longest_prefix(iterable):
  "given a sorted list of strings, group by longest common prefix"
  longest = 0
  out = []

  for t in iterable:
    if out: # if there are previous entries 

      # determine length of prefix in common with previous line
      common = common_count(t, out[-1])

      # if the current entry has a shorted prefix, output previous 
      # entries as a group then start a new group
      if common < longest:
        yield out
        longest = 0
        out = []
      # otherwise, just update the target prefix length
      else:
        longest = common

    # add the current entry to the group
    out.append(t)

  # return remaining entries as the last group
  if out:
    yield out

用法示例:

^{pr2}$

这会产生:

['TOKYO-BLING.1 H02-AVAILABLE', 'TOKYO-BLING.1 H02-MIDDLING', 'TOKYO-BLING.1 H02-TOP']
['TOKYO-BLING.2 H04-AVAILABLE', 'TOKYO-BLING.2 H04-CANCELLED', 'TOKYO-BLING.2 H04-USED']
['WAY-VERING.1 H03-CANCELLED', 'WAY-VERING.1 H03-TOP']
['WAY-VERING.2 H03-AVAILABLE', 'WAY-VERING.2 H03-USED']

在这里看到它的作用:http://ideone.com/1Da0S

您可以用空格分隔每个字符串,然后生成一个dict。在

我是这样做的:

f = open( 'hotels.txt', 'r' )   # read the data
f = f.readlines()               # convert to a list of strings (with newlines)
f = [ i.strip() for i in f ]    # take off the newlines
h = [ i.split(' ') for i in f ] # split using whitespace
                                # now h is a list of lists of strings

keys = [ i[0] for i in h ]      # keys = ['TOKYO-BLING.1','TOKYO-BLING.1',...]
keys = list( set( keys ) )      # take out redundant elements

d = dict()                      # start a dict
for i in keys:                  # initialize dict with empty lists
    d[i] = list()               # (one for each key)

for i in h:                     # for each list in h, append a suffix
    d[i[0]].append(i[1])        # to the appropriate prefix (or key)

这会产生:

^{pr2}$

相关问题 更多 >

    热门问题