How do I parse marked-up text for further processing?

3 Answers

User

Answer 1 · edited 2024-09-25 00:23:57

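Since you are dealing with an outline, you can simplify things by using a stack. Basically, you keep a stack of dicts whose height corresponds to the current depth of the outline. When you parse a new line and the depth has increased, you create a new dict, attach it to the dict currently at the top of the stack, and push it. When you parse a line with a lower depth, you pop the stack to get back to the parent. When you encounter a line at the same depth, you add it to the same parent as the previous line.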
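The stack approach described above can be sketched roughly as follows. This is only one possible reading of it: the function name `parse_outline` and the `(indent, node)` stack entries are assumptions made for illustration, and the sketch handles only `*` items (notes marked with `-` are left out for brevity).

```python
def parse_outline(lines):
    """Parse '*'-marked outline lines into nested dicts using a stack."""
    root = {"title": None, "children": []}
    # Each stack entry is (indent, node); the root sits below every real indent.
    stack = [(-1, root)]
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        indent = len(line) - len(line.lstrip())
        title = line.strip().lstrip("*").strip()
        # Same or lower depth: pop until the top of the stack is the parent.
        while stack[-1][0] >= indent:
            stack.pop()
        node = {"title": title, "children": []}
        stack[-1][1]["children"].append(node)
        # Deeper lines that follow will attach to this node.
        stack.append((indent, node))
    return root["children"]


outline = ["* 1", " * 1.1", " * 1.2", "* 2"]
print(parse_outline(outline))
```

Storing the indent next to each node means the parent lookup never depends on how many spaces one level uses, only on whether a line is indented more or less than the entries already on the stack.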

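User

Answer 2 · edited 2024-09-25 00:23:57

Edit: following the clarifications and changes to the spec, I have edited my code. It still uses an explicit Node class as an intermediate step, for clarity. The logic is to turn the list of lines into a list of nodes, then turn that list of nodes into a tree (by making appropriate use of their indent attribute), then print that tree in a readable form (this is only a debugging aid to check that the tree is well formed; it can of course be commented out in the final version of the script, which will also read the lines from a file rather than hard-coding them), and finally build the desired Python structure and print it. As we will see after the code, the result is almost what the OP specified, with one exception. First, the code: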

import sys

class Node(object):
    """One '*' item of the outline, with its notes and child items."""
    def __init__(self, title, indent):
        self.title = title
        self.indent = indent
        self.children = []
        self.notes = []
        self.parent = None
    def __repr__(self):
        return 'Node(%s, %s, %r, %s)' % (
            self.indent, self.parent, self.title, self.notes)
    def aspython(self):
        # Plain dict form; the 'notes' key is present only if there are notes.
        result = dict(title=self.title, children=topython(self.children))
        if self.notes:
            result['notes'] = self.notes
        return result

def print_tree(node):
    # Debugging aid: show the tree with one space of indentation per level.
    print(' ' * node.indent, node.title)
    for subnode in node.children:
        print_tree(subnode)
    for note in node.notes:
        print(' ' * node.indent, 'Note:', note)

def topython(nodelist):
    return [node.aspython() for node in nodelist]

def lines_to_tree(lines):
    # Step 1: turn the lines into a flat list of nodes.
    nodes = []
    for line in lines:
        indent = len(line) - len(line.lstrip())
        marker, body = line.strip().split(None, 1)
        if marker == '*':
            nodes.append(Node(body, indent))
        elif marker == '-':
            # A note is attached to the most recently seen item.
            nodes[-1].notes.append(body)
        else:
            print("Invalid marker %r" % marker, file=sys.stderr)

    # Step 2: link the nodes into a tree under an artificial root at indent -1.
    tree = Node('', -1)
    curr = tree
    for node in nodes:
        # Climb until curr is shallower than node, i.e. its parent.
        while node.indent <= curr.indent:
            curr = curr.parent
        node.parent = curr
        curr.children.append(node)
        curr = node

    return tree


data = """\
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root
""".splitlines()

def main():
    tree = lines_to_tree(data)
    print_tree(tree)
    print()
    alist = topython(tree.children)
    print(alist)

if __name__ == '__main__':
    main()

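When run, this prints (the key order within each dict may vary by Python version):

 
 1
  1.1
  1.2
  Note: Note for 1.2
 2
 3
 Note: Note for root

[{'title': '1', 'children': [{'title': '1.1', 'children': []}, {'title': '1.2', 'children': [], 'notes': ['Note for 1.2']}]}, {'title': '2', 'children': []}, {'title': '3', 'children': [], 'notes': ['Note for root']}]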

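Apart from the order of the keys (which is immaterial, and which a dict does not guarantee anyway), this is almost as requested. The exception is that here all notes appear as dict entries with the key notes and a list of strings as the value (the notes entry is omitted when the list is empty, roughly as in the example in the question).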

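In the current version of the question, how notes should be represented is somewhat unclear: one note appears as a standalone string, while the others appear as entries whose value is a string (rather than the list of strings I use). Nothing says why a note should be a standalone string in one case and a dict entry in all the others, so the scheme I use is more regular. And if a note (when present) is a single string rather than a list, does that mean it is an error for a node to have more than one note? On that point the scheme I use is more general: it lets a node have any number of notes from 0 upwards, instead of only 0 or 1 as the question apparently implies.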
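If the single-string form turns out to be what the question wants, the list-based structure can be converted afterwards. The following is a minimal sketch under that assumption; the helper name `collapse_single_notes` is hypothetical, and keeping a list when a node has several notes is only one possible policy.

```python
def collapse_single_notes(node):
    """Return a copy of node in which a one-element 'notes' list becomes a string."""
    result = {
        "title": node["title"],
        "children": [collapse_single_notes(c) for c in node["children"]],
    }
    notes = node.get("notes", [])
    if len(notes) == 1:
        result["notes"] = notes[0]      # exactly one note: plain string
    elif notes:
        result["notes"] = list(notes)   # several notes: keep the list
    return result


node = {"title": "1.2", "children": [], "notes": ["Note for 1.2"]}
print(collapse_single_notes(node))
```

Doing the conversion as a separate pass leaves the parser itself on the more general list form.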

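Having written this much code (the pre-edit answer was about as long, and helped to clarify and change the spec) to provide what I hope is 99% of the desired solution, I hope this satisfies the original poster. The last few tweaks to the code and/or the spec needed to make them match each other should be easy for him to make.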

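User

Answer 3 · edited 2024-09-25 00:23:57

A stack is a very useful data structure when parsing trees. You just keep the path from the most recently added node up to the root on the stack at all times, so that you can find the correct parent from the length of the indent. Something like this should work for parsing your last example: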

import re

# Leading spaces, then a '*' (item) or '-' (note) marker, one space, the text.
line_tokens = re.compile(r'( *)(\*|-) (.*)')

def parse_tree(data):
    # stack[0] is the root; stack[k] is the open node at indent k - 1.
    stack = [{'title': 'Root node', 'children': []}]
    for line in data.split("\n"):
        match = line_tokens.match(line)
        if match is None:
            continue # Skip blank or malformed lines instead of crashing
        indent, symbol, content = match.groups()
        while len(indent) + 1 < len(stack):
            stack.pop() # Remove everything up to current parent
        if symbol == '-':
            stack[-1].setdefault('notes', []).append(content)
        elif symbol == '*':
            node = {'title': content, 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node) # Add as the current deepest node
    return stack[0]
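One way to check a tree in this nested-dict shape is to render it back into outline lines and compare them with the input. The sketch below assumes the `title` / `children` / `notes` keys used above and one space of indentation per level; the function name `render_outline` is hypothetical.

```python
def render_outline(node, depth=0):
    """Render the children and notes of node as outline lines."""
    lines = []
    for child in node.get("children", []):
        lines.append(" " * depth + "* " + child["title"])
        # The child's own children and notes sit one level deeper.
        lines.extend(render_outline(child, depth + 1))
    for note in node.get("notes", []):
        lines.append(" " * depth + "- " + note)
    return lines


tree = {
    "title": "Root node",
    "children": [
        {"title": "1", "children": [
            {"title": "1.1", "children": []},
            {"title": "1.2", "children": [], "notes": ["Note for 1.2"]},
        ]},
        {"title": "2", "children": []},
    ],
    "notes": ["Note for root"],
}
print("\n".join(render_outline(tree)))
```

Notes are emitted after the children of a node, which matches the sample data, where the root's note comes last.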
