Python text extraction

2 answers

User

Answer 1 · edited 2024-09-30 16:20:57

Initial understanding

Based on your example, I believe that:

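Text is provided in lines.
The example text appears to have too many newlines, possibly an artifact of being migrated from DOS/Windows. If so, either CRLF processing is needed, or alternate lines should be ignored.
The lines are divided into sections.
Each section is delimited by a two-letter uppercase tag in columns 0-1 of the section's first line, and continues until the start of a new section.
Each line has, in columns 0-2, either a tag or two spaces, followed by a space.
The artificial section delimited by the tag ER marks the end of a record.
The ER section contains no usable text.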
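
The layout assumptions above can be made concrete with a short sketch. The helper names `normalize_newlines` and `split_tag` are hypothetical, and the sample lines are invented for illustration rather than taken from the real input.

```python
def normalize_newlines(text):
    """Convert DOS/Windows (CRLF) and bare CR line endings to LF."""
    return text.replace('\r\n', '\n').replace('\r', '\n')


def split_tag(line):
    """Split a line into (tag, text).

    Columns 0-1 hold a two-letter uppercase tag or two spaces, column 2
    holds a space, and the text starts at column 3. A continuation line
    (two leading spaces) or a blank line yields a tag of None.
    """
    tag, text = line[:2], line[3:]
    if len(tag) == 2 and tag.isalpha() and tag.isupper():
        return tag, text
    return None, text


# Invented sample data, for illustration only.
sample = "FN Example\r\nAU Smith, J\r\n   Jones, K\r\nER\r\n"
for line in normalize_newlines(sample).split('\n'):
    print(split_tag(line))
```

If the real file has blank lines between every pair of text lines, they come back with a tag of None and empty text, so they are easy to skip.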

It may also be the case that:

Records are begun by the FN tag.
Any text encountered outside an FN / ER pair can be ignored.

Suggested design

If this is true, I recommend writing a text processor using that logic:

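Read the lines.
Handle CR/LF processing, or skip alternate lines, or decide not to worry because the real text does not have these extra line breaks.
Use a state machine with an unknown number of states, with ER as the initial state.
Special rule: ignore text in the ER state until an FN line is encountered.
General rule: when a tag is seen, end the previous state and begin a new state named after that tag. Any accumulated text is added to the record.
If no tag is seen, accumulate the text under the previous tag.
Special rule: when the ER state is entered, add the accumulated record to the list of accumulated records.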

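At the end of this process you will have a list of records, each holding the text accumulated under its various tags. You can then process the tags in whatever way you need.

Something like this: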

from warnings import warn

Debug = True

def read_lines_from(file):
    """Read and split lines from file. This is a separate function, instead
       of just using file.readlines(), in case extra work is needed like
       dos-to-unix conversion inside a unix environment.
    """
    with open(file) as f:
        text = f.read()
        lines = text.split('\n')

    return lines

def parse_file(file):
    """Parse file in format given by 
        https://stackoverflow.com/questions/54520331
    """
    lines = read_lines_from(file)
    state = 'ER'
    records = []
    current = None

    # Count from 1 so reported line numbers match what an editor shows.
    for line_no, line in enumerate(lines, start=1):
        tag, rest = line[:2], line[3:]

        if Debug:
            print(F"State: {state}, Tag: {tag}, Rest: {rest}")

        # Skip empty lines
        if tag == '':
            if Debug:
                print(F"Skip empty line at {line_no}")
            continue

        if tag == '  ':
            # Append text, except in ER state.
            if state != 'ER':
                if Debug:
                    print(F"Append text to {state}: {rest}")
                current[state].append(rest)
            continue

        # Found a tag. Process it.

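        if tag == 'ER':
            if Debug:
                print("Tag 'ER'. Completed record:")
                print(current)

            # Only store a record that was actually opened by an FN tag;
            # a stray ER would otherwise append None to the list.
            if current is not None:
                records.append(current)
            current = None
            state = tag
            continue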

        if tag == 'FN':
            if state != 'ER':
                warn(F"Found 'FN' tag without previous 'ER' at line {line_no}")
                if len(current.keys()):
                    warn(F"Previous record (FN:{current['FN']}) discarded.")

            if Debug:
                print("Tag 'FN'. Create empty record.")

            current = {}

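        # Special rule: ignore tagged lines outside an FN/ER pair.
        # Without this guard, current is None here and the assignment fails.
        if current is None:
            if Debug:
                print(F"Ignore tag '{tag}' outside a record at line {line_no}")
            continue

        # All tags except ER get this:
        if Debug:
            print(F"Tag '{tag}'. Create list with rest: {rest}")

        current[tag] = [rest]
        state = tag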

    return records

if __name__ == '__main__':
    records = parse_file('input.txt')
    print('Records =', records)
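
One way to carry out the later processing step is to join each tag's list of lines into a single string. This is a minimal sketch: the helper name `flatten_record` and the sample record are hypothetical, and whether a given tag should be joined with spaces or kept as separate lines depends on the data.

```python
def flatten_record(record, sep=' '):
    """Join each tag's list of text lines into one string."""
    return {tag: sep.join(lines) for tag, lines in record.items()}


# Hypothetical record in the shape produced by parse_file():
# each tag maps to a list of text lines.
record = {
    'FN': ['Example'],
    'AU': ['Smith, J', 'Jones, K'],
    'TI': ['A title that wraps', 'onto a second line'],
}

flat = flatten_record(record)
print(flat['TI'])
```

For a tag such as AU, where each line is probably a separate item, passing a different separator (for example `sep='; '`) may be a better fit than a plain space.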

User
Answer 2 · edited 2024-09-30 16:20:57

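When parsing the file, you need to keep track of which section you are in. There are more concise ways to write the state machine, but as a quick and simple example you could do something like the following.
Basically, add all the lines of each section to a list for that section, then join the lists and do whatever you need at the end. Note that I have not tested this; it is only pseudocode to give you the general idea.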
authors = []
title = []
section = None

# Assumption: the input is read line by line from a file; 'input.txt' is a
# placeholder name. Any iterable of lines would work as `articles`.
with open('input.txt') as articles:
    for line in articles:
        line = line.strip()

        # Check for start of new section, select the right list to add to
        if line.startswith("AU"):
            line = line[3:]
            section = authors
        elif line.startswith("TI"):
            line = line[3:]
            section = title
        # Other sections.. ...

        # Add line to the current section
        if line and section is not None:
            section.append(line)

authors_str = ', '.join(authors)
title_str = ' '.join(title)
print(authors_str, title_str)
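
The same idea can be written without one branch per tag by keeping a dictionary of lists keyed by tag. This is only a sketch of one possible variant, not the answerer's code: the function name `collect_sections` and the sample lines are hypothetical, and it assumes every tag is two uppercase letters in columns 0-1. It does not split the input into separate records.

```python
from collections import defaultdict


def collect_sections(lines):
    """Group text lines by the two-letter tag that opened their section."""
    sections = defaultdict(list)
    current = None
    for line in lines:
        line = line.rstrip()
        tag = line[:2]
        if len(tag) == 2 and tag.isalpha() and tag.isupper():
            current = tag          # a new section starts here
            text = line[3:]
        else:
            text = line.strip()    # continuation of the current section
        if text and current is not None:
            sections[current].append(text)
    return dict(sections)


# Invented sample data, for illustration only.
sample = [
    "AU Smith, J",
    "   Jones, K",
    "TI A title that wraps",
    "   onto a second line",
]
sections = collect_sections(sample)
print(', '.join(sections['AU']), '|', ' '.join(sections['TI']))
```

Unlike the version that strips each line before testing it, this variant checks the tag before removing leading whitespace, so a continuation line whose text happens to begin with "AU" or "TI" is not mistaken for a new section.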
