Answers to the question: Python text extraction


I'm doing text extraction with Python, and the output is not what I want. I have a text file containing information like this:
<pre><code>FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Chen, G
   Gully, SM
   Whiteman, JA
   Kilcullen, RN
AF Chen, G
   Gully, SM
   Whiteman, JA
   Kilcullen, RN
TI Examination of relationships among trait-like individual differences, state-like individual differences, and learning performance
SO JOURNAL OF APPLIED PSYCHOLOGY
CT 13th Annual Conference of the Society-for-Industrial-and-Organizational-Psychology
CY APR 24-26, 1998
CL DALLAS, TEXAS
SP Soc Ind & Org Psychol
RI Gully, Stanley/D-1302-2012
OI Gully, Stanley/0000-0003-4037-3883
SN 0021-9010
PD DEC
PY 2000
VL 85
IS 6
BP 835
EP 847
DI 10.1037//0021-9010.85.6.835
UT WOS:000165745400001
PM 11125649
ER
</code></pre>
When I use code like this:
<pre><code>import random
import sys

filepath = "data\jap_2000-2001-plain.txt"

with open(filepath) as f:
    articles = f.read().strip().split("\n")

articles_list = []

author = ""
title = ""
year = ""
doi = ""

for article in articles:
    if "AU" in article:
        author = article.split("#")[-1]
    if "TI" in article:
        title = article.split("#")[-1]
    if "PY" in article:
        year = article.split("#")[-1]
    if "DI" in article:
        doi = article.split("#")[-1]
    if article == "ER#":
        articles_list.append("{}, {}, {}, https://doi.org/{}".format(author, title, year, doi))

print("Oh hello sir, how many articles do you like to get?")
amount = input()

random_articles = random.sample(articles_list, k = int(amount))

for i in random_articles:
    print(i)
    print("\n")

exit = input('Please enter exit to exit: \n')
if exit in ['exit','Exit']:
    print("Goodbye sir!")
    sys.exit()
</code></pre>
the extraction does not include data that comes after a line break. If I run this code, the output looks like "AU Chen, G" and does not include the other names. The same happens with the title and other fields.

My output looks like this:

Chen, G. Examination of relationships among trait-like, 2000, doi.dx.10.1037//0021-9010.85.6.835

The desired output should be:

Chen, G., Gully, SM., Whiteman, JA., Kilcullen, RN., 2000, Examination of relationships among trait-like individual differences, state-like individual differences, and learning performance, doi.dx.10.1037//0021-9010.85.6.835

But the extraction only includes the first line of each field. Any suggestions?


1 answer

Anonymous, 1 day ago

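Initial understanding

Based on your example, I believe that:
<ul>
<li>The text is provided in lines.</li>
<li>The example text appears to have too many line breaks, possibly an artifact of migrating it from DOS/Windows. If so, either CRLF handling is needed, or alternate lines should be ignored.</li>
<li>The lines are divided into sections.</li>
<li>Each section is delimited by a two-letter uppercase tag in columns 0 and 1 of the section's first line, and continues until the start of a new section.</li>
<li>Each line has either a tag or two spaces in columns 0 and 1, followed by a space in column 2.</li>
<li>An artificial section delimited by the tag <code>ER</code> marks the end of a record.</li>
<li>The <code>ER</code> section contains no usable text.</li>
</ul>
It may also be the case that:
<ul>
<li>Records are begun by the <code>FN</code> tag.</li>
<li>Any text encountered outside of an <code>FN / ER</code> pair can be ignored.</li>
</ul>
Suggested design

If this is true, I suggest you write a text processor using that logic:
<ul>
<li>Read in the lines.</li>
<li>Handle CR/LF processing; or skip alternate lines; or "don't worry about it, the real text doesn't have these line breaks"?</li>
<li>Use a state machine with an unknown number of states, with the initial state being <code>ER</code>.</li>
<li>Special rule: ignore text in the <code>ER</code> state until an <code>FN</code> line is encountered.</li>
<li>General rule: when a tag is seen, end the previous state and begin a new state named after the tag that was seen. Any accumulated text is added to the record.</li>
<li>If no tag is seen, accumulate the text under the previous tag.</li>
<li>Special rule: when the <code>ER</code> state is entered, add the accumulated record to the list of accumulated records.</li>
</ul>
At the end of this process, you will have a list of records, each holding the text accumulated for its various tags. You can then process the tags in various ways.

Something like this:
<pre><code>from warnings import warn

Debug = True


def read_lines_from(file):
    """Read and split lines from file.

    This is a separate function, instead of just using file.readlines(),
    in case extra work is needed like dos-to-unix conversion inside a
    unix environment.
    """
    with open(file) as f:
        text = f.read()
    # splitlines() copes with both '\n' and '\r\n' line endings.
    lines = text.splitlines()
    return lines


def parse_file(file):
    """Parse file in format given by
    https://stackoverflow.com/questions/54520331
    """
    lines = read_lines_from(file)

    state = 'ER'
    records = []
    current = None

    for line_no, line in enumerate(lines):
        # Columns 0-1 hold the tag (or two spaces), column 2 is a separator.
        tag, rest = line[:2], line[3:]

        if Debug:
            print(f"State: {state}, Tag: {tag}, Rest: {rest}")

        # Skip empty (or whitespace-only) lines
        if not line.strip():
            if Debug:
                print(f"Skip empty line at {line_no}")
            continue

        if tag == '  ':
            # Continuation line: append text, except in ER state.
            if state != 'ER':
                if Debug:
                    print(f"Append text to {state}: {rest}")
                current[state].append(rest)
            continue

        # Found a tag. Process it.

        # Special rule from the design: between records, ignore everything
        # until an FN tag. This assumes every record starts with FN.
        if state == 'ER' and tag != 'FN':
            continue

        if tag == 'ER':
            if Debug:
                print("Tag 'ER'. Completed record:")
                print(current)
            records.append(current)
            current = None
            state = tag
            continue

        if tag == 'FN':
            if state != 'ER':
                warn(f"Found 'FN' tag without previous 'ER' at line {line_no}")
                if current:
                    warn(f"Previous record (FN:{current['FN']}) discarded.")
            if Debug:
                print("Tag 'FN'. Create empty record.")
            current = {}

        # All tags except ER get this:
        if Debug:
            print(f"Tag '{tag}'. Create list with rest: {rest}")
        current[tag] = [rest]
        state = tag

    return records


if __name__ == '__main__':
    records = parse_file('input.txt')
    print('Records =', records)
</code></pre>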
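To get from the parsed records to the citation line the question asks for, one possible follow-up step is sketched below. It assumes each record is a dict mapping a tag to the list of text lines collected for it, which is the shape `parse_file` builds. The helper name `format_citation`, the output layout, and the place where the sample title wraps are all assumptions made for illustration, not part of the original answer.

```python
def format_citation(record):
    """Build one citation line from a record shaped like parse_file's output.

    `record` maps each two-letter tag to the list of text lines collected
    for that tag, e.g. {"AU": ["Chen, G", "Gully, SM"], "PY": ["2000"]}.
    The name and the output layout are hypothetical.
    """
    # The AU section holds one author per line; add the trailing period
    # shown in the desired output.
    authors = ", ".join(name.strip() + "." for name in record.get("AU", []))
    # A long title may wrap over several lines; join the pieces with spaces.
    title = " ".join(part.strip() for part in record.get("TI", []))
    year = " ".join(record.get("PY", [])).strip()
    doi = " ".join(record.get("DI", [])).strip()
    return f"{authors}, {year}, {title}, https://doi.org/{doi}"


# Hand-written sample record; the title's wrap position is assumed.
sample_record = {
    "AU": ["Chen, G", "Gully, SM", "Whiteman, JA", "Kilcullen, RN"],
    "TI": [
        "Examination of relationships among trait-like individual differences,",
        "state-like individual differences, and learning performance",
    ],
    "PY": ["2000"],
    "DI": ["10.1037//0021-9010.85.6.835"],
}

citation = format_citation(sample_record)
print(citation)
```

Because every tag keeps a list of lines, multi-line fields such as the author list and the title come back complete, which is the part the original line-by-line approach was losing.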
