Extracting information using two separate lists

Posted 2024-09-30 04:35:13


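I want to extract certain information from a large file using Python. I have 3 input files. The first input file (input_file) is the data file, a 3-column tab-separated file that looks like this: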

engineer-n imposition-n 2.82169386609e-05
motor-n imposition-n 0.000102011705117
creature-n imposition-n 0.000121321951973
bomb-n imposition-n 0.000680302090112
sedation-n oppression-n 0.000397074586994
roadblock-n oppression-n 5.96190620847e-05
liability-n oppression-n 0.012845281978
currency-n oppression-n 0.000793989880202

The second input file (colA_file) is a single-column list that looks like this:

bomb-n
sedation-n
roadblock-n
surrender-n

The third input file (colB_file) is also a one-column list (same format as colA_file, with different content) that looks like this:

adjective-n
homeless-n
imposition-n
oppression-n

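I want to extract from the input file the rows whose first column appears in colA and whose second column appears in colB. For the sample data I have provided, this means filtering out everything except the following lines: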

bomb-n imposition-n 0.000680302090112
sedation-n oppression-n 0.000397074586994
roadblock-n oppression-n 5.96190620847e-05

I wrote the following Python code to solve this task:

def test_fnc(input_file, colA_file, colB_file, output_file):
    nounA = []
    with open(colA_file, "rb") as opened_colA:
        for aLine in opened_colA:
            nounA.append(aLine.strip())
            #print nounA

    nounB = []
    with open(colB_file, "rb") as opened_colB:
        for bLine in opened_colB:
            nounB.append(bLine.strip())
            #print nounB

    with open(output_file, "wb") as outfile:
        with open(input_file, "rb") as opened_input:
            for cLine in opened_input:
                splitted_cLine = cLine.split()
                #print splitted_cLine
                if splitted_cLine[0] in nounA and splitted_cLine[1] in nounB:
                    outstring = "\t".join(splitted_cLine)
                    outfile.write(outstring + "\n")

test_fnc(input_file, colA_file, colB_file, output_file)

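However, it outputs only one line, as if it were not iterating over the list inputs provided. My lists also appear to be appended to themselves, starting with one item and growing by one with each additional item. So I also tried assigning the lists like this: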

    for bLine in opened_colB:
        nounB = bLine

with the same result as above.


2 Answers

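If you don't mind the dependency, I would use pandas/numpy: load the file with pandas and run membership checks on its columns. Otherwise I would suggest using sets, since regex should be much slower. Something like this: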

with open(colA_file, "rb") as file_h:
    noun_a = set(line.strip() for line in file_h)

with open(colB_file, "rb") as file_h:
    noun_b = set(line.strip() for line in file_h)

with open(output_file, "wb") as outfile:
    with open(input_file, "rb") as opened_input:
        for line in opened_input:
            split_line = line.split()
            if split_line[0] in noun_a and split_line[1] in noun_b:
                outfile.write(line)
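
For Python 3, where files opened in text mode yield `str` rather than bytes, the set-based filter above might be packaged as a function along these lines. This is only a minimal sketch, not the answerer's code: the names `load_terms` and `filter_rows` and the `encoding` parameter are assumptions made for illustration, and the sketch additionally skips blank or short rows.

```python
def load_terms(path, encoding="utf-8"):
    """Read a one-column file into a set, ignoring blank lines."""
    with open(path, encoding=encoding) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_rows(input_file, colA_file, colB_file, output_file, encoding="utf-8"):
    """Copy rows whose first field is in colA_file and second is in colB_file.

    Returns the number of rows written.
    """
    noun_a = load_terms(colA_file, encoding)
    noun_b = load_terms(colB_file, encoding)
    written = 0
    with open(input_file, encoding=encoding) as src, \
            open(output_file, "w", encoding=encoding) as dst:
        for line in src:
            fields = line.split()
            # Skip blank or malformed rows instead of raising IndexError.
            if len(fields) < 2:
                continue
            if fields[0] in noun_a and fields[1] in noun_b:
                dst.write("\t".join(fields) + "\n")
                written += 1
    return written
```

Set membership tests take constant time on average, so the cost stays roughly proportional to the size of the data file however long the two lists are. Text mode also normalizes `\r\n` and `\r` line endings on read, which may matter if the list files were saved on another platform.
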

# Regex-based approach: build one pattern per word list.
import re

nounA = []
with open('col1.txt', "r") as opened_colA:
    for aLine in opened_colA:
        nounA.append(aLine.strip())

# One alternation of whole-word patterns for the first list.
patterns = [r'\b%s\b' % re.escape(s) for s in nounA]
col1 = re.compile('|'.join(patterns))

nounB = []
with open('col2.txt', "r") as opened_colB:
    for bLine in opened_colB:
        nounB.append(bLine.strip())

# Same for the second list.
patterns = [r'\b%s\b' % re.escape(s) for s in nounB]
col2 = re.compile('|'.join(patterns))

with open('test1.txt', "r") as opened_input:
    for cLine in opened_input:
        # Keep the line only if both patterns match somewhere in it.
        if col1.search(cLine) and col2.search(cLine):
            print(cLine, end='')

# Just write cLine to your output file instead of printing it.

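Explanation: first, I take all the words from colA and build a regular expression from them; likewise for col2. Then I search each line of the input file with both regular expressions and print the matching lines.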

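'\b' is a word boundary. It is useful when you are searching for the word 'cat' but a plain search would also match 'catch'; with '\b', only the standalone word 'cat' is found.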
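
The effect of `\b` can be checked with a short sketch. The helper name `build_pattern` is hypothetical; it simply mirrors how the answer joins escaped words into a single alternation.

```python
import re


def build_pattern(words):
    """Compile one regex matching any of the words as a whole word."""
    return re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words))


bounded = build_pattern(["cat"])
unbounded = re.compile("cat")

print(bool(unbounded.search("catch the ball")))  # True: 'cat' is inside 'catch'
print(bool(bounded.search("catch the ball")))    # False: no boundary after 'cat'
print(bool(bounded.search("the cat sat")))       # True: 'cat' stands alone
```

One caveat worth keeping in mind: `-` is not a word character, so a pattern built from `bomb-n` would also match inside a longer hyphenated token such as `time-bomb-n`. The regex approach also searches the whole line rather than a specific column, so it is looser than the set-based column check.
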
