Writing an AND query to find matching documents in a dataset (Python)

Published 2024-09-26 17:56:33


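I am trying to write a function called and_query that takes a string of one or more words as input and returns a list of the documents whose abstracts contain those words.

First, I put all the words in an inverted index, where id is the document's ID and abstract is its plain text.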

from collections import defaultdict

inverted_index = defaultdict(set)

for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)
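
For readers who want to run the indexing step on its own, here is a minimal sketch. The question does not show tokenize or preprocess, so the versions below are hypothetical stand-ins (whitespace splitting, punctuation stripping, lowercasing), and sample_abstracts is made-up data, not the asker's dataset.

```python
from collections import defaultdict

# Hypothetical stand-ins: the question's real tokenize/preprocess are not shown.
def tokenize(text):
    # Split on whitespace only.
    return text.split()

def preprocess(tokens):
    # Strip surrounding punctuation, lowercase, and drop empty tokens.
    cleaned = (t.strip(".,;:!?").lower() for t in tokens)
    return [t for t in cleaned if t]

def build_inverted_index(abstracts):
    # Map each term to the set of document IDs whose abstract contains it.
    index = defaultdict(set)
    for doc_id, abstract in abstracts.items():
        for term in preprocess(tokenize(abstract)):
            index[term].add(doc_id)
    return index

# Made-up sample data for illustration only.
sample_abstracts = {
    1: "Vaccine trial in the Netherlands.",
    2: "A vaccine study.",
    3: "Netherlands trial results",
}
sample_index = build_inverted_index(sample_abstracts)
print(sorted(sample_index["vaccine"]))
```

Using doc_id instead of id avoids shadowing the built-in id function; the behavior is otherwise the same as the loop above.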

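Then I wrote a query function, where finals is the list of all matching documents.

Since it should only return documents that contain every word of the function argument, I used the set operation intersection.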

def and_query(tokens):
    documents=set()
    finals = []
    terms = preprocess(tokenize(tokens))

    for term in terms:
        for i in inverted_index[term]:
            documents.add(i)

    for term in terms:
        temporary_set= set()
        for i in inverted_index[term]:
            temporary_set.add(i)
        finals.extend(documents.intersection(temporary_set))
    return finals

def finals_print(finals):
    for final in finals:
        display_summary(final)        

finals_print(and_query("netherlands vaccine trial"))

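However, the function still seems to return documents whose abstracts contain only one of the words.

Does anyone know what I am doing wrong with the set operations?

(I think the error is in this part of the code):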

for term in terms:
    temporary_set= set()
    for i in inverted_index[term]:
        temporary_set.add(i)
    finals.extend(documents.intersection(temporary_set))
return finals 

Thanks in advance.

In short, what I want to do is:

for word in words:
    id_set_for_one_word= set()
    for  i  in  get_id_of that_word[word]:
        id_set_for_one_word.add(i)
pseudo:
            id_set_for_one_word intersection (id_set_of_other_words)

finals.extend( set of all intersections for all words)

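Then I need the intersection of the ID sets of all these words, which gives a set containing only the IDs of documents in which every one of the words occurs.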
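
The pseudocode above can be written quite compactly with set.intersection, which accepts any number of sets. This is only a sketch: ids_matching_all and toy_index are hypothetical names, and a word that is missing from the index is assumed to match no documents.

```python
def ids_matching_all(words, index):
    # One ID set per word; an unknown word contributes an empty set.
    id_sets = [index.get(word, set()) for word in words]
    if not id_sets:
        # No words given: treat as no match (an assumption of this sketch).
        return set()
    # Intersect all ID sets in a single call.
    return set.intersection(*id_sets)

# Made-up index for illustration only.
toy_index = {"netherlands": {1, 3}, "vaccine": {1, 2}, "trial": {1, 3}}
print(ids_matching_all(["netherlands", "vaccine", "trial"], toy_index))  # {1}
```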


3 Answers

To elaborate on my code comment, here is a rough sketch of what I have done before to solve this kind of problem.

def tokenize(abstract):
    #return <set of words in abstract>
    set_ = .....
    return set_

# one (id, abstract, token set) tuple per document
candidates = [(id_, abstract, tokenize(abstract)) for id_, abstract in Abstracts.items()]


all_criterias = "netherlands vaccine trial".split()


def searcher(candidates, criteria, match_on_found=True):

    search_results = []
    for cand in candidates:
        # cand[2] holds the set of tokens for the abstract
        found = criteria in cand[2]
        # match_on_found=True keeps matches (AND);
        # match_on_found=False keeps non-matches (AND NOT, if you wanted that)
        if found == match_on_found:
            search_results.append(cand)
    return search_results


for criteria in all_criterias:
    #pass in the full list every time, but it gets progressively shrunk
    candidates = searcher(candidates, criteria)

# what's left is what you want
answer = [(cand[0], cand[1]) for cand in candidates]

Question: return a list of the documents whose abstracts contain the given words

The term with the smallest number of documents always holds the result.
If a term does not exist in inverted_index, there is no match at all.

For simplicity, predefined data:

Abstracts = {1: 'Lorem ipsum dolor sit amet,',
             2: 'consetetur sadipscing elitr,',
             3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
             4: 'sed diam voluptua.',
             5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
             6: 'Stet clita kasd gubergren,',
             7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
            }


inverted_index = {'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5}, 'gubergren,': {6}, 'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7}, 'nonumy': {3}, 'voluptua.': {4}, 'est': {7}, 'elitr,': {2}, 'At': {5}, 'rebum.': {5}, 'magna': {3}, 'sadipscing': {2}, 'diam': {3, 4}, 'dolore': {3}, 'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7}, 'Lorem': {1, 7}, 'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7}, 'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7}, 'eos': {5}, 'tempor': {3}, 'amet,': {1}, 'clita': {6}, 'justo': {5}, 'eirmod': {3}}

def and_query(tokens):
    print("tokens:{}".format(tokens))
    #terms = preprocess(tokenize(tokens))
    terms = tokens.split()

    term_min = None
    for term in terms:
        if term in inverted_index:
            # Find min
            if not term_min or term_min[0] > len(inverted_index[term]):
                term_min = (len(inverted_index[term]), term)
        else:
            # Break early, if a term is not in inverted_index
            return set()

    finals = inverted_index[term_min[1]]
    print("term_min:{} inverted_index:{}".format(term_min, finals))
    return finals


def finals_print(finals):
    if finals:
        for final in finals:
            print("Document [{}]:{}".format(final, Abstracts[final]))
    else:
        print("No matching Document found")

if __name__ == "__main__":
    for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
        finals_print(and_query(tokens))
        print()

Output:

tokens:sed diam voluptua.
term_min:(1, 'voluptua.') inverted_index:{4}
Document [4]:sed diam voluptua.

tokens:Lorem ipsum dolor
term_min:(2, 'Lorem') inverted_index:{1, 7}
Document [1]:Lorem ipsum dolor sit amet,
Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

tokens:Lorem ipsum dolor test
No matching Document found

Tested with Python 3.4.2.
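
One caveat worth spelling out: the document set of the rarest term is in general only an upper bound on the AND result, so it usually still has to be intersected with the sets of the other terms. A minimal sketch of that idea follows; and_query_smallest_first is a hypothetical name, and toy is a small excerpt of the inverted_index shown above.

```python
def and_query_smallest_first(query, index):
    terms = query.split()
    # No terms, or any unknown term, means nothing can match.
    if not terms or any(term not in index for term in terms):
        return set()
    # Start from the smallest document set and narrow it down.
    postings = sorted((index[term] for term in terms), key=len)
    result = set(postings[0])  # copy, so the index is not modified
    for other in postings[1:]:
        result &= other
        if not result:
            break  # already empty, no need to continue
    return result

# Excerpt of the inverted_index above.
toy = {"sed": {3, 4}, "diam": {3, 4}, "et": {3, 5}, "voluptua.": {4}}
print(and_query_smallest_first("sed et", toy))              # {3}
print(and_query_smallest_first("sed diam voluptua.", toy))  # {4}
```

For the query "sed et" both terms occur in two documents, but only document 3 contains both, which is why the intersection step matters.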

In the end I found the solution myself. Replacing

    finals.extend(documents.intersection(id_set_for_one_word))
return finals 

with

    documents = (documents.intersection(id_set_for_one_word))
return documents

seems to work here.
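
A short sketch of why this change helps, under the assumption that documents starts out as the union of all per-term ID sets, as in the question. Intersecting that union with one term's set just gives back that term's set, so extend collects every document that contains any single term. Reassigning documents narrows the same set on each pass instead. The function names and demo_index below are hypothetical.

```python
def union_of_terms(terms, index):
    # All documents that contain at least one of the terms.
    documents = set()
    for term in terms:
        documents |= index.get(term, set())
    return documents

def and_query_extend(terms, index):
    # Mirrors the question's logic: each intersection is taken against the union.
    documents = union_of_terms(terms, index)
    finals = []
    for term in terms:
        finals.extend(documents.intersection(index.get(term, set())))
    return finals

def and_query_narrow(terms, index):
    # Mirrors the fix: the running set shrinks with every term.
    documents = union_of_terms(terms, index)
    for term in terms:
        documents = documents.intersection(index.get(term, set()))
    return documents

# Made-up index for illustration only.
demo_index = {"netherlands": {1, 3}, "vaccine": {1, 2}, "trial": {1, 3}}
demo_terms = ["netherlands", "vaccine", "trial"]
print(sorted(and_query_extend(demo_terms, demo_index)))  # [1, 1, 1, 2, 3, 3]
print(and_query_narrow(demo_terms, demo_index))          # {1}
```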

Thanks for your efforts, though.
