spaCy PhraseMatcher: matching all keywords in a string, not just one keyword

Published 2024-10-02 08:24:51


I am trying to classify texts into buckets based on keywords. This is fairly easy when I need to match a text against one or more keywords (so that any one of them appearing in the text is enough), but I am struggling to understand how to do the matching when I need to make sure that several keywords are all present in the string.

Here is a small sample. Say my dfArticles is a pandas DataFrame with a Text column containing the text articles I am trying to match:

dfArticles['Text']
Out[2]: 
0       (Reuters) - Major Middle Eastern markets ended...
1       MIDEAST STOCKS-Oil price fall hurts major Gulf...
2       DUBAI, 21st September, 2020 (WAM) -- The Minis...
3       DUBAI, (UrduPoint / Pakistan Point News / WAM ...
4       Brent crude was down 99 cents or 2.1% at $42.2.

Let's also say my DataFrame dfTopics contains the list of keywords I am trying to match, along with the bucket associated with each keyword:

dfTopics
Out[3]: 
            Topic              Keywords
0     Regulations                   law
1     Regulations            regulatory
2     Regulations            regulation
3     Regulations           legislation
4     Regulations                 rules
5          Talent            capability
6          Talent             workforce

When I only need to check whether the text matches any one of the keywords, it is simple:

def prep_match_patterns(dfTopics):
    
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    
    for topic in dfTopics['Topic'].unique():
        keywords = dfTopics.loc[dfTopics['Topic'] == topic, 'Keywords'].to_list()
        patterns_topic = [nlp.make_doc(text) for text in keywords]
        matcher.add(topic, None, *patterns_topic)
    return matcher

Then I can easily check in one shot which bucket a text belongs to:

nlp = spacy.load("en_core_web_lg")
nlp.disable_pipes(["parser"])
# extract the sentences from the documents
nlp.add_pipe(nlp.create_pipe('sentencizer'))

matcher = prep_match_patterns(dfTopics)


dfResults = pd.DataFrame([],columns=['ArticleID', 'Topic'])


articles = []
topics = []


for index, row in tqdm(dfArticles.iterrows(), total=len(dfArticles)):
    doc = nlp(row['Text'])
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        articles.append(row['ID'])
        topics.append(string_id)
    
dfResults['ArticleID'] = articles
dfResults['Topic'] = topics


dfResults.drop_duplicates(inplace=True)
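(As an aside, the snippets above use the spaCy v2 API: `nlp.create_pipe(...)` and `matcher.add(name, None, *patterns)`. Under spaCy v3, pipes are added by registered string name and `PhraseMatcher.add` takes a list of pattern Docs. A rough sketch of the v3 equivalent, using a blank English pipeline instead of en_core_web_lg for brevity:)

```python
import spacy
from spacy.matcher import PhraseMatcher

# spaCy v3 style: add pipes by registered string name
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
# v3: matcher.add takes a list of pattern Docs (no `None` callback argument)
matcher.add("Regulations",
            [nlp.make_doc(kw) for kw in ["law", "regulation", "rules"]])

doc = nlp("The new Regulation tightens the law.")
matched_topics = sorted({nlp.vocab.strings[mid] for mid, start, end in matcher(doc)})
print(matched_topics)  # → ['Regulations']
```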

But here is the tricky part: sometimes, to classify a text into a bucket, I need to make sure it matches several keywords at the same time.

Say I have a new topic called "Healthcare system context", and to put a text into this bucket I need the text to contain all three substrings: "fragmented", "approval process", and "drug". The order doesn't matter, but all three keywords need to be present. Is there any way to do this with the PhraseMatcher?


1 answer

#1 · Posted 2024-10-02 08:24:51

I think you are overcomplicating this. You can achieve what you want with plain Python.

Say we have:

df_topics
    Topic   Keywords
0   Regulations law
1   Regulations regulatory
2   Regulations regulation
3   Regulations legislation
4   Regulations rules
5   Talent  capability
6   Talent  workforce

Then you can organize the topic keywords into a dictionary:

topics = df_topics.groupby("Topic")["Keywords"].agg(lambda x: x.to_list()).to_dict()
topics
{'Regulations': ['law', 'regulatory', 'regulation', 'legislation', 'rules'],
 'Talent': ['capability', 'workforce']}

Finally, define a function to match the keywords:

def textToTopic(text, topics):
    matched = []
    for topic, keywords in topics.items():
        # require every keyword of the topic to appear as a whole word
        if all(kw in text.split() for kw in keywords):
            matched.append(topic)
    return matched

Demo:

textToTopic("law regulatory regulation rules legislation workforce", topics)
['Regulations']

textToTopic("law regulatory regulation rules legislation workforce capability", topics)
['Regulations', 'Talent']
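One caveat: because textToTopic checks membership in `text.split()`, a multi-word keyword such as "approval process" (from the original question) can never match. A minimal sketch of a variant that handles multi-word keywords via a substring check on the lowercased text, at the cost of occasional partial-word hits (e.g. "drug" inside "drugstore"):

```python
def textToTopicMultiword(text, topics):
    # substring check on the lowercased text, so multi-word
    # keywords like "approval process" also match
    low = text.lower()
    return [topic for topic, keywords in topics.items()
            if all(kw.lower() in low for kw in keywords)]

topics = {"Healthcare system context": ["fragmented", "approval process", "drug"]}
print(textToTopicMultiword(
    "A fragmented approval process slows every new drug down.", topics))
# → ['Healthcare system context']
```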

You can apply this function to the texts in your df.
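For example, a sketch assuming the topics dict from above and a toy stand-in for the dfArticles frame:

```python
import pandas as pd

def textToTopic(text, topics):
    return [k for k, v in topics.items()
            if all(kw in text.split() for kw in v)]

topics = {"Regulations": ["law", "regulatory", "regulation",
                          "legislation", "rules"],
          "Talent": ["capability", "workforce"]}

dfArticles = pd.DataFrame(
    {"Text": ["law regulatory regulation rules legislation news",
              "capability workforce planning"]})

# one list of matched topics per article
dfArticles["Topics"] = dfArticles["Text"].apply(lambda t: textToTopic(t, topics))
print(dfArticles["Topics"].tolist())
# → [['Regulations'], ['Talent']]
```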
