Extracting words from two lists from sentences in a data frame

Posted 2024-09-30 12:27:58


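I have text in a pandas data frame.

I also have two word lists. I want to check whether elements from these lists occur in each sentence, and extract every anatomy/event pair separated by a colon (an element should still be extracted when it has no partner in the sentence).

For example: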

patternAnatomy = "oesophagus|stomach|duodenum"
patternEvent = "clip|RFA|balloon|biopsy"

Example text:

There was a need to place a clip in the oesophagus. One biopsy was taken. There is a long duodenum. The stomach had a balloon placed

This should extract: oesophagus:clip,:biopsy,duodenum:,stomach:balloon

To split the text into individual sentences, I have tried:

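import re
from spacy.lang.en import English

nlp = English()
# spaCy 2.x style; in spaCy 3.x the equivalent is nlp.add_pipe('sentencizer')
nlp.add_pipe(nlp.create_pipe('sentencizer'))

def tokenizeAndList(text):
    # Return a list of sentence strings; pass non-string values through unchanged
    if isinstance(text, str):
        doc = nlp(text)
        # sent.string is spaCy 2.x; spaCy 3.x uses sent.text
        return [sent.string.strip() for sent in doc.sents]
    else:
        return text

Mypanda['findings2'] = Mypanda['findings'].map(tokenizeAndList, na_action='ignore')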

Then:

Mypanda['findings2'].apply(lambda row: row.findall("("+patternEvent+")",re.IGNORECASE))

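But this fails (each row is a list of sentences, and a list has no findall method), and in any case it would only search for elements from one of the two lists.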


2 Answers

You can write a function and then apply it to the data frame:

text = 'There was a need to place a clip in the oesophagus. One biopsy was taken. There is a long duodenum. The stomach had a balloon placed'
patternAnatomy = "oesophagus|stomach|duodenum"
patternEvent = "clip|RFA|balloon|biopsy"

def split_text(text, patternAnatomy, patternEvent):
    s = [sentence.split() for sentence in text.split('.')]
    ana = patternAnatomy.split('|')
    eve = patternEvent.split('|')
    whitelist = ana + eve

    l = list()
    for sentence in s:
        l_ana = list()
        l_eve = list()
        for word in sentence:
            if word in ana:
                l_ana.append(word)
            if word in eve:
                l_eve.append(word)
        l.append([l_ana, l_eve])

    return ['_'.join(tup[0])+':'+'_'.join(tup[1]) for tup in l]

split_text(text, patternAnatomy, patternEvent)
# Out[14]: ['oesophagus:clip', ':biopsy', 'duodenum:', 'stomach:balloon']

It would be better to pass s, ana, eve and whitelist in as arguments rather than recomputing them on every call.
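One way to avoid recomputing the word lists on every call is to parse them once and return a function that closes over them. The following is only a sketch under some assumptions: `make_extractor` is a hypothetical helper name, matching is made case-insensitive by lowercasing (so `RFA` comes back as `rfa`), and words are tokenized with `\w+` so that trailing punctuation does not block a match.

```python
import re

def make_extractor(pattern_anatomy, pattern_event):
    """Build an extractor with the word lists parsed once (hypothetical helper)."""
    ana = set(pattern_anatomy.lower().split('|'))
    eve = set(pattern_event.lower().split('|'))

    def extract(text):
        pairs = []
        # Naive sentence split on '.', as in the answer above
        for sentence in text.split('.'):
            words = re.findall(r'\w+', sentence.lower())
            if not words:
                continue  # skip empty pieces, e.g. after a trailing '.'
            found_ana = [w for w in words if w in ana]
            found_eve = [w for w in words if w in eve]
            pairs.append('_'.join(found_ana) + ':' + '_'.join(found_eve))
        return pairs

    return extract

extract = make_extractor("oesophagus|stomach|duodenum", "clip|RFA|balloon|biopsy")
result = extract("There was a need to place a clip in the oesophagus. One biopsy was taken.")
print(result)
```

The returned `extract` function can then be passed to `Series.map`, so the sets are built once for the whole column rather than once per row.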

k=patternAnatomy+'|'+patternEvent
df['extract']=df['text'].str.findall(k)
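Note that `str.findall` with the combined pattern returns one flat list of matches per row, so it does not show which anatomy term belongs with which event. A possible way to get the colon-separated pairs from the question is to run the two patterns separately on each sentence. This is a minimal sketch, not a definitive implementation: `pair_sentences` is a hypothetical helper, sentences are split naively on '.', and the `\b` word boundaries and `re.IGNORECASE` flag are assumptions about the matching that is wanted.

```python
import re
import pandas as pd

patternAnatomy = "oesophagus|stomach|duodenum"
patternEvent = "clip|RFA|balloon|biopsy"

def pair_sentences(text):
    """Return 'anatomy:event' for each sentence, comma-joined (hypothetical helper)."""
    pairs = []
    for sentence in text.split('.'):
        if not sentence.strip():
            continue  # skip empty pieces, e.g. after a trailing '.'
        ana = re.findall(r'\b(?:' + patternAnatomy + r')\b', sentence, flags=re.IGNORECASE)
        eve = re.findall(r'\b(?:' + patternEvent + r')\b', sentence, flags=re.IGNORECASE)
        pairs.append('_'.join(ana) + ':' + '_'.join(eve))
    return ','.join(pairs)

df = pd.DataFrame({'text': [
    'There was a need to place a clip in the oesophagus. One biopsy was taken. '
    'There is a long duodenum. The stomach had a balloon placed',
    None,  # missing values are left untouched by na_action='ignore'
]})
df['extract'] = df['text'].map(pair_sentences, na_action='ignore')
print(df['extract'][0])
```

For the example text this gives the string requested in the question. A sentence containing no term from either list would contribute a bare ':' here; filter those out if they are not wanted.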
