How to group Wikipedia categories in Python?

Published 2024-09-30 00:27:27


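For each concept in my dataset, I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.

  • hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
  • enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
  • bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
  • perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
  • climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

As you can see, the first three concepts belong to the medical domain (whereas the remaining two terms are not medical terms).

More precisely, I want to divide my concepts into medical and non-medical. However, it is very difficult to divide the concepts using the categories alone. For example, even though the two concepts enzyme inhibitor and bypass surgery are in the medical domain, their categories are very different from each other.

Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category).

I am currently using pymediawiki and pywikibot. However, I am not restricted to those two libraries and am happy to have solutions that use other libraries as well.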

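EDIT

As suggested by @IlmariKaronen, I also used the categories of categories, and the results I got are as follows (the small font next to each category shows the categories of that category).

[Image: categories of categories for the example concepts]

However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.

Moreover, as pointed out by @IlmariKaronen, using the WikiProject details could be a possibility. However, it seems that the Medicine WikiProject does not cover all the medical terms. Therefore, we would also need to check other WikiProjects.

EDIT: My current code for extracting categories from Wikipedia concepts is as follows. This can be done with either pywikibot or pymediawiki, as shown below.

  1. Using the library pymediawiki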

    from mediawiki import MediaWiki

    wikipedia = MediaWiki()
    p = wikipedia.page('enzyme inhibitor')
    print(p.categories)
    
  2. Using the library pywikibot

    import pywikibot as pw
    
    site = pw.Site('en', 'wikipedia')
    
    print([
        cat.title()
        for cat in pw.Page(site, 'support-vector machine').categories()
        if 'hidden' not in cat.categoryinfo
    ])
    

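The categories of the categories can also be obtained in the same way, as shown in @IlmariKaronen's answer.

If you are looking for a longer list of concepts for testing, I have listed more examples below.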

['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'newyork', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']

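For a very long list, please check the link below. https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

NOTE: I am not expecting the solution to work 100% (it is enough for me if the proposed algorithm is able to detect many of the medical concepts).

I am happy to provide more details if needed.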


3 Answers

You can try to classify the Wikipedia categories by the mediawiki links and backlinks returned for each category

from mediawiki import MediaWiki


# TermFind reports whether any string in termList contains the given term
def TermFind(term, termList):
    return any(term in val for val in termList)


# BoundedTerm checks that both the links and the backlinks of a page contain the term
def BoundedTerm(wikiPage, term):
    return TermFind(term, wikiPage.links) and TermFind(term, wikiPage.backlinks)


# Concepts to classify (example values taken from the question)
termlist = ['enzyme inhibitor', 'bypass surgery', 'climate']
# Term expected among the links and backlinks of medical pages,
# e.g. 'medicine', 'biology' or 'disease'
term = 'medicine'

container = []
wikipedia = MediaWiki()
for val in termlist:
    cpage = wikipedia.page(val)
    if BoundedTerm(cpage, term):
        container.append('medical')
    else:
        container.append('nonmedical')

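My idea was to try to guess a term that most of the categories share; I tried biology, medicine and disease, with good results. Maybe you can use multiple calls of BoundedTerm to make the classification, or a single call for multiple terms and combine the results. Hope it helps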
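Combining several terms, as suggested above, might look like the following sketch. It works on plain lists of link and backlink titles (such as the `links` and `backlinks` of a pymediawiki page), so it can be tried without network access. The helper names `bounded_terms` and `classify`, the `min_hits` parameter, the case-insensitive matching and the sample titles are all assumptions made for illustration, not part of the answer's code.

```python
def bounded_terms(links, backlinks, terms):
    """Return the terms found in both the links and the backlinks."""
    def found(term, titles):
        # Case-insensitive substring match (an assumption of this sketch)
        return any(term.lower() in title.lower() for title in titles)

    return [t for t in terms if found(t, links) and found(t, backlinks)]


def classify(links, backlinks, terms=("medicine", "biology", "disease"), min_hits=1):
    """Label a page 'medical' when at least min_hits terms are bounded (hypothetical rule)."""
    hits = bounded_terms(links, backlinks, terms)
    return "medical" if len(hits) >= min_hits else "nonmedical"


# Made-up link titles, only to show the call
links = ["Enzyme", "Medicine", "Drug discovery"]
backlinks = ["Outline of medicine", "Pharmacology"]
print(classify(links, backlinks))
```

Raising `min_hits` makes the rule stricter, at the cost of missing pages that mention only one of the terms.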

"Therefore, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to medical parent category)"

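MediaWiki categories are themselves wiki pages. A "parent category" is just a category that the "child" category page belongs to. So you can get the parents of a category in exactly the same way as you would get the categories of any other wiki page.

For example, using pymediawiki: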

p = wikipedia.page('Category:Enzyme inhibitors')
parents = p.categories
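Repeating that lookup walks further up the category graph. The sketch below does a depth-limited breadth-first search for a target ancestor. The function `has_ancestor`, its `get_parents` callback, the `max_depth` limit and the toy graph are assumptions made for illustration; the toy graph stands in for Wikipedia so the example runs offline.

```python
from collections import deque


def has_ancestor(category, target, get_parents, max_depth=3):
    """Breadth-first search up the category graph, at most max_depth levels."""
    seen = {category}  # the category graph can contain cycles
    queue = deque([(category, 0)])
    while queue:
        current, depth = queue.popleft()
        if current == target:
            return True
        if depth == max_depth:
            continue
        for parent in get_parents(current):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return False


# Toy category graph (made up for illustration)
toy_graph = {
    "Category:Enzyme inhibitors": ["Category:Medicinal chemistry"],
    "Category:Medicinal chemistry": ["Category:Medicine"],
    "Category:Climate": ["Category:Atmosphere"],
}


def toy_parents(name):
    return toy_graph.get(name, [])


print(has_ancestor("Category:Enzyme inhibitors", "Category:Medicine", toy_parents))
print(has_ancestor("Category:Climate", "Category:Medicine", toy_parents))
```

With pymediawiki, `get_parents` could wrap `wikipedia.page(name).categories`; whether the returned names carry the `Category:` prefix that the next lookup needs is something to check first. The depth limit matters because, a few levels up, very broad categories become ancestors of almost everything.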

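Solution overview

Okay, I would approach the problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: predict the label that more than 50% of the classifiers agree on in your binary case).

I'm thinking about the following approaches:

  • Active learning (example approach provided by me below)
  • MediaWiki backlinks, provided as another answer to this question
  • SPARQL ancestral categories, provided by @Stanislav Kralin as a comment to your question, and/or parent categories (these two could form an ensemble on their own based on their differences, but for that you would have to contact both creators and compare their results).

This way, 2 out of 3 would have to agree that a certain concept is a medical one, which further reduces the chance of error.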
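The voting rule itself is small. A minimal sketch, assuming each approach has already produced a boolean for a concept (the dictionary keys and values below are hypothetical):

```python
def majority_vote(votes):
    """Return True when more than half of the classifiers voted 'medical'."""
    return sum(votes) > len(votes) / 2


# Hypothetical outputs of the three approaches for a single concept
votes = {"active_learning": True, "backlinks": True, "sparql_ancestors": False}
print(majority_vote(list(votes.values())))
```

With an odd number of voters a tie cannot occur, which is one reason to combine three approaches rather than two.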

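While we're at it, I would argue against the approach presented by @anand_v.singh in this answer, because:

  • The distance metric should not be Euclidean; cosine similarity is a much better metric (used by, e.g., spaCy) because it does not take the magnitude of the vectors into account (and it shouldn't; that's how word2vec or GloVe were trained).
  • If I understood correctly, many artificial clusters would be created, while we only need two: medicine and non-medicine. Furthermore, the centroid of medicine is not centered on medicine itself. This poses additional problems: say the centroid is moved far away from medicine, then other words such as computer (or any other word that, in your opinion, does not fit medicine) might get into the cluster.
  • It is hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] with PCA/TSNE/similar methods for so many words would give us totally nonsensical results [yes, I have tried it; PCA gets around 5% explained variance for your longer dataset, which is really, really low]).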
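The first point can be checked numerically. In this small NumPy sketch (the vectors are made up), scaling a vector leaves its cosine similarity to the original at 1 while the Euclidean distance grows with the scaling:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; ignores their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, ten times the magnitude

print(cosine_similarity(a, b))  # close to 1.0
print(np.linalg.norm(a - b))    # Euclidean distance, grows with the scaling
```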

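Based on the problems highlighted above, I came up with a solution using active learning, which is a rather forgotten approach to such problems.

Active learning approach

In this subset of machine learning, when we have a hard time coming up with an exact algorithm (such as what it means for a term to be part of the medical category), we ask a human "expert" (who doesn't actually have to be an expert) to provide some answers.

Knowledge encoding

As anand_v.singh pointed out, word vectors are one of the most promising approaches and I will use them here as well (differently though, and IMO in a much cleaner and easier fashion).

I'm not going to repeat his points in my answer, so I will add my two cents:

  • Do not use contextualized word embeddings, the currently available state of the art (e.g. BERT).
  • Check how many of your concepts have no representation (e.g. are represented as a vector of zeros). It should be checked (and is checked in my code; there will be further discussion when the time comes), and you may use the embedding that has most of them present.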
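The coverage check from the second point could look like the sketch below. It assumes the vectors are already available as NumPy arrays in a dictionary; the helper name `missing_concepts` and the 3-dimensional values are made up for illustration.

```python
import numpy as np


def missing_concepts(vectors):
    """Return the concepts whose vector consists only of zeros."""
    return [name for name, vec in vectors.items() if not np.any(vec)]


# Made-up 3-dimensional vectors; real embeddings have hundreds of dimensions
vectors = {
    "menthol": np.array([0.2, -0.1, 0.4]),
    "vpb": np.zeros(3),
    "climate": np.array([0.0, 0.3, -0.2]),
}
missing = missing_concepts(vectors)
print(f"{len(missing)} of {len(vectors)} concepts have no representation: {missing}")
```

Running the same check against several embeddings gives a simple basis for picking the one that covers the most concepts.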

Using spaCy

This class measures the similarity between medicine, encoded as spaCy's GloVe word vector, and every other concept.

class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

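This code returns a number for each concept, measuring how similar it is to the centroid. Furthermore, it records the indices of the concepts that are missing a representation. It might be called like this: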

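import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

with open("concepts_new.txt") as f:
    concepts = json.load(f)
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)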

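You may substitute your own data in place of concepts_new.txt.

Look at spacy.load and notice that I have used en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; more info is provided in the links above.

Additionally, you may want to use multiple centroid words, e.g. add a word like disease and average their word vectors. I'm not sure whether that would affect your case positively, though.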
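Averaging several centroid words amounts to taking the mean of their vectors. A minimal NumPy sketch with made-up numbers follows; with spaCy the arrays would come from the `vector` attribute of each processed word, and the helper name `mean_centroid` is an assumption of this example.

```python
import numpy as np


def mean_centroid(word_vectors):
    """Average several word vectors into a single centroid vector."""
    return np.mean(np.stack(word_vectors), axis=0)


# Made-up vectors standing in for the embeddings of 'medicine' and 'disease'
medicine = np.array([1.0, 0.0, 2.0])
disease = np.array([3.0, 2.0, 0.0])
centroid = mean_centroid([medicine, disease])
print(centroid)
```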

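Another possibility is to use multiple centroids and calculate the similarity between each concept and each of the centroids. We may have a few thresholds in such a case; this is likely to remove some false positives, but it may miss some terms that one could consider similar to medicine. Furthermore, it would complicate the case much more, but if your results are unsatisfactory you should consider the two options above (and only if they are; don't jump into this approach without previous thought).

Now we have a rough measure of a concept's similarity. But what does it mean that a certain concept has 0.1 positive similarity to medicine? Is it a concept one should classify as medical? Or maybe that's too far away already?

Asking the expert

To get a threshold (terms below it will be considered non-medical), the easiest way is to ask a human to classify some of the concepts for us (and that's what active learning is about). Yeah, I know it's a really simple form of active learning, but I would consider it such anyway.

I have written a class with a sklearn-like interface that asks a human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.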

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
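        self.threshold_: float = 1

  • The samples argument describes how many examples will be shown to the expert during each iteration (it is the maximum; fewer are returned if the samples were already asked about or there are not enough of them to show).
  • step represents the drop of the threshold in each iteration (we start at 1, meaning perfect similarity).
  • change_multiplier: if the expert answers that the concepts are not related (or mostly unrelated, as multiple of them are returned), step is multiplied by this floating point number. It is used to pinpoint the exact threshold between step changes at each iteration.
  • Concepts are sorted by their similarity (the more similar a concept is, the higher it is placed).

The function below asks the expert for an opinion and finds the optimal threshold based on the answers.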

def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)
    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"

An example question looks like this:

Are those concepts related to medicine?

0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system

[y]es / [n]o / [any]quit y

... and parsing the expert's answer:

# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as the current threshold is already not related to medicine
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step < self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False

And finally, the whole code of ActiveLearner, which finds the optimal similarity threshold according to the expert's judgement:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self

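All in all, you will have to answer some questions manually, but in my opinion this approach is much more accurate.

Furthermore, you don't have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical term (should 40 medical samples and 10 non-medical samples shown still be considered medical?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 sample out of 50 is non-medical), I would consider the threshold to be still valid.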
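That tolerance can be written down as a rule. In the sketch below, the function `batch_is_medical` and the 0.75 cut-off are assumptions chosen for illustration; the right fraction is exactly the preference the paragraph above leaves to the reader.

```python
def batch_is_medical(labels, min_fraction=0.75):
    """Accept a batch when at least min_fraction of its samples are medical."""
    return sum(labels) / len(labels) >= min_fraction


# 40 medical and 10 non-medical samples, as in the example above
print(batch_is_medical([True] * 40 + [False] * 10))
# A single outlier among 50 samples
print(batch_is_medical([True] * 49 + [False]))
```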

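Once again: this approach should be mixed with others in order to minimize the chance of wrong classification.

Classifier

Once we obtain the threshold from the expert, classification is instantaneous; here is a simple class for it: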

class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions

And, for brevity, here is the final source code:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    with open("concepts_new.txt") as f:
        concepts = json.load(f)
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

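After answering some questions, with a threshold of 0.1 (everything in [-1, 0.1) is considered non-medical, while everything in [0.1, 1] is considered medical), I got the following results: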

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True

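As you can see, this approach is far from perfect, so the last section describes possible improvements:

Possible improvements

As mentioned at the beginning, using my approach mixed with the other answers would probably rule out ideas like sport shoe belonging to medicine, and the active learning approach would be more of a deciding vote in case of a draw between the two heuristics mentioned above.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use multiple of them (either increasing or decreasing); let's say those are 0.1, 0.2, 0.3, 0.4, 0.5.

Let's say sport shoe gets, for each threshold, its respective True/False like this: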

True True False False False

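With a majority vote we would mark it as non-medical by 3 votes to 2. Furthermore, a too strict threshold would be mitigated as well if the thresholds below it out-vote it (in case the True/False looked like this: True True True False False).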
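A minimal sketch of such a threshold ensemble follows. The function name `ensemble_predict` and the similarity value 0.25 are hypothetical; 0.25 is chosen because it reproduces the votes listed above.

```python
def ensemble_predict(similarity, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Vote once per threshold and return (label, individual votes)."""
    votes = [similarity > t for t in thresholds]
    return sum(votes) > len(votes) / 2, votes


# A hypothetical similarity of 0.25 yields the votes True True False False False
label, votes = ensemble_predict(0.25)
print(votes, "->", "medical" if label else "non-medical")
```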

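The final possible improvement I came up with: in the code above I'm using the Doc vector, which is the mean of the word vectors making up the concept. Say one word is missing (a vector consisting of zeros); in that case, the concept would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms [abbreviations like gpv or others] might be missing their representation); in such a case you could average only those vectors that are different from zero.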
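Averaging only the non-zero vectors could be sketched as follows. This works on a plain array of token vectors rather than on a spaCy Doc, and the helper name `mean_of_nonzero` and the numbers are made up for illustration.

```python
import numpy as np


def mean_of_nonzero(token_vectors):
    """Average only the token vectors that are not all zeros."""
    token_vectors = np.asarray(token_vectors, dtype=float)
    present = token_vectors[np.any(token_vectors != 0, axis=1)]
    if present.size == 0:
        # No token has a representation, so return a zero vector
        return np.zeros(token_vectors.shape[1])
    return present.mean(axis=0)


# Two tokens, the second one missing from the vocabulary (made-up numbers)
tokens = [[0.4, 0.2, -0.6], [0.0, 0.0, 0.0]]
print(mean_of_nonzero(tokens))   # the first vector, unchanged
print(np.mean(tokens, axis=0))   # plain mean, pulled towards zero
```

Cosine similarity ignores magnitude, so the plain mean only changes the result when at least two tokens have vectors; with a single known token both versions point in the same direction.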

I know this post is quite lengthy, so if you have any questions, post them below.
