<h2>解决方案概述</h2>
<p>Well, I would approach this problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: predict the label which, in your binary case, more than 50% of the classifiers agree on).</p>
<p><strong>I am considering the following approaches:</strong></p>
<ul>
<li><strong>Active learning</strong> (example approach provided by me below)</li>
<li><a href="https://stackoverflow.com/a/54757134/10886420"><strong>MediaWiki backlinks</strong></a> provided as an answer</li>
<li><strong>SPARQL</strong> ancestor categories provided by <a href="https://stackoverflow.com/users/7879193/stanislav-kralin">@Stanislav Kralin</a> in a comment to your question, and/or <a href="https://stackoverflow.com/a/54781366/10886420">parent categories</a> (those two could form an ensemble of their own based on their differences, but for that you would have to contact both creators and compare their results).</li>
</ul>
<p>This way, 2 out of 3 would have to agree that a given concept is a medical one, which minimizes the chance of an error even further.</p>
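<p>The majority voting described above can be sketched in a few lines; the True/False predictions below are made up purely for illustration:</p>
<pre><code>import numpy as np

# Hypothetical True/False outputs of the three approaches for five concepts
predictions = np.array([
    [True, True, False, True, False],   # active learning
    [True, False, False, True, False],  # backlinks
    [False, True, False, True, True],   # SPARQL categories
])

# A concept counts as medical when more than 50% of the classifiers agree
majority = predictions.sum(axis=0) > predictions.shape[0] / 2
print(majority)  # [ True  True False  True False]
</code></pre>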
<p>While we're at it, I would argue <strong>against</strong> the approach proposed by <a href="https://stackoverflow.com/users/10953776/anand-v-singh">@anand_v.singh</a> in <a href="https://stackoverflow.com/a/54721431/10886420">this answer</a>, because:</p>
<ul>
<li>the distance metric should <strong>not</strong> be Euclidean; cosine similarity is a much better metric (used by, e.g., <a href="https://spacy.io/" rel="noreferrer">spaCy</a>), as it does not take the magnitude of the vectors into account (and it shouldn't, as that's how word2vec and GloVe were trained)</li>
<li>if I understood correctly, many artificial clusters would be created, while we only need two: medicine and non-medicine. Furthermore, the centroid of medicine is not centered on medicine itself. This poses additional problems: say, the centroid is moved far away from medicine, and other words like, say, <code>computer</code> (or any other word which in your opinion doesn't fit into medicine) might get into the cluster.</li>
<li>it's hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE/similar for so many words would give us totally non-sensible results [yes, I have tried it; PCA gets around 5% explained variance for your longer dataset, really, really low])</li>
</ul>
<p>Based on the problems highlighted above, I have come up with a solution using <a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)" rel="noreferrer">active learning</a>, which is a rather forgotten approach to such problems.</p>
<h2>Active learning approach</h2>
<p>In this subset of machine learning, when we have a hard time coming up with an exact algorithm (like what it means for a term to be part of the <code>medical</code> category), we ask a human "expert" (who doesn't actually have to be an expert) to provide some answers.</p>
<h2>Knowledge encoding</h2>
<p>As <a href="https://stackoverflow.com/users/10953776/anand-v-singh">anand_v.singh</a> pointed out, word vectors are one of the most promising approaches, and I will use them here as well (though differently, and IMO in a far cleaner and easier fashion).</p>
<p>I'm not going to repeat his points in my answer, so I'll add my two cents:</p>
<ul>
<li>do <strong>not</strong> use contextualized word embeddings as the currently available state of the art (e.g. <a href="https://arxiv.org/pdf/1810.04805.pdf" rel="noreferrer">BERT</a>)</li>
<li>check how many of your concepts have <strong>no representation</strong> (e.g. are represented as a vector of zeros). It should be checked (and is checked in my code; there will be further discussion when the time comes), and you may want to use the embedding which has most of them present.</li>
</ul>
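<p>That check can be sketched as follows (with a toy lookup table standing in for a real embedding; the words and vectors are made up):</p>
<pre><code>import numpy as np

# Toy embedding table; words missing from it map to zero vectors
embeddings = {
    "medicine": np.array([0.9, 0.1]),
    "disease": np.array([0.8, 0.3]),
}

def vector_of(word, dim=2):
    return embeddings.get(word, np.zeros(dim))

concepts = ["medicine", "disease", "qwerty123"]
missing = [c for c in concepts if not vector_of(c).any()]
print(f"{len(missing)}/{len(concepts)} concepts have no representation")
</code></pre>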
<h3>Using <em>spaCy</em></h3>
<p>This class measures the similarity between <code>medicine</code> (encoded as one of spaCy's GloVe word vectors) and every other concept.</p>
<pre><code>class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid
        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size
        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)
        return np.array(concepts_similarity)
</code></pre>
<p>This code returns a number for each concept measuring how similar it is to the centroid. Furthermore, it records the indices of concepts missing their representation. It might be called like this:</p>
<pre><code>import json

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")
centroid = nlp("medicine")

concepts = json.load(open("new_concepts.json"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)
</code></pre>
<p>You can substitute your data for <code>new_concepts.json</code>.</p>
<p>Take a look at <a href="https://spacy.io/usage/models" rel="noreferrer">spacy.load</a> and notice I have used <a href="https://spacy.io/models/en#en_vectors_web_lg" rel="noreferrer"><code>en_vectors_web_lg</code></a>. It consists of <strong>685,000 unique word vectors</strong> (which is a lot) and may work out of the box for your case. It has to be downloaded separately after installing spaCy; more info is provided in the links above.</p>
<p><strong>Additionally,</strong> you may want to use multiple centroid words, e.g. add words like <code>disease</code>, and average their word vectors. I'm not sure whether that would affect your case positively, though.</p>
<p><strong>Another possibility</strong> would be to use multiple centroids and calculate the similarity between each concept and each of the centroids. We may have a few thresholds in such a case; this is likely to remove some <a href="https://en.wikipedia.org/wiki/False_positives_and_false_negatives" rel="noreferrer">false positives</a>, but may miss some terms which one could consider similar to <code>medicine</code>. Furthermore, it would complicate the matter much more; if your results are unsatisfactory, you should consider the two options above (and only then, not without prior thought, jump into this approach).</p>
<p>Now we have a rough measure of concept similarity. But <strong>what does it mean that a certain concept has 0.1 positive similarity to medicine? Is it a concept one should classify as medical? Or maybe it's too far away already?</strong></p>
<h2>Asking the expert</h2>
<p>To get a threshold (below it, terms will be considered non-medical), the easiest way is to ask a human to classify some of the concepts for us (and that's what active learning is about). Yeah, I know it's a really simple form of active learning, but I would consider it as such anyway.</p>
<p>I have written a class with an <code>sklearn-like</code> interface asking the human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.</p>
<pre><code>class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1
</code></pre>
<ul>
<li>the <code>samples</code> argument describes how many examples will be shown to the expert during each iteration (it is a maximum; fewer will be returned if samples were already asked about or there are not enough of them to show)</li>
<li><code>step</code> represents the drop of the threshold (we start at 1, meaning perfect similarity) during each iteration</li>
<li><code>change_multiplier</code> - if the expert answers that the concepts are not related (or mostly not related, as multiple of them are returned), the step is multiplied by this floating-point number. It is used to pinpoint the exact threshold between <code>step</code> changes at each iteration</li>
<li>concepts are sorted based on their similarity (the more similar a concept is, the higher it ranks)</li>
</ul>
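<p>To see how <code>step</code> and <code>change_multiplier</code> interact, here is a minimal simulation of the threshold updates (the sequence of answers is made up, and the stopping conditions are ignored):</p>
<pre><code># Start at perfect similarity; lower on "yes", shrink the step and raise on "no"
threshold, step, change_multiplier = 1.0, 0.05, 0.7

for answer in ["y", "y", "n", "y"]:
    if answer == "y":
        threshold -= step
    else:
        step *= change_multiplier
        threshold += step
    print(round(threshold, 4))  # 0.95, 0.9, 0.935, 0.9
</code></pre>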
<p>The function below asks the expert for an opinion and finds the optimal threshold based on their answers.</p>
<pre><code>def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)

    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"
</code></pre>
<p>An example question looks like this:</p>
<pre><code>Are those concepts related to medicine?
0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system
[y]es / [n]o / [any]quit y
</code></pre>
<p>... and parsing the expert's answer:</p>
<pre><code># True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as current threshold is not related to medicine already
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step > self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False
</code></pre>
<p>Finally, the whole code of <code>ActiveLearner</code>, which finds the optimal threshold of similarity according to the expert:</p>
<pre><code>class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)

        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step > self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self
</code></pre>
<p>All in all, you'd have to answer some questions manually, but this approach is way more accurate in my opinion.</p>
<p>Furthermore, you don't have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical term (should 40 medical samples and 10 non-medical samples shown still be considered medical?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 out of 50 samples is non-medical), I would consider the threshold to still be valid.</p>
<p><strong>Once again:</strong> this approach should be mixed with the others in order to minimize the chance of wrong classification.</p>
<h2>Classifier</h2>
<p>When we obtain the threshold from the expert, classification is instantaneous. Here is a simple class for it:</p>
<pre><code>class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions
</code></pre>
<p>And for brevity, here is the final source code:</p>
<pre><code>import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid
        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size
        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)
        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)

        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step > self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")
    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )
</code></pre>
<p>After answering some questions, with a threshold of 0.1 (everything in <code>[-1, 0.1)</code> is considered non-medical, while <code>[0.1, 1]</code> is considered medical), I got the following results:</p>
<pre><code>kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True
</code></pre>
<p>As you can see, this approach is far from perfect, so the last section describes possible improvements:</p>
<h2>Possible improvements</h2>
<p>As mentioned at the beginning, using my approach mixed with the other answers would probably leave out ideas like <code>sport shoe</code> belonging to <code>medicine</code>, and the active learning approach would act more like a decisive vote in case of a draw between the two heuristics mentioned above.</p>
<p>We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use multiple of them (either increasing or decreasing), say <code>0.1, 0.2, 0.3, 0.4, 0.5</code>.</p>
<p>Say <code>sport shoe</code> gets, for each threshold, a respective <code>True/False</code> answer like this:</p>
<p><code>True True False False False</code></p>
<p>Making a majority vote, we would mark it as non-medical by 3 out of 5 votes. Notice also that a threshold that is too strict would be mitigated as well if the thresholds below it out-vote it (the case when <code>True/False</code> would look like this: <code>True True True False False</code>).</p>
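<p>Such a threshold ensemble can be sketched in a few lines (the similarity value of 0.25 for <code>sport shoe</code> is made up for illustration):</p>
<pre><code>import numpy as np

thresholds = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

def ensemble_vote(similarity, thresholds):
    # One True/False verdict per threshold, then a majority vote over them
    verdicts = similarity > thresholds
    return bool(verdicts.sum() > len(thresholds) / 2)

print(ensemble_vote(0.25, thresholds))  # False -> non-medical
print(ensemble_vote(0.35, thresholds))  # True -> medical
</code></pre>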
<p><strong>The last possible improvement I came up with:</strong> in the code above I use the <code>Doc</code> vector, which is a mean of the word vectors creating the concept. Say one word is missing (a vector consisting of zeros); in that case, it would be pushed further away from the <code>medicine</code> centroid. You may not want that (as some niche medical terms [abbreviations like <code>gpv</code> or others] might be missing their representation); in such a case you could average only those vectors different from zero.</p>
<p>I know this post is quite lengthy, so if you have any questions, post them below.</p>
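<p>That last idea could look roughly like this (a sketch only, not spaCy's actual <code>Doc.vector</code> implementation; the toy 2-dimensional vectors are made up):</p>
<pre><code>import numpy as np

def mean_of_present_vectors(word_vectors):
    """Average only the word vectors that are not all-zero,
    i.e. skip words missing from the embedding table."""
    word_vectors = np.asarray(word_vectors, dtype=float)
    present = ~np.all(word_vectors == 0, axis=1)
    if not present.any():
        return np.zeros(word_vectors.shape[1])
    return word_vectors[present].mean(axis=0)

# Two known words and one missing word (zero vector)
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(mean_of_present_vectors(vectors))  # [0.5 0.5]
</code></pre>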