<h2>解决方案概述</h2>
<p>Well, I would approach this problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of those approaches (majority voting: predict the label which, in your binary case, more than 50% of the classifiers agree on).</p>
<p><strong>I am considering the following approaches:</strong></p>
<ul>
<li><strong>Active learning</strong> (example approach provided by me below)</li>
<li><a href="https://stackoverflow.com/a/54757134/10886420"><strong>MediaWiki backlinks</strong></a> provided as an answer</li>
<li><strong>SPARQL</strong> ancestor categories provided by <a href="https://stackoverflow.com/users/7879193/stanislav-kralin">@Stanislav Kralin</a> in a comment to your question, and/or <a href="https://stackoverflow.com/a/54781366/10886420">parent categories</a> (those two could form an ensemble of their own based on their differences, but for that you would have to contact both creators and compare their results).</li>
</ul>
<p>This way, 2 out of 3 would have to agree that a given concept is a medical one, which minimizes the chance of an error even further.</p>
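<p>The majority voting described above can be sketched in a few lines; the True/False predictions below are made up purely for illustration:</p>
<pre><code>import numpy as np

# Hypothetical True/False outputs of the three approaches for five concepts
predictions = np.array([
    [True, True, False, True, False],   # active learning
    [True, False, False, True, False],  # backlinks
    [False, True, False, True, True],   # SPARQL categories
])

# A concept counts as medical when more than 50% of the classifiers agree
majority = predictions.sum(axis=0) > predictions.shape[0] / 2
print(majority)  # [ True  True False  True False]
</code></pre>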
<p>While we're at it, I would argue <strong>against</strong> the approach proposed by <a href="https://stackoverflow.com/users/10953776/anand-v-singh">@anand_v.singh</a> in <a href="https://stackoverflow.com/a/54721431/10886420">this answer</a>, because:</p>
<ul>
<li>the distance metric should <strong>not</strong> be Euclidean; cosine similarity is a much better metric (used by, e.g., <a href="https://spacy.io/" rel="noreferrer">spaCy</a>), as it does not take the magnitude of the vectors into account (and it shouldn't, as that's how word2vec and GloVe were trained)</li>
<li>if I understood correctly, many artificial clusters would be created, while we only need two: medicine and non-medicine. Furthermore, the centroid of medicine is not centered on medicine itself. This poses additional problems: say, the centroid is moved far away from medicine, and other words like, say, <code>computer</code> (or any other word which in your opinion doesn't fit into medicine) might get into the cluster.</li>
<li>it's hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE/similar for so many words would give us totally non-sensible results [yes, I have tried it; PCA gets around 5% explained variance for your longer dataset, really, really low])</li>
</ul>
<p>Based on the problems highlighted above, I have come up with a solution using <a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)" rel="noreferrer">active learning</a>, which is a rather forgotten approach to such problems.</p>
<h2>Active learning approach</h2>
<p>In this subset of machine learning, when we have a hard time coming up with an exact algorithm (like what it means for a term to be part of the <code>medical</code> category), we ask a human "expert" (who doesn't actually have to be an expert) to provide some answers.</p>
<h2>Knowledge encoding</h2>
<p>As <a href="https://stackoverflow.com/users/10953776/anand-v-singh">anand_v.singh</a> pointed out, word vectors are one of the most promising approaches, and I will use them here as well (though differently, and IMO in a far cleaner and easier fashion).</p>
<p>I'm not going to repeat his points in my answer, so I'll add my two cents:</p>
<ul>
<li>do <strong>not</strong> use contextualized word embeddings as the currently available state of the art (e.g. <a href="https://arxiv.org/pdf/1810.04805.pdf" rel="noreferrer">BERT</a>)</li>
<li>check how many of your concepts have <strong>no representation</strong> (e.g. are represented as a vector of zeros). It should be checked (and is checked in my code; there will be further discussion when the time comes), and you may want to use the embedding which has most of them present.</li>
</ul>
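<p>That check can be sketched as follows (with a toy lookup table standing in for a real embedding; the words and vectors are made up):</p>
<pre><code>import numpy as np

# Toy embedding table; words missing from it map to zero vectors
embeddings = {
    "medicine": np.array([0.9, 0.1]),
    "disease": np.array([0.8, 0.3]),
}

def vector_of(word, dim=2):
    return embeddings.get(word, np.zeros(dim))

concepts = ["medicine", "disease", "qwerty123"]
missing = [c for c in concepts if not vector_of(c).any()]
print(f"{len(missing)}/{len(concepts)} concepts have no representation")
</code></pre>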
<h3>Using <em>spaCy</em></h3>
<p>This class measures the similarity between <code>medicine</code> (encoded as one of spaCy's GloVe word vectors) and every other concept.</p>
<pre><code>class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid
        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size
        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)
        return np.array(concepts_similarity)
</code></pre>
<p>This code returns a number for each concept measuring how similar it is to the centroid. Furthermore, it records the indices of concepts missing their representation. It might be called like this:</p>
<pre><code>import json

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")
centroid = nlp("medicine")

concepts = json.load(open("new_concepts.json"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)
</code></pre>
<p>You can substitute your data for <code>new_concepts.json</code>.</p>
<p>Take a look at <a href="https://spacy.io/usage/models" rel="noreferrer">spacy.load</a> and notice I have used <a href="https://spacy.io/models/en#en_vectors_web_lg" rel="noreferrer"><code>en_vectors_web_lg</code></a>. It consists of <strong>685,000 unique word vectors</strong> (which is a lot) and may work out of the box for your case. It has to be downloaded separately after installing spaCy; more info is provided in the links above.</p>
<p><strong>Additionally,</strong> you may want to use multiple centroid words, e.g. add words like <code>disease</code>, and average their word vectors. I'm not sure whether that would affect your case positively, though.</p>
<p><strong>Another possibility</strong> would be to use multiple centroids and calculate the similarity between each concept and each of the centroids. We may have a few thresholds in such a case; this is likely to remove some <a href="https://en.wikipedia.org/wiki/False_positives_and_false_negatives" rel="noreferrer">false positives</a>, but may miss some terms which one could consider similar to <code>medicine</code>. Furthermore, it would complicate the matter much more; if your results are unsatisfactory, you should consider the two options above (and only then, not without prior thought, jump into this approach).</p>
<p>Now we have a rough measure of concept similarity. But <strong>what does it mean that a certain concept has 0.1 positive similarity to medicine? Is it a concept one should classify as medical? Or maybe it's too far away already?</strong></p>
<h2>Asking the expert</h2>
<p>To get a threshold (below it, terms will be considered non-medical), the easiest way is to ask a human to classify some of the concepts for us (and that's what active learning is about). Yeah, I know it's a really simple form of active learning, but I would consider it as such anyway.</p>
<p>I have written a class with an <code>sklearn-like</code> interface asking the human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.</p>
<pre><code>class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1
</code></pre>
<ul>
<li>the <code>samples</code> argument describes how many examples will be shown to the expert during each iteration (it is a maximum; fewer will be returned if samples were already asked about or there are not enough of them to show)</li>
<li><code>step</code> represents the drop of the threshold (we start at 1, meaning perfect similarity) during each iteration</li>
<li><code>change_multiplier</code> - if the expert answers that the concepts are not related (or mostly not related, as multiple of them are returned), the step is multiplied by this floating-point number. It is used to pinpoint the exact threshold between <code>step</code> changes at each iteration</li>
<li>concepts are sorted based on their similarity (the more similar a concept is, the higher it ranks)</li>
</ul>
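<p>To see how <code>step</code> and <code>change_multiplier</code> interact, here is a minimal simulation of the threshold updates (the sequence of answers is made up, and the stopping conditions are ignored):</p>
<pre><code># Start at perfect similarity; lower on "yes", shrink the step and raise on "no"
threshold, step, change_multiplier = 1.0, 0.05, 0.7

for answer in ["y", "y", "n", "y"]:
    if answer == "y":
        threshold -= step
    else:
        step *= change_multiplier
        threshold += step
    print(round(threshold, 4))  # 0.95, 0.9, 0.935, 0.9
</code></pre>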
<p>The function below asks the expert for an opinion and finds the optimal threshold based on their answers.</p>
<pre><code>def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)

    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"
</code></pre>
<p>An example question looks like this:</p>
<pre><code>Are those concepts related to medicine?
0. anesthetic drug
1. child and adolescent psychiatry
2. tertiary care center
3. sex therapy
4. drug design
5. pain disorder
6. psychiatric rehabilitation
7. combined oral contraceptive
8. family practitioner committee
9. cancer family syndrome
10. social psychology
11. drug sale
12. blood system
[y]es / [n]o / [any]quit y
</code></pre>
<p>... and parsing the expert's answer:</p>
<pre><code># True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as current threshold is not related to medicine already
        self._min_threshold = self.threshold_
        # Shrink the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step > self._max_threshold:
            return False
        # Raise the threshold
        self.threshold_ += self.step
        return True
    return False
</code></pre>
<p>Finally, the whole code of <code>ActiveLearner</code>, which finds the optimal threshold of similarity according to the expert:</p>
<pre><code>class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)

        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step > self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self
</code></pre>
<p>All in all, you'd have to answer some questions manually, but this approach is way more accurate in my opinion.</p>
<p>Furthermore, you don't have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical term (should 40 medical samples and 10 non-medical samples shown still be considered medical?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 out of 50 samples is non-medical), I would consider the threshold to still be valid.</p>
<p><strong>Once again:</strong> this approach should be mixed with the others in order to minimize the chance of wrong classification.</p>
<h2>Classifier</h2>
<p>When we obtain the threshold from the expert, classification is instantaneous. Here is a simple class for it:</p>
<pre><code>class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions
</code></pre>
<p>And for brevity, here is the final source code:</p>
<pre><code>import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid
        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size
        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)
        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)

        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as current threshold is not related to medicine already
            self._min_threshold = self.threshold_
            # Shrink the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step > self._max_threshold:
                return False
            # Raise the threshold
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")
    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )
</code></pre>
<p>After answering some questions, with a threshold of 0.1 (everything in <code>[-1, 0.1)</code> is considered non-medical, while <code>[0.1, 1]</code> is considered medical), I got the following results:</p>
<pre><code>kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True
</code></pre>
<p>As you can see, this approach is far from perfect, so the last section describes possible improvements:</p>
<h2>Possible improvements</h2>
<p>As mentioned at the beginning, using my approach mixed with the other answers would probably leave out ideas like <code>sport shoe</code> belonging to <code>medicine</code>, and the active learning approach would act more like a decisive vote in case of a draw between the two heuristics mentioned above.</p>
<p>We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use multiple of them (either increasing or decreasing), say <code>0.1, 0.2, 0.3, 0.4, 0.5</code>.</p>
<p>Say <code>sport shoe</code> gets, for each threshold, a respective <code>True/False</code> answer like this:</p>
<p><code>True True False False False</code></p>
<p>Making a majority vote, we would mark it as non-medical by 3 out of 5 votes. Notice also that a threshold that is too strict would be mitigated as well if the thresholds below it out-vote it (the case when <code>True/False</code> would look like this: <code>True True True False False</code>).</p>
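<p>Such a threshold ensemble can be sketched in a few lines (the similarity value of 0.25 for <code>sport shoe</code> is made up for illustration):</p>
<pre><code>import numpy as np

thresholds = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

def ensemble_vote(similarity, thresholds):
    # One True/False verdict per threshold, then a majority vote over them
    verdicts = similarity > thresholds
    return bool(verdicts.sum() > len(thresholds) / 2)

print(ensemble_vote(0.25, thresholds))  # False -> non-medical
print(ensemble_vote(0.35, thresholds))  # True -> medical
</code></pre>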
<p><strong>The last possible improvement I came up with:</strong> in the code above I use the <code>Doc</code> vector, which is a mean of the word vectors creating the concept. Say one word is missing (a vector consisting of zeros); in that case, it would be pushed further away from the <code>medicine</code> centroid. You may not want that (as some niche medical terms [abbreviations like <code>gpv</code> or others] might be missing their representation); in such a case you could average only those vectors different from zero.</p>
<p>I know this post is quite lengthy, so if you have any questions, post them below.</p>
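<p>That last idea could look roughly like this (a sketch only, not spaCy's actual <code>Doc.vector</code> implementation; the toy 2-dimensional vectors are made up):</p>
<pre><code>import numpy as np

def mean_of_present_vectors(word_vectors):
    """Average only the word vectors that are not all-zero,
    i.e. skip words missing from the embedding table."""
    word_vectors = np.asarray(word_vectors, dtype=float)
    present = ~np.all(word_vectors == 0, axis=1)
    if not present.any():
        return np.zeros(word_vectors.shape[1])
    return word_vectors[present].mean(axis=0)

# Two known words and one missing word (zero vector)
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(mean_of_present_vectors(vectors))  # [0.5 0.5]
</code></pre>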