Concept annotation tool for Electronic Health Records
Medical Concept Annotation Tool
A simple tool for concept annotation from UMLS or any other source.
Demo
A demo application is available at MedCAT. Please note that it was trained on MedMentions and contains only a small portion of UMLS (<1%).
Install using pip
- Install MedCAT
pip install --upgrade medcat
- Install the scispacy model
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.0/en_core_sci_md-0.2.0.tar.gz
- Download the Vocabulary and CDB from the Models section below
How to use:
from medcat.cat import CAT
from medcat.utils.vocab import Vocab
from medcat.cdb import CDB

vocab = Vocab()
# Load the vocab model you downloaded
vocab.load_dict('<path to the vocab file>')

# Load the cdb model you downloaded
cdb = CDB()
cdb.load_dict('<path to the cdb file>')

# Create cat
cat = CAT(cdb=cdb, vocab=vocab)
cat.train = False

# Test it
doc = "My simple document with kidney failure"
doc_spacy = cat(doc)
# Entities are in
doc_spacy._.ents
# Or to get a JSON
doc_json = cat.get_json(doc)

# To have a look at the results:
from spacy import displacy
# Note that this will not show all entities, but only the longest ones
displacy.serve(doc_spacy, style='ent')

# To train (unsupervised), set the train flag to True and run
# documents through MedCAT
cat.train = True

# To run cat on a large number of documents; this will
# also run training as the flag is set to True.
# data holds one (doc_id, text) tuple per document
data = [('<doc_id>', '<text>'), ('<doc_id>', '<text>')]
docs = cat.multi_processing(data)

# To explicitly run training, pass an open file of medical text.
# If you want to fine tune, set fine_tune to True; old training will be preserved
with open("<some file with a lot of medical text>", 'r') as f:
    cat.run_training(f, fine_tune=True)
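The `data` argument passed to `cat.multi_processing` above is a list of `(doc_id, text)` tuples. One way to build such a list from a folder of plain-text files is sketched below. The helper name `build_data` and the choice of the file name (without extension) as the document id are illustrative assumptions, not part of MedCAT.

```python
from pathlib import Path


def build_data(folder, pattern="*.txt"):
    """Return a list of (doc_id, text) tuples, one per matching file.

    Hypothetical helper, not part of MedCAT: the file name without its
    extension is used as the document id.
    """
    data = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip files that are empty or whitespace only
            data.append((path.stem, text))
    return data


# Possible usage together with the snippet above:
# data = build_data("<folder with .txt files>")
# docs = cat.multi_processing(data)
```

Sorting the paths keeps the document order stable between runs, which makes the returned results easier to match back to their source files.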
Building a new Concept Database
from medcat.cat import CAT
from medcat.utils.vocab import Vocab
from medcat.cdb import CDB
# PrepareCDB is used below but was not imported in the original snippet;
# medcat.prepare_cdb is the assumed module path for this release
from medcat.prepare_cdb import PrepareCDB

vocab = Vocab()
# Load the vocab model you downloaded
vocab.load_dict('<path to the vocab file>')

# If you have an existing CDB
cdb = CDB()
cdb.load_dict('<path to the cdb file>')

# You can now add concepts from a CSV file; examples of the files can be found in ./examples
preparator = PrepareCDB(vocab=vocab)
csv_paths = ['<path to your csv_file>', '<another one>', ...]
# e.g.
csv_paths = ['./examples/simple_cdb.csv']
cdb = preparator.prepare_csvs(csv_paths)

# Save the new CDB for later
cdb.save_dict("<path to a file where it will be saved>")
# Done
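The CSV files passed to `prepare_csvs` list the concepts to add. As a sketch only, the snippet below writes such a file with the standard `csv` module. The column names `cui` and `str`, the helper name `write_concept_csv`, and the identifiers in the usage comment are assumptions made for illustration; the files in ./examples show the layout MedCAT actually expects.

```python
import csv


def write_concept_csv(path, concepts):
    """Write (cui, name) pairs to a CSV file with a header row.

    Hypothetical helper: the 'cui' and 'str' column names are assumed,
    so compare with ./examples/simple_cdb.csv before passing the output
    to PrepareCDB.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["cui", "str"])
        for cui, name in concepts:
            writer.writerow([cui, name])


# Possible usage (made-up identifiers, for illustration only):
# write_concept_csv("my_concepts.csv", [
#     ("C0000001", "kidney failure"),
#     ("C0000001", "renal failure"),  # a second name for the same concept
# ])
```

Giving one concept several rows, one per name, is a simple way to record synonyms in a flat file.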
Requirements if building from source
python >= 3.5
All other dependencies can be installed with pip from the requirements.txt file, by running:
pip install -r requirements.txt
Results
Dataset | SoftF1 | Description |
---|---|---|
MedMentions | 0.84 | The whole MedMentions dataset without any modifications or supervised training |
MedMentions | 0.828 | MedMentions only for concepts that require disambiguation, or names that map to multiple CUIs |
MedMentions | 0.97 | MedMentions filtered by TUI to only concepts that are a disease |
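Models
A basic trained model is made public for the vocabulary and CDB. It is trained on the ~35K concepts available in MedMentions. It is quite limited, so the performance might not be the best.
Vocabulary Download - built from MedMentions
CDB Download - built from MedMentions
(Note: this was compiled from MedMentions and does not include any data from the NLM, as that data is not publicly available.)
Acknowledgement
Entity extraction was trained on MedMentions, which has ~35K entities from UMLS in total.
The dictionary was compiled from Wiktionary and has ~800K unique words in total. For now it is NOT publicly available.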