Text clustering with sklearn KMeans

Posted 2024-10-04 11:21:50

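I am running text clustering on a table of medical terms. I want to cluster strings that contain similar words: two strings should end up in the same cluster only if they share two or more words, not just one.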

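I have tried many techniques without getting anything usable. I first used Levenshtein distance with both k-means and agglomerative clustering (three linkage methods: ward, complete, and average). The results were poor, because this metric groups words that merely share some letters, such as "dog" and "door".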

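I then changed the distance metric: I computed TF-IDF vectors, ran cosine similarity, and converted similarity to distance by subtracting each value from 1 (distance = 1 - similarity). I did this because the method from the wiki, 2 * acos(similarity), returned NaN values.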
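One plausible cause of the NaN values (an assumption, since the original data is not shown) is floating-point rounding: a cosine similarity that should be exactly 1 can come out marginally above 1, and the inverse cosine is undefined outside [-1, 1]. A minimal NumPy sketch, with an illustrative similarity matrix, that clips before taking the inverse cosine:

```python
import numpy as np

# Illustrative similarity matrix; the top-left entry has been pushed
# just above 1.0 by rounding (1 + 2**-52).
similarity = np.array([[1.0000000000000002, 0.5],
                       [0.5, 1.0]])

# Without clipping, arccos returns NaN for any value outside [-1, 1].
with np.errstate(invalid="ignore"):
    raw = np.arccos(similarity)

# Clip into the valid domain first, then convert to an angular distance.
# For non-negative vectors (term counts, TF-IDF) this lies in [0, 1].
clipped = np.clip(similarity, -1.0, 1.0)
angular = 2.0 * np.arccos(clipped) / np.pi
```

Here `raw[0, 0]` is NaN while `angular` contains only finite values.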

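With this distance metric I tried both algorithms again. Overall the clusters are good, except for one huge cluster whose members share no words with one another. No matter how I change the number of clusters, this huge cluster still appears, even when I choose a large k (close to n, the length of the input). It usually shows up near the start, as cluster 0, 1, 2, or 3. Why does this happen? What am I doing wrong? My dataset has more than 5,000 entries. Here is part of the cluster output: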

 cluster no 0:['Prolonged INR', 'Prolonged PTT', 'Prolonged QT Interval']
 cluster no 1:['GI bleeding', 'Gastrointestinal (GI) Bleeding', 'Lower GI bleeding']
 cluster no 2:['ACS', 'Acetazolamide', 'Achondroplasia', 'Acrocyanosis', 'Acromegaly', 'Adenoidectomy', 'Adenomyosis', 'Afebrile', 'Antihistamine', 'Apheresis', 'Aplasia', 'Argatroban', 'Arthralgia', 'Arthrocentesis', 'Arthrography', 'Arthroplasty', 'Asbestosis', 'Ascorbate', 'Asian', 'Asterixis', 'Astigmatism', 'Astrocytoma', 'Asymptomatic', 'Atelectasis', 'Atherosclerosis', 'Atropine', 'Audiogram', 'Autonomic Dysreflexia', 'Autopsy', 'Bacteremia', 'Balanitis', 'Balanoposthitis', 'Breastfeeding', 'Breech Presentation', 'Bronchiectasis', 'Bronchiolitis', 'Bronchospasm', 'Cachexia', 'Caf� Au Lait Spot', 'Calcaneovalgus', 'Chalazion', 'Chemistry Panels', 'Chills', 'Cholelithiasis', 'Cholera', 'Chondroblastoma', 'Chondrosarcoma', 'Chorioamnionitis', 'Chorionic Villus Sampling (CVS)', 'Choroid Plexus Papilloma (CPP)', 'Circumcision', 'Citrate', 'Claudication', 'Clonus', 'Coccidioidomycosis', 'Coccygodynia', 'Costochondritis', 'Craniectomy', 'Craniofacial Anomalies', 'Craniopharyngioma', 'Craniosynostosis', 'Craniotomy', 'Cri du Chat', 'Croup', 'Cryofibrinogen', 'Cryoglobulin', 'Cyclophosphamide', 'Cystometry', 'D-Dimer', 'Dacryocystitis', 'Dacryocystorhinostomy (DCR)', 'Dacryostenosis', 'Dantrolene', 'Deformational Plagiocephaly', 'Delusions', 'Demeclocycline', 'Dentures', 'Dermabrasion', 'Deviated Septum', 'Electrolytes', 'Electronystagmography (ENG)', 'Embolectomy', 'Emmetropia', 'Empyema', 'Enchondroma', 'Encopresis', 'Enterovirus', 'Ependymoma', 'Epididymitis', 'Epirubicin', 'Episiotomy', 'Epispadias', 'Eribulin', 'Erythroderma', 'Esophagectomy', 'Essential Tremor', 'Foraminotomy', 'Frostnip/Frostbite', 'Gallstones', 'Gastritis', 'Gastrojejunostomy', 'Gastroschisis', 'Giardiasis', 'Gingivitis', 'Gingivostomatitis', 'Glaucoma', 'Gliomas', 'Glomerulonephritis', 'Glomerulosclerosis', 'Group B Streptococcus', 'Herpangina', 'Hiccups', 'Hidradenitis Suppurativa', 'Hirsutism', 'Hookworm', 'Hordeolum (Stye)', 'Hydatidiform Mole', 'Hydration', 'Hydrocelectomy', 'Hydrops 
Fetalis', 'Hyperbilirubinemia', 'Hyperlipidemia', 'Hyperopia', 'Hyperphosphatemia', 'Hyperreflexia', 'Hypnosis', 'Hypoparathyroidism', 'Hypopituitarism', 'Hypovolemia', 'Hypoxia', 'Hysterosalpingogram (HSG)', 'Hysteroscopy', 'Intussusception', 'Irritability', 'Isoproterenol', 'Ixabepilone', 'Jewish', 'Karyotype', 'Keratoconus', 'Ketonemia', 'Ketonuria', 'Kyphoplasty', 'Kyphosis', 'Labyrinthitis', 'Lactulose', 'Laminectomy', 'Laminotomy', 'Lapatinib', 'Laryngectomy', 'Laryngitis', 'Laryngomalacia', 'Laryngoscopy', 'Laxative', 'Lymphadenitis', 'Lymphangitis', 'Lymphocele', 'Malaise', 'Malaria', 'Malocclusion', 'Mammography', 'Mannitol', 'Mastalgia', 'Mastectomy', 'Mastitis', 'Mastoidectomy', 'Mastopexy', 'Mediastinoscopy', 'Megaureter', 'Melena', 'Meningioma', 'Menopause', 'Menorrhagia', 'Menstruation', 'Metatarsalgia', 'Metatarsus Adductus', 'Metoclopramide', 'Neomycin', 'Nephrectomy', 'Nephrolithiasis', 'Neuromyelitis Optica', 'Neurosonography', 'Neurosurgery', 'Nocturnal Enuresis', 'Norovirus', 'Pericardectomy', 'Perimenopause', 'Periventricular Leukomalacia', 'Pertuzumab', 'Phimosis', 'Phobia', 'Photorefractive Keratectomy (PRK)', 'Phytophotodermatitis', 'Pilomatrixoma', 'Pinworms', 'Pityriasis Rosea', 'Plain radiograph', 'Platelets', 'Pleurisy', 'Pneumococcus', 'Pneumoconiosis', 'Pneumonectomy', 'Psychosis', 'Pterygium', 'Ptosis', 'Pulpitis (Toothache)', 'Pyeloplasty', 'Quantitative Immunoglobulins', 'Rabies', 'Rales', 'Red wale marks', 'Refractive Error', 'Smallpox', 'Smoking Cessation', 'Snoring', 'Sonohysterography', 'Spasmodic Dysphonia', 'Spina Bifida', 'Terlipressin', 'Tetany', 'Thoracotomy', 'Thrombocythemia', 'Thrombophilia', 'Thrombophlebitis', 'Thyroidectomy', 'Tinnitus', 'Tonsillar enlargement', 'Torn Annulus', 'Toxoplasmosis', 'Trabeculectomy', 'Ureterolysis', 'Ureteroplasty', 'Ureterosigmoidostomy', 'Urethritis', 'Urethroplasty', 'Uroflowmetry', 'Urostomy', 'Urticaria (Hives)', 'Uvulitis', 'Uvulopalatopharyngoplasty (UPPP)', 'Valsalva Maneuver', 
'Varicella (Chickenpox)', 'Vasculitis', 'Vasopressin', 'Vasopressor', 'Venography', 'Ventriculostomy', 'Vertebroplasty', 'Vesicoureteral Reflux (VUR)', 'Osteochondritis Dissecans (OCD)', 'Osteochondroma', 'Osteogenesis Imperfecta (OI)', 'Osteopenia', 'Osteophyte formation', 'Osteosarcoma', 'Overuse Injuries', 'Overweight', 'Pallister Killian', 'Pallor', 'Palpitation', 'Palpitations', 'Paraesthesia', 'Paranoia', 'Paraphimosis', 'Parasomnias', 'Parathyroidectomy', 'Paronychia', 'Parotidectomy', 'Peaked T waves', 'Pemphigus Vulgaris', 'Lepirudin', 'Lethargy', 'Letrozole', 'Lichen Planus', 'Liposarcoma', 'Listeriosis', 'Living will', 'Lordosis', 'Excessive urination', 'Exemestane', 'Exploratory Laparotomy', 'Facelift (Rhytidectomy)', 'Fainting', 'Fibrinogen', 'Fibromyalgia', 'Fluorouracil', 'Folliculitis', 'Fondaparinux', 'Bedbound', 'Bedrest', 'Bevacizumab', 'BiPAP', 'Biloma', 'Birthmark', 'Bisphosphonate', 'Bivalirudin', 'Blepharitis', 'Blepharoplasty', 'Blindness', 'Blister', 'Bloodborne Pathogens', 'Allopurinol', 'Alopecia', 'Amblyopia', 'Amenorrhea', 'Amniocentesis', 'Anastrozole', 'Anencephaly', 'Angiodysplasia', 'Angioembolization', 'Ankyloglossia', 'Ankylosing Spondylitis', 'Haptoglobin', 'HbA1C', 'Heatstroke', 'Height', 'Heliox', 'Hematemesis', 'Hematochezia', 'Hematocrit', 'Hematology', 'Hemifacial Microsomia', 'Hemochromatosis', 'Hemoglobinuria', 'Hemophagocytic Lymphohistiocytosis (HLH)', 'Hemothorax', 'Hepatoblastoma', 'Hepatomegaly', 'Hepatosplenomegaly', 'Hepatotoxicity', 'Her2neu', 'IgG Deficiencies', 'Ileostomy', 'Impetigo', 'Improving', 'Impulsiveness', 'Incontinentia Pigmenti', 'Restlessness', 'Retinitis Pigmentosa', 'Retinoblastoma', 'Reversible Dementias', 'Rhabdomyosarcoma', 'Rhinoplasty', 'Rifaximin', 'Rosacea', 'Roseola', 'STEMI', 'Sacroiliitis', 'Scabies', 'Schistocytes', 'Sciatica', 'Scleral Buckling', 'Scleroderma', 'Sclerotherapy', 'Scotoma', 'Selective Mutism', 'Digitalization', 'Dihydroergotamine', 'Discogram', 'Dislocations', 
'Disorientation', 'Diverticulosis', 'Docetaxel', 'Domperidone', 'Dopamine', 'Doxorubicin', 'Drooling', 'Drowsiness', 'Duodenitis', "Dupuytren's Contracture", 'Dyskeratosis Congenita', 'Dyslipidemia', 'Dysmenorrhea', 'Dysphasia', 'Dyssomnias', 'Dysthymia', 'Dysuria', 'ESR', 'Eclampsia', 'Ectropion (Eublepharon)', 'Ehrlichiosis', 'Translocations', 'Transverse Myelitis', 'Trastuzumab', 'Trigeminal Neuralgia', 'Tympanoplasty', 'Unconscious', 'Underweight', 'Undescended Testes (Cryptorchidism)', 'Ureter obstructed', 'Colchicine', 'Coldness', 'Colectomy', 'Coloboma', 'Colostomy', 'Colposcopy', 'Comfort Measures Only (CMO)', 'Comorbid conditions', 'Compromised local circulation', 'Conivaptan', 'Constipation', 'Continence', 'Cor Pulmonale', 'Splinters', 'Spondylolisthesis', 'Spondylolysis', 'Stapedectomy', 'Steroid', 'Stillbirth', 'Stomatitis', 'Strabismus (Crossed Eyes)', 'Stridor', 'Stupor', 'Suicide plan', 'Sunburn', 'Suprasternal retractions', 'Sympathectomy', 'Tapeworm', 'Tattoo', 'Tau/A Beta42', 'Teething', 'Telangiectasias', 'Temper Tantrum', 'Temporal Arteritis', 'Microbiology', 'Microcephaly', 'Microdiskectomy', 'Micropenis', 'Midodrine', 'Miscarriage', 'Modified duke criteria', 'Molluscum Contagiosum', 'Monoamniotic twins', 'Mosaicism', 'Motorcycle accident', 'Myalgias', 'Myasthenia Gravis', 'Myelogram', 'Myoclonus', 'Myoglobinuria', 'Myopia', 'Myositis', 'Myxedema', 'NSAID', 'Narcolepsy', 'Nausea', 'Poliomyelitis', 'Poly-pharmacy', 'Polyhydramnios (Hydramnios)', 'Polymyalgia Rheumatica', 'Polymyositis', 'Postictal State', 'Presbycusis', 'Presbyopia', 'Presyncope', 'Proctectomy', 'Proctocolectomy', 'Pruritis Ani', 'Pseudotumor Cerebri', 'Vinorelbine', 'Vitrectomy', 'Voiding Cystourethrogram (VCUG)', 'Vomit', 'Vulvitis', "Wegener's Granulomatosis", 'Whiplash', 'Widening QRS', 'Wrinkles', 'X-linked Agammaglobulinemia', 'YAG Capsulotomy', 'Yersiniosis', 'caffeine', 'coagulopathy', 'dexamethasone', 'Infliximab', 'Insomnia', 'Insulinoma', 'Intravenous contrast 
extravasation', 'Obtundation', 'Octreotide', 'Odynophagia', 'Oligodendroglioma', 'Oligohydramnios', 'Oliguria', 'Omphalocele', 'Onychomycosis', 'Oophorectomy', 'Orchiectomy', 'Orchitis', 'Orthopnea', 'Carboplatin', 'Cardiomegaly', 'Cataracts', 'Cecostomy', 'Cephalopelvic Disproportion (CPD)']
 cluster no 3:['Brain Malignancy', 'Brain metastasis']
 cluster no 4:['Pubic Lice', 'Lice', 'Head Lice']
 cluster no 5:['Assistive, Adaptive, Supportive or Protective Device Fitting', 'Gait Training Using an Assistive Device', 'Unsteady gait']
 cluster no 6:['Removal of Soft Tissue Foreign Body', 'Soft Tissue Foreign Body']
 cluster no 7:['Necrotizing pneumonia', 'Pneumocystis Pneumonia', 'Pneumocystis pneumonia', 'Pneumonia', 'Pneumonia', 'Mycoplasma Pneumonia', 'Walking Pneumonia']
 cluster no 8:['Esophageal Atresia', 'Esophageal Dilation', 'Esophageal Manometry', 'Esophageal ring/web', 'Esophageal stricture']

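What am I doing wrong? Is there a problem with my technique? Below is my code; I use the sklearn package so that I can switch to other techniques easily: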

^{pr2}$

Edit: I now use plain cosine similarity only, and I get the same problem, a big cluster of unrelated words, so this is not a TF-IDF problem.

    import math
    import re
    from collections import Counter

    import numpy as np

    WORD = re.compile(r'\w+')

    def get_cosine(vec1, vec2):
        # Dot product over the tokens the two vectors share.
        intersection = set(vec1) & set(vec2)
        numerator = sum(vec1[x] * vec2[x] for x in intersection)

        # Product of the two Euclidean norms.
        sum1 = sum(v ** 2 for v in vec1.values())
        sum2 = sum(v ** 2 for v in vec2.values())
        denominator = math.sqrt(sum1) * math.sqrt(sum2)

        if not denominator:
            return 0.0
        return numerator / denominator

    def text_to_vector(text):
        # Bag of words: token -> count.
        return Counter(WORD.findall(text))

    # my_list holds the input strings. Vectorize each string once.
    vectors = [text_to_vector(s) for s in my_list]
    k = len(my_list)

    # Pairwise cosine distance matrix (distance = 1 - similarity).
    data1 = np.zeros((k, k))
    for i, vec1 in enumerate(vectors):
        for j, vec2 in enumerate(vectors):
            data1[i][j] = 1 - get_cosine(vec1, vec2)

    print(data1)
    data2 = np.asarray(data1)
    arr_3d = data2.reshape((1, k, k))
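The pairwise loop can also be written with scikit-learn, which may matter at 5,000+ strings. This is a sketch rather than the asker's code: `sample_terms` is a hypothetical stand-in for `my_list`, and the `CountVectorizer` settings are chosen to mimic the `\w+` tokenizer (case-sensitive, single-character tokens kept).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical stand-in for my_list.
sample_terms = ["GI bleeding", "Lower GI bleeding", "Glaucoma"]

# lowercase=False and token_pattern=r"\w+" mirror re.compile(r'\w+')
# applied to the raw text.
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\w+")
counts = vectorizer.fit_transform(sample_terms)

# Full pairwise matrix of 1 - cosine similarity, computed in one call.
dist = cosine_distances(counts)
```

Terms that share no token with any other term, such as "Glaucoma" here, are at distance 1 from everything else.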

Edit: I ran LSA instead of TF-IDF, which is supposed to suit short texts, but I got very bad results, with mismatched clusters:

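    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import Normalizer

    vectorizer = CountVectorizer(min_df=1, stop_words='english')
    dtm = vectorizer.fit_transform(my_list)

    # LSA: project the document-term matrix onto 2 components,
    # then L2-normalize each row.
    lsa = TruncatedSVD(2, algorithm='arpack')
    dtm_lsa = lsa.fit_transform(dtm)
    dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)

    # Rows are unit length, so the dot product is the cosine similarity.
    similarity = dtm_lsa @ dtm_lsa.T
    # print(1 - similarity)
    dist1 = 1.0 - similarity
    # dist1.astype(float)
    print(dist1)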

1 Answer
User
#1 · Posted 2024-10-04 11:21:50

k-means is based on variance minimization.

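It minimizes the sum of squared deviations, (x[i]-center[i])**2, over every object x and every dimension i, where center is the optimal (least-cost) center for x. It cannot minimize arbitrary distances (see the many, many questions here on this topic).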
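To make the objective concrete, here is a minimal NumPy sketch for a single cluster. The points and the helper name `sse` are illustrative assumptions; the sketch only shows that the mean gives a lower sum of squares than another candidate center.

```python
import numpy as np

def sse(points, center):
    """Sum over objects x and dimensions i of (x[i] - center[i])**2."""
    return float(((points - center) ** 2).sum())

# Three illustrative points forming one cluster.
points = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])

# The coordinate-wise mean of the cluster.
mean = points.mean(axis=0)

sse_mean = sse(points, mean)        # cost with the mean as center
sse_other = sse(points, points[0])  # cost with one of the points as center
```

Computing the mean requires the coordinates of the points themselves, which is why k-means cannot be driven by an arbitrary pairwise distance.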

There are two fatal problems in your code:

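  • The vectorization that any cosine-based method requires only works for long texts, such as news articles. It does not work for tweets or any other short text, because they contain too few useful tokens. As a rule of thumb, you will need 100+ words per text.
  • k-means must be applied to the data matrix, not to a distance matrix. It needs to compute means of the original data, so it needs the original data matrix. Moreover, k-means does not use pairwise distances; it only considers the squared deviations of points from their centers.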
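The second point can be sketched as follows. This is an illustration under assumptions, not the asker's pipeline: the term list is a small hypothetical sample, and SciPy average-linkage clustering is swapped in as one example of an algorithm that does accept a precomputed distance matrix. The threshold 0.9 is an arbitrary choice just below 1, the cosine distance between terms that share no token.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical sample of terms.
terms = [
    "Prolonged INR", "Prolonged PTT", "Prolonged QT Interval",
    "Head Lice", "Pubic Lice", "Lice",
    "Glaucoma", "Tinnitus",
]

# Data matrix: one row per term, one column per token.
X = TfidfVectorizer().fit_transform(terms)

# k-means is fitted on the data matrix itself, never on distances.
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# A distance matrix goes to an algorithm built for pairwise distances.
dist = cosine_distances(X)
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)   # condensed form for linkage
Z = linkage(condensed, method="average")

# Cut the tree at 0.9: terms with no shared token stay singletons.
hc_labels = fcluster(Z, t=0.9, criterion="distance")
```

With this cut, the three "Prolonged" terms share a label, the three "Lice" terms share a label, and "Glaucoma" and "Tinnitus" each stay alone rather than being merged into one catch-all cluster.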
