Methods for classifying job description sentences

# Example data: one row per job description statement
df <- data.frame(
  job_title = c(
    "Recruiter", "Recruiter", "Recruiter", "Recruiter",
    "File Clerk", "File Clerk",
    "Learning & Org. Development Specialist", "Learning & Org. Development Specialist",
    "Learning & Org. Development Specialist", "Learning & Org. Development Specialist",
    "CNA", "CNA", "CNA"
  ),
  job_experience = c(
    "Minimum 1 year experience in recruitment or related human resources function.",
    "Proficient in Microsoft Office Applications.",
    "High school diploma required.",
    "Bachelors Degree in Human Resources or related field preferred.",
    "High School diploma preferred.",
    "Ability to use relevant computer systems.",
    "Bachelors Degree in related field (e.g., Human Resources, Education, Organizational Development).",
    "Minimum 2 years experience applying L&OD principles and practices in an organizational setting.",
    "Previous work experience in Human Resources preferred.",
    "Experience with a learning management system (LMS).",
    "High school diploma or GED equivalent.",
    "Certified Nursing Assistant, certified by the Virginia Board of Health Professions.",
    "CPR certification required at date of hire."
  )
)

job_title    job_experience                                 job_exp_category
"Recruiter"  "Minimum 1 year experience in recruitment..."  "Work experience"
"Recruiter"  "Proficient in Microsoft Office Applicati..."  "Skill/Ability"
"Recruiter"  "High school diploma required."                "Degree"
...          ...                                            ...
"CNA"        "Certified Nursing Assistant, certificati..."  "Certification/License"
"CNA"        "CPR certification required at date of hire."  "Certification/License"

1 answer

User
#1 · Posted 2024-06-26 17:37:06

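If anyone comes across this post with a similar need, here is what I (the OP) ended up doing:
Following the content in this link, I combined supervised learning (a random forest) to classify job description statements into four categories (degree, work experience, certification/license, and KSA) with unsupervised learning (k-means cluster analysis) to group work experience statements that use similar vocabulary (e.g., cluster 1 = statements that reference Microsoft Office products).
The general process was: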
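Phase 1 (identify the job description statements that relate to work experience):
Hand-coded a sample of the job description statements into the appropriate categories.
Converted my dataset into a tidytext data frame to prepare it for analysis.
Created a training dataset from the hand-coded statements and their associated categories, then a test dataset containing the statements still to be classified.
Estimated a random forest model (supervised learning) with the caret package [details: method = "ranger", out-of-bag resampling, number of trees = 200]. My OOB prediction error was 2.96%.
Used predict() to predict the job description category for the remaining data (i.e., the test dataset).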
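The phase 1 steps might look roughly like the sketch below. It is only an illustration under stated assumptions, not the OP's actual code: the `statements` data frame, its labels, the snake_case category names, and the decision to build one document-term matrix for both labeled and unlabeled rows are all hypothetical. It assumes the dplyr, tidytext, caret, ranger, and e1071 packages are installed.

```r
library(dplyr)
library(tidytext)
library(caret)  # method = "ranger" also needs the ranger and e1071 packages

# Hypothetical data: `category` holds the hand-coded label; NA marks
# statements that still need to be classified (the "test" set).
statements <- data.frame(
  id = 1:16,
  text = c(
    "Minimum 1 year experience in recruitment.",
    "Minimum 2 years experience applying L&OD principles.",
    "Previous work experience in Human Resources preferred.",
    "Experience with a learning management system (LMS).",
    "High school diploma required.",
    "Bachelors Degree in Human Resources preferred.",
    "High School diploma preferred.",
    "Bachelors Degree in related field.",
    "Certified Nursing Assistant, certified by the Board.",
    "CPR certification required at date of hire.",
    "Valid drivers license required.",
    "Proficient in Microsoft Office Applications.",
    "Ability to use relevant computer systems.",
    "Ability to communicate clearly in writing.",
    "High school diploma or GED equivalent.",
    "Minimum 3 years experience in payroll."
  ),
  category = c(rep("work_experience", 4), rep("degree", 4),
               rep("certification", 3), rep("ksa", 3), NA, NA),
  stringsAsFactors = FALSE
)

# Tidy format: one row per (statement, word) pair with its count
tokens <- statements %>%
  unnest_tokens(word, text) %>%
  count(id, word)

# One document-term matrix for all rows, so training and test columns line up
dtm <- as.matrix(cast_sparse(tokens, id, word, n))
colnames(dtm) <- make.names(colnames(dtm), unique = TRUE)

is_labeled <- !is.na(statements$category)
x_train <- dtm[as.character(statements$id[is_labeled]), , drop = FALSE]
y_train <- factor(statements$category[is_labeled])
x_test  <- dtm[as.character(statements$id[!is_labeled]), , drop = FALSE]

# Random forest via ranger, tuned with out-of-bag resampling, 200 trees
set.seed(42)
rf_fit <- train(
  x = x_train, y = y_train,
  method = "ranger",
  trControl = trainControl(method = "oob"),
  num.trees = 200
)

# OOB prediction error of the final ranger model
oob_error <- rf_fit$finalModel$prediction.error

# Predicted category for each statement that was not hand-coded
predicted <- predict(rf_fit, newdata = x_test)
```

On a toy sample this small the OOB error is not meaningful; the 2.96% figure above comes from the OP's full hand-coded dataset.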
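Phase 2 (sort the work experience statements into related buckets):
Satisfied with the prediction error for this task, I filtered the data down to only the job description statements classified as "Work experience".
After cleaning the dataset so that clustering would not be driven by unimportant words (e.g., "preferred"), I used kmeans() cluster analysis (k = 200) to cluster the work experience statements by the words they use.
At this point, we are still deciding on the final work experience descriptions/categories, but the process is now much more efficient: most of the pruning is done, and we have a head start on how to categorize the statements appropriately.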
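The clustering step in phase 2 can be sketched as follows. This is a minimal illustration, not the OP's code: the `work_exp` statements and the `drop_words` list are hypothetical, tokenization is done in base R rather than tidytext to keep the example self-contained, and k is set to 2 only because the toy sample is tiny (the answer used k = 200 on the full data).

```r
# Hypothetical statements already classified as "Work experience" in phase 1
work_exp <- c(
  "Experience with Microsoft Office preferred.",
  "Proficiency with Microsoft Office products.",
  "Experience using Microsoft Office applications.",
  "Recruitment experience preferred.",
  "Minimum 1 year recruitment experience.",
  "Previous recruitment experience required."
)

# Words assumed to be uninformative for clustering (illustrative list only)
drop_words <- c("preferred", "required", "experience", "with",
                "minimum", "previous")

# Lowercase, replace punctuation with spaces, split on whitespace,
# and drop the uninformative words
tokenize <- function(x) {
  words <- strsplit(gsub("[^a-z0-9 ]", " ", tolower(x)), " +")[[1]]
  setdiff(words[nzchar(words)], drop_words)
}
token_list <- lapply(work_exp, tokenize)

# Binary document-term matrix: one row per statement, one column per word
vocab <- sort(unique(unlist(token_list)))
dtm <- t(vapply(token_list,
                function(tk) as.numeric(vocab %in% tk),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# k-means on the word indicators; several random starts reduce the chance
# of ending in a poor local optimum
set.seed(42)
km <- kmeans(dtm, centers = 2, nstart = 25)

# Statements grouped by assigned cluster, ready for manual review
clusters <- split(work_exp, km$cluster)
```

Reviewing `clusters` by hand is then the pruning step: each cluster collects statements with shared vocabulary, which a person can name or merge into a final category.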
