Handling KeyError when scraping

Published 2024-07-04 06:57:29


I am currently writing a script to extract data from ClinicalTrials.gov. To do this, I wrote the following script:

import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(id):
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    # Each .text access below fails if the study lacks that element.
    studyType = data.study_type.text
    # Defaults so the return statement also works for non-interventional studies.
    allocation = interventionModel = primaryPurpose = masking = enrollment = ""
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
        enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    for each in data.find_all('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.find_all('arm_group'):
        armType.append(each.arm_group_type.text)
    citedPMID = data.results_reference.PMID
    print(citedPMID)
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType

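However, this script does not always work, because not every study has every field (for example, an exception such as KeyError is raised). To get around this, I could simply wrap each statement in a try/except block, like this: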

try:
  studyType = data.study_type.text
except:
  studyType = ""

But this seems like a poor way to do it. What would be a better, cleaner solution?


1 answer
Answer 1 · posted 2024-07-04 06:57:29

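That's a good question. Before I address it, let me say that you should consider changing the second argument of the BeautifulSoup (BS) constructor from lxml to xml. Otherwise, BS does not flag the parsed markup as XML (to verify this yourself, check the is_xml attribute of the data variable in your code).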
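As a rough illustration of the parser difference, the sketch below parses the same snippet with both parsers. The sample XML and the PMID value are made up for illustration, and the sketch assumes lxml is installed, since both the "lxml" and "xml" parsers depend on it. Besides the is_xml flag, it shows a side effect that matters here: HTML parsers lowercase tag names, so an upper-case lookup such as PMID finds nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down record; real ClinicalTrials.gov XML has many more fields.
SAMPLE = """<clinical_study>
  <study_type>Interventional</study_type>
  <results_reference><PMID>12345678</PMID></results_reference>
</clinical_study>"""

as_html = BeautifulSoup(SAMPLE, "lxml")  # lxml's HTML parser
as_xml = BeautifulSoup(SAMPLE, "xml")    # lxml's XML parser

html_flag = as_html.is_xml
xml_flag = as_xml.is_xml

# The HTML parser stores the tag as "pmid", so the case-sensitive lookup returns None.
pmid_html = as_html.results_reference.PMID
# The XML parser preserves case, so the same lookup succeeds.
pmid_xml = as_xml.results_reference.PMID.text

print(html_flag, xml_flag, pmid_html, pmid_xml)
```

This is likely why print(citedPMID) in the question's script can show None under "lxml" even when the record does contain a PMID element.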

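By passing a list of the element names you want to the find_all() method, you can avoid raising errors when you try to access elements that do not exist: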

subset = ['results_reference', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'eligibility', 'official_title', 'arm_group', 'condition']

tag_matches = data.find_all(subset)

Then, if you want to fetch a specific element from the list of tags without iterating, you can convert the list to a dict keyed by tag name:

# Repeated names (e.g. arm_group) keep only the last match.
tag_dict = {tag.name: tag for tag in tag_matches}
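Putting the two steps together, here is a minimal sketch on a hypothetical, trimmed-down record. The sample XML and the helper name text_or_default are illustrative assumptions, not part of the original script, and the "xml" parser requires lxml to be installed. Missing fields are read with dict.get(), which returns None instead of raising. Nested fields such as eligibility.minimum_age still need their own guard.

```python
from bs4 import BeautifulSoup

# Hypothetical record: an observational study with no allocation or masking.
SAMPLE = """<clinical_study>
  <official_title>Example observational study</official_title>
  <study_type>Observational</study_type>
  <condition>Asthma</condition>
  <eligibility>
    <gender>All</gender>
    <minimum_age>18 Years</minimum_age>
  </eligibility>
</clinical_study>"""

data = BeautifulSoup(SAMPLE, "xml")
subset = ["allocation", "masking", "official_title", "study_type", "condition", "eligibility"]

# Tags missing from the document are simply absent from the result.
tag_dict = {tag.name: tag for tag in data.find_all(subset)}


def text_or_default(tag, default=""):
    """Return the stripped text of a tag, or a default when the tag is None."""
    return tag.get_text(strip=True) if tag is not None else default


study_type = text_or_default(tag_dict.get("study_type"))
allocation = text_or_default(tag_dict.get("allocation"))  # element is missing

# Nested fields need a guard, because eligibility itself may be absent.
eligibility = tag_dict.get("eligibility")
min_age = text_or_default(eligibility.find("minimum_age") if eligibility is not None else None)
max_age = text_or_default(eligibility.find("maximum_age") if eligibility is not None else None)

print(study_type, repr(allocation), min_age, repr(max_age))
```

One design note: because a dict holds one value per key, repeated elements such as arm_group or intervention are better collected with a separate find_all() call, as the question's loops already do.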
