Getting data from specific fields on ClinicalTrials.gov

Posted 2024-09-29 23:19:25


I wrote a function that, given an NCTID (i.e. a ClinicalTrials.gov ID), scrapes data from ClinicalTrials.gov:

import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(nctid):
    # Fetch the study record as XML and parse it
    data = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
    subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms']
    tag_matches = data.find_all(subset)

I then do the following:

^{pr2}$

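to convert this data into a dictionary. However, when a study has multiple intervention types (e.g. NCT02170532), this only picks up one of them. How can I adjust this code so that, when a field has multiple values, those values are listed as a comma-separated list?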

Current output:

ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome: 
Change in Maximum Forced Expiratory Volume at One Second (FEV1)
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment

ctSecondary_outcome: 
Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea Scale
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment

ctNumber_of_arms: 5
ctEnrollment: 10
ctCondition: Asthma
ctIntervention_type: Drug
ctGender: All
ctMinimum_age: 18 Years
ctMaximum_age: N/A
ctHealthy_volunteers: No

Desired output:

ctOfficial_title: Aerosolized Beta-Agonist Isomers in Asthma
ctPhase: Phase 4
ctStudy_type: Interventional
ctAllocation: Non-Randomized
ctIntervention_model: Crossover Assignment
ctPrimary_purpose: Treatment
ctMasking: None (Open Label)
ctPrimary_outcome: 
Change in Maximum Forced Expiratory Volume at One Second (FEV1)
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment

ctSecondary_outcome: 
Change in Dyspnea Response as Measured by the University of California, San Diego (UCSD) Dyspnea Scale
Baseline (before treatment), 30 minutes, 1, 2, 4, 6, and 8 hours post treatment

ctNumber_of_arms: 5
ctEnrollment: 10
ctCondition: Asthma
ctIntervention_type: Drug, Drug, Other, Device, Device, Drug
ctGender: All
ctMinimum_age: 18 Years
ctMaximum_age: N/A
ctHealthy_volunteers: No

How can I adjust the code so that it scrapes all of the intervention types?


2 answers

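Your code fails because it overwrites the previous value stored under a given dictionary key. Instead, you need to append to the existing entry.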
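To make the difference concrete, here is a minimal sketch that runs without any network access. The `pairs` list is a hypothetical stand-in for the (tag name, text) values that come out of the parsed XML; it is not real output from the site.

```python
# Hypothetical (tag name, text) pairs standing in for the parsed XML tags.
pairs = [
    ("intervention_type", "Drug"),
    ("intervention_type", "Other"),
    ("intervention_type", "Device"),
    ("phase", "Phase 4"),
]

# Plain assignment: each repeated key replaces the value stored before it.
overwritten = {}
for name, text in pairs:
    overwritten["ct" + name.capitalize()] = text

# Appending: each key holds a list, so repeated tags accumulate.
appended = {}
for name, text in pairs:
    appended.setdefault("ct" + name.capitalize(), []).append(text)

print(overwritten)
print(appended)
```

With plain assignment only the last intervention type survives, which matches the behaviour described in the question.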

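You can use Python's defaultdict(), which automatically creates a list for each new key. If a field has multiple entries, each one is appended to that key's list. Then, when printing, you can join the list back together with a ", " separator if needed: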

from collections import defaultdict

import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(nctid):
    # Every key starts as an empty list, so repeated tags accumulate instead of overwriting
    data = defaultdict(list)
    soup = BeautifulSoup(requests.get("https://clinicaltrials.gov/ct2/show/" + nctid + "?displayxml=true").text, "xml")
    subset = ['intervention_type', 'study_type', 'allocation', 'intervention_model', 'primary_purpose', 'masking', 'enrollment', 'official_title', 'condition', 'minimum_age', 'maximum_age', 'gender', 'healthy_volunteers', 'phase', 'primary_outcome', 'secondary_outcome', 'number_of_arms']

    for tag in soup.find_all(subset):
        data['ct{}'.format(tag.name.capitalize())].append(tag.get_text(strip=True))

    # Join each key's values into one comma-separated string for display
    for key in data:
        print('{}: {}'.format(key, ', '.join(data[key])))

clinicalTrialsGov('NCT02170532')

This would display the following:

^{pr2}$
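If you want to try the defaultdict approach without hitting the site, the sketch below applies the same idea to an inline XML string. Note the assumptions: `SAMPLE_XML` is a made-up fragment (not the real record for NCT02170532), `collect_fields` is a hypothetical helper name, and the standard library's `xml.etree.ElementTree` is swapped in for BeautifulSoup so that nothing needs to be installed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up XML fragment for illustration only; not real ClinicalTrials.gov data.
SAMPLE_XML = """
<clinical_study>
  <phase>Phase 4</phase>
  <condition>Asthma</condition>
  <intervention><intervention_type>Drug</intervention_type></intervention>
  <intervention><intervention_type>Other</intervention_type></intervention>
  <intervention><intervention_type>Device</intervention_type></intervention>
</clinical_study>
"""

def collect_fields(xml_text, subset):
    """Return {ctField: 'v1, v2, ...'} for every tag whose name is in subset."""
    data = defaultdict(list)
    root = ET.fromstring(xml_text.strip())
    # iter() walks all elements in document order, including nested ones.
    for element in root.iter():
        if element.tag in subset:
            data["ct" + element.tag.capitalize()].append((element.text or "").strip())
    # Join each list into a single comma-separated string.
    return {key: ", ".join(values) for key, values in data.items()}

result = collect_fields(SAMPLE_XML, {"phase", "condition", "intervention_type"})
for key, value in result.items():
    print("{}: {}".format(key, value))
```

Returning the dictionary instead of printing inside the function makes it easier to reuse the values later, for example when writing them to a file.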

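You are seeing only the last tag's value because every earlier value is overwritten by the next one. You need to check whether a key already exists in the dictionary and, if it does, handle it accordingly.
Like this: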

tag_dict = {}
for tag in tag_matches:
    key = 'ct' + tag.name.capitalize()
    if key in tag_dict:
        # Key already seen: append the new value after a comma
        tag_dict[key] += ', ' + tag.text
    else:
        tag_dict[key] = tag.text
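The key-exists check can be tried in isolation with the sketch below. `FakeTag` and `tags_to_dict` are hypothetical names introduced here; `FakeTag` only mimics the `.name` and `.text` attributes that the loop reads from a BeautifulSoup tag, and the sample values are invented.

```python
from collections import namedtuple

# Hypothetical stand-in for a BeautifulSoup tag: only .name and .text are used.
FakeTag = namedtuple("FakeTag", ["name", "text"])

def tags_to_dict(tag_matches):
    """Build a dict of comma-separated strings, one entry per distinct tag name."""
    tag_dict = {}
    for tag in tag_matches:
        key = "ct" + tag.name.capitalize()
        if key in tag_dict:
            # Repeated tag: extend the existing string rather than replace it.
            tag_dict[key] += ", " + tag.text
        else:
            tag_dict[key] = tag.text
    return tag_dict

tags = [
    FakeTag("intervention_type", "Drug"),
    FakeTag("intervention_type", "Drug"),
    FakeTag("intervention_type", "Other"),
    FakeTag("gender", "All"),
]
result = tags_to_dict(tags)
print(result)
```

Duplicates such as the repeated "Drug" are kept, which is consistent with the desired output in the question. Building strings directly is fine for display, but keeping lists (as in the defaultdict answer) is usually easier if the individual values are needed later.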
