Web scraping abstracts with Python from https://ash.confex.com/ash/2019/webprogram/start.htm

Posted 2024-10-02 08:23:13


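I am trying to extract abstract information (title, date, authors, affiliations, background, methods, results, conclusions) from every page that matches the keywords Adoptive cell therapy, Allogeneic, Autologous, Artificial T-Cell Receptors, BCMA, TACI, CD123.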

Using selenium, I tried to enter the keywords and open the result pages, but I could not get any further.

import webbrowser
import os
import requests
from bs4 import BeautifulSoup
import sys
import wget
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# chromedriver is looked up on PATH; recent Selenium releases no longer accept the path as a positional argument
driver = webdriver.Chrome()
driver.get('https://ash.confex.com/ash/2018/webprogram/meeting.html#Friday')
src = driver.page_source  # gets the html source of the page
parser = BeautifulSoup(src, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
mn=[]
list_of_attributes = {"class": "itemtitle"}  # A list of attributes that you want to check in a tag
tag = parser.findAll('div', attrs=list_of_attributes)

for ls in tag:
    for s in range(0,len(ls.contents)):
        try:
            if 'Session' in ls.contents[s].attrs['href']:
                mn.append('https://ash.confex.com/ash/2019/webprogram/'+ls.contents[s].attrs['href'])
        except (AttributeError, KeyError):
            # text nodes have no attrs, and some tags have no href
            pass


# response = requests.post('https://ash.confex.com/ash/2019/webprogram/Session11552.html')
# soup = BeautifulSoup(response.text)
#
# list_of_attributes = {"class": "cricon"}  # A list of attributes that you want to check in a tag
# tag1 = soup.findAll('div', attrs=list_of_attributes)

dt=pd.DataFrame()
dt['Main Links']=mn
dt.to_excel(r'G:\Oct 18\ASH 18\Ash_Main_links2.xlsx')  # raw string so the backslashes are not read as escapes
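
The loop above builds absolute links by string concatenation. A minimal sketch of the same step with `urljoin` from the standard library, which also copes with hrefs that are already absolute. The base URL is the one used in the code above; the helper name `to_absolute` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# Base taken from the question's code; the trailing slash matters to urljoin
BASE_URL = 'https://ash.confex.com/ash/2019/webprogram/'


def to_absolute(hrefs, base=BASE_URL):
    """Return absolute URLs for the hrefs that point at session pages."""
    return [urljoin(base, h) for h in hrefs if 'Session' in h]


# Hypothetical hrefs, for illustration only
sample = [
    'Session11552.html',
    'start.html',
    'https://ash.confex.com/ash/2019/webprogram/Session1.html',
]
links = to_absolute(sample)
print(links)
```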

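The output I want looks like this:

Title: 201 Factors Associated with Response, CAR-T Cell In Vivo Expansion, and Progression-Free Survival after Repeat Infusions of CD19 CAR-T Cells (Clinically Relevant Abstract)
Program: Oral and Poster Abstracts
Type: Oral
Session: 704. Immunotherapies I
Hematology Disease Topics & Pathways: Biological, ALL, Leukemia, Diseases, Therapies, Lymphoma (any), CLL, CAR-Ts, Non-Hodgkin Lymphoma, DLBCL, Clinically relevant, Lymphoid Malignancies
Saturday, December 7, 2019: 12:30 PM
Valencia A (W415A), Level 4 (Orange County Convention Center)
Evandro D. Bezerra, MD1, Jordan Gauthier, MD, MSc1,2, Alexandre V. Hirayama, MD2, Barbara S. Pender, MSc2*, Reed M. Hawkins, BS2*, Aesha Vakil, MSc2*, Rachel N. Steinmetz, BS2*, Aude G. Chapuis, MD1,2*, Brian G. Till, MD1,2, Hans-Peter Kiem, MD, PhD1,3, Mazyar Shadman, MD, MPH1,2*, Ryan D. Cassaday, MD1,2, Stanley R. Riddell, MD1,2*, David G. Maloney, MD, PhD1,3 and Cameron J. Turtle, MBBS, PhD1,2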

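1 Department of Medicine, University of Washington, Seattle; 2 Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, WA; 3 Fred Hutchinson Cancer Research Center, Seattle, WA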

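Background

CD19-targeted chimeric antigen receptor-engineered (CD19 CAR) T-cell immunotherapy has shown promising efficacy in patients with relapsed or refractory (R/R) B-cell malignancies. The potential benefit of repeat infusions of CD19 CAR-T cells is unclear, and the factors associated with response, CAR-T cell in vivo expansion, and progression-free survival (PFS) after repeat infusions of CD19 CAR-T cells have not been studied.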

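Methods

We analyzed the outcomes of patients with R/R B-cell malignancies after a second infusion of CD19 CAR-T cells (CART2) in a phase 1/2 trial (NCT01865617) conducted at our institution. Responses after CAR-T cell therapy were assessed around day 28 after infusion and defined according to the 2018 NCCN guidelines for acute lymphoblastic leukemia (ALL), the 2018 iwCLL guidelines for chronic lymphocytic leukemia (CLL), and the Lugano criteria for non-Hodgkin lymphoma (NHL). Logistic, Cox, and linear regression were used for multivariable analyses of response, PFS, and peak CD8+ CAR-T cells in blood, respectively. Bayesian model averaging was used for variable selection.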

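Results

Forty-four patients evaluable for response (ALL, n=14; CLL, n=11; NHL, n=19) were included in this study. The median age at CART2 was 58 years (range, 23-73). Patients were heavily pretreated (median prior therapies, 6; range, 2-13), and 16 patients (36%) had bulky (≥ 5 cm) lymph nodes or extramedullary disease. The median time from the first CAR-T infusion (CART1) to CART2 was 70 days (range, 28-712). Twenty-eight patients (64%) received a CART1 dose ≥ 2x10^6 CAR-T cells/kg. Fifteen patients (32%) had no response to CART1, 22 patients (50%) relapsed or progressed after an initial response (complete response [CR], n=15; partial response [PR] to CART1, n=7), and 7 (16%) received CART2 while in PR after CART1. All characteristics are shown in the Table.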

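We observed responses in all disease types: 3 of 14 ALL patients (21%; all CR/CRi), 4 of 11 CLL patients (36%; CR/CRi, n=3; partial remission [PR], n=1), and 9 of 19 NHL patients (47%; CR, n=2; PR, n=7). After a median follow-up of 43 months (range, 16-66) in patients who were alive and responding, the estimated 4-year PFS probability in responders was 23% (95% CI, 9-59%). The 4-year overall survival was 36% in responders (95% CI, 19-71%) versus 24% in non-responders (95% CI, 12-47).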

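The multivariable logistic regression model identified predictors of response after CART2: CART1 lymphodepletion (high-intensity cyclophosphamide and fludarabine [CyFlu] versus no CyFlu, OR=12.19; 95% CI, 1.10-1689.85; p=0.04) and peak in vivo CAR-T cell expansion after CART2 (OR=2.31 per log10 CD8+ CAR-T cells/µL increase; 95% CI, 1.17-5.29; p=0.01).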

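In the multivariable Cox model, a higher peak of CD8+ CAR-T cells after CART2 (HR=0.47 per log10 CD8+ CAR-T cells/µL increase; 95% CI, 0.33-0.68; p<0.001) and a CART2 cell dose greater than the CART1 dose (HR=0.36; 95% CI, 0.16-0.86; p=0.02) were associated with longer PFS. This suggests that the peak of CD8+ CAR-T cells after CART2, and factors that increase that peak (such as preventing immune rejection or increasing the infused cell dose), are key factors associated with outcomes after CART2. We therefore investigated the factors associated with a higher CD8+ CART2 peak. In multivariable linear regression, CyFlu before CART1 predicted a higher peak of CD8+ CAR-T cells after CART2 (high-intensity CyFlu versus no CyFlu, p<0.001) after adjusting for disease type (CLL versus ALL, p=0.02; NHL versus ALL, p=0.04) and the total number of CD19+ cells in blood (p=0.02).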

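CyFlu is the most commonly used lymphodepletion before CAR-T cell therapy, so we assessed the impact of the intensity of CART1 CyFlu lymphodepletion by comparing high-intensity with low-intensity CyFlu in multivariable models. Logistic regression showed that patients who received high-intensity CyFlu before CART1 had a higher probability of response to CART2 than those who received low-intensity CyFlu (OR=3.83; 95% CI, 0.85-21.83; p=0.08). In multivariable analysis, after adjusting for disease type and the total number of CD19+ cells in blood, high-intensity CyFlu before CART1 was associated with higher CD8+ CAR-T cell counts at day 60 after CART2 compared with low-intensity CyFlu (p=0.01).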

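Conclusions

Our findings suggest that high-intensity CyFlu lymphodepletion before CART1 and an increased CAR-T cell dose at CART2 may improve outcomes after a second infusion of CD19 CAR-T cells.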


1 Answer

Answer 1 · posted 2024-10-02 08:23:13

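This is tricky because the data is formatted differently on each link, but essentially you can fetch the HTML by passing parameters to requests, collect the links, then visit each link and extract the data. There may be a more elegant approach, but this should get you going. I did not run through the whole list because it takes a while, but I got a decent chunk and printed the first 5 rows as a proof of concept: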

import requests
from bs4 import BeautifulSoup
import math
import pandas as pd




url = 'https://ash.confex.com/ash/2019/htsearch.cgi'

df = pd.DataFrame()

for keyword in ['Adoptive cell therapy', 'Allogeneic', 'Autologous', 'Artificial T-Cell Receptors', 'BCMA', 'TACI', 'CD123']:

    # Search form parameters for the first results page of this keyword
    payload = {
        'words': keyword,
        'method': 'and',
        'pge': '1',
        'submit': 'Search',
        'byDayany': '1',
        'bySymposiumany': '1',
        'byAudienceany': '1',
        'action': 'search',
        'source': 'webprogram',
        'webprogrammode': 'default',
        'excludecontenttype': '1',
    }


    response = requests.get(url, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Assumes the first <b> tag holds the total hit count and that results are listed 10 per page
    tot_pages = math.ceil(int(soup.find('b').text)/10)

    for page in range(1,tot_pages+1):
        # Same search parameters, now requesting the current results page
        payload = {
            'words': keyword,
            'method': 'and',
            'pge': str(page),
            'submit': 'Search',
            'byDayany': '1',
            'bySymposiumany': '1',
            'byAudienceany': '1',
            'action': 'search',
            'source': 'webprogram',
            'webprogrammode': 'default',
            'excludecontenttype': '1',
        }

        response = requests.get(url, params=payload)
        soup = BeautifulSoup(response.text, 'html.parser')

        resultList = soup.find_all('li')
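        for each in resultList:
            anchor = each.find('a')
            if anchor is None or not anchor.get('href'):
                continue  # skip list items that do not link to a result page
            href = anchor['href']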

            link_url = 'https://ash.confex.com/ash/2019/webprogram/' + href
            response_alpha = requests.get(link_url)
            soup_alpha = BeautifulSoup(response_alpha.text, 'html.parser')

            headers = soup_alpha.find_all('span', {'class':'header'})

            header_col = []
            header_val = []
            for head in headers:
                a = head.text

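                sibling = head.next_sibling
                if getattr(sibling, 'name', None) == 'br':
                    # the value sits after a <br>, so step past it
                    b = sibling.next_sibling
                else:
                    b = sibling
                # normalise to plain text (b may be a tag, a string, or None)
                b = b.get_text(strip=True) if hasattr(b, 'get_text') else str(b or '').strip()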
                header_col.append(a)
                header_val.append(b)

            title = ' '.join(soup_alpha.find('h2').text.strip().split())

            print (title)

            time = soup_alpha.find('div', {'class':'datetime header'}).text.strip()
            loc = ' '.join(soup_alpha.find('div', {'class':'location'}).text.strip().split())

            # find() returns None when a block is missing, so .text raises AttributeError
            try:
                authors = soup_alpha.find('div', {'class':'paperauthors'}).text.strip()
            except AttributeError:
                authors = 'N/A'

            try:
                abstract = soup_alpha.find('div', {'class':'abstract'}).text.strip()
            except AttributeError:
                abstract = 'N/A'

            try:
                disclosure = soup_alpha.find('div', {'class':'disclosure'}).text.strip()
            except AttributeError:
                disclosure = 'N/A'

            data = header_val + [title, time, loc, authors, abstract, disclosure]
            col = header_col + ['title','time','location','authors','abstract','disclosure']

            temp_df = pd.DataFrame([data], columns=col)



            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            df = pd.concat([df, temp_df], sort=True).reset_index(drop=True)

Output:

print (df.head(5).to_string())
               Hematology Disease Topics & Pathways:                   Program:                                           Session: Type:                                           abstract                                            authors                                         disclosure                                           location                                         time                                              title
0  Biological, Therapies, CAR-Ts, Technology and ...  Oral and Poster Abstracts  703. Adoptive Immunotherapy: Mechanisms and Ne...   NaN  Success of adoptive T cell therapy (ATT) is de...  Stefanie Herda, PhD1*, Andreas Heimann, MSc1,2...  Disclosures: Bullinger: Bayer: Other: Financin...  Hall B, Level 2 (Orange County Convention Center)  Saturday, December 7, 2019, 5:30 PM-7:30 PM  1943 Long-Term T Cell Expansion Results in Inc...
1  Therapies, Technology and Procedures, cell exp...  Oral and Poster Abstracts  703. Adoptive Immunotherapy: Mechanisms and Ne...   NaN  The treatment of haematological malignancies w...  André Simoes, PhD*, Joanna Kawalkowska, PhD*, ...  Disclosures: Simoes: GammaDelta Therapeutics L...  Hall B, Level 2 (Orange County Convention Center)    Sunday, December 8, 2019, 6:00 PM-8:00 PM  3221 Vδ1+ T Cells: Adoptive Cell Therapy for t...
2  Diseases, Leukemia, antibodies, Biological, AM...  Oral and Poster Abstracts                     704. Immunotherapies: Poster I   NaN                                       Introduction  Rajneesh Nath, MD1, Eileen M Geoghegan2*, Matt...  Disclosures: Nath: Astellas: Consultancy; Daii...  Hall B, Level 2 (Orange County Convention Center)  Saturday, December 7, 2019, 5:30 PM-7:30 PM  1958 Sierra Clinical Trial Dosimetry Results S...
3  Diseases, Biological, Therapies, Hodgkin Lymph...  Oral and Poster Abstracts                    704. Immunotherapies: Poster II   NaN  BACKROUND: Hodgkin Lymphoma (HL) is characteri...  Fabio Guolo, MD1*, Paola Minetto, MD1, Filippo...  Disclosures: No relevant conflicts of interest...  Hall B, Level 2 (Orange County Convention Center)    Sunday, December 8, 2019, 6:00 PM-8:00 PM  3231 Adoptive Cell Therapy and Immune Check Po...
4  Diseases, Leukemia, ALL, Biological, AML, Ther...  Oral and Poster Abstracts  703. Adoptive Immunotherapy: Mechanisms and Ne...   NaN                                         Background  Hongbing Ma, MD, PhD1*, Ke Zeng, MD, PhD2*, Mi...  Disclosures: Iyer: Genentech/Roche: Research F...  Hall B, Level 2 (Orange County Convention Center)  Saturday, December 7, 2019, 5:30 PM-7:30 PM  1940 Adoptive Therapy with Cord Blood Regulato...
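
The question asks for background, methods, results and conclusions as separate fields, while the code above stores the abstract as one text column. A minimal sketch of one way to split that text, assuming each section heading sits at the start of a line and is followed either by a colon or by a line break. The heading list and the function name `split_sections` are assumptions for illustration; real abstracts vary (the output above shows "Introduction", "Background" and the misspelt "BACKROUND:"), so the list would need extending.

```python
import re

# Assumed heading names; extend to match the abstracts actually scraped
HEADINGS = ['Background', 'Introduction', 'Methods', 'Results',
            'Conclusions', 'Conclusion']

# A heading counts only at the start of a line, followed by ':' or end of line
_PATTERN = re.compile(
    r'^[ \t]*(' + '|'.join(HEADINGS) + r')[ \t]*(?::|$)',
    flags=re.IGNORECASE | re.MULTILINE,
)


def split_sections(abstract):
    """Split abstract text into {heading: body}; text before any heading goes under 'other'."""
    matches = list(_PATTERN.finditer(abstract))
    if not matches:
        return {'other': abstract.strip()}
    sections = {}
    lead = abstract[:matches[0].start()].strip()
    if lead:
        sections['other'] = lead
    for i, m in enumerate(matches):
        # a section body runs until the next heading, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(abstract)
        sections[m.group(1).lower()] = abstract[m.end():end].strip()
    return sections


# Made-up text, for illustration only
text = ("Background\nCAR-T cells help.\n\n"
        "Methods: We analyzed data.\nResults were mixed.\n\n"
        "Conclusions\nMore work needed.")
parts = split_sections(text)
print(parts)
```

Applied to the scraped data, this could be run per row with df['abstract'].apply(split_sections). A body line that merely begins with a word such as "Results" is not treated as a heading unless a colon or a line break follows it.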
