Web scraping into a single string with Python

import requests, re, json
from bs4 import BeautifulSoup

urls = ['http://t24.com.tr/haber/suriyelilere-vatandasliga-neden-karsi-cikiliyor,348652',
        'http://t24.com.tr/haber/oteki-suriyeliler-turkiye-vatandasi-olursak-askere-gideriz-akpye-oy-verir-miyim-bilmiyorum,349206',
        'http://t24.com.tr/haber/konyada-turklerle-suriyeliler-arasinda-kopege-niye-tekme-attin-kavgasi-3-olu-2-yarali,349208']

for url in urls:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.findAll('p', {"class" : "p1"})
for p in paragraphs:
    text = p.text.replace(',', '').replace('"', '').replace('.', '').replace("'", "").replace('?', '').replace("\n", "").replace('\r', '')
    print(text)

Selin Girit Kendi ülkesinde savaştan kaçacak sınavsız okula girip askerlik yapmayacak 10 yıl sonra benden iyi yaşayacak #ÜlkemdeSuriyeliİstemiyorum Cumhurbaşkanı Recep Tayyip Erdoğanın Türkiyede yaşayan Suriyeli mültecilere

1 Answer

User

#1 · Posted 2024-06-23 02:41:59

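First: the first for loop creates a new soup for every url before anything is done with any of them. Each pass overwrites the previous soup, so your code only gets the text from the last url in urls. The first thing to do is move the paragraph loop inside the url loop.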
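To see why only the last page survives, here is a minimal sketch that runs without network access. The `pages` list and the `parse` function are hypothetical stand-ins for the downloaded HTML and for `BeautifulSoup(html, "html.parser")`; they are not part of the original code.

```python
# Hypothetical stand-ins: `pages` plays the role of the downloaded HTML,
# `parse` plays the role of BeautifulSoup(html, "html.parser").
pages = ["<p>first</p>", "<p>second</p>", "<p>third</p>"]


def parse(html):
    """Strip the <p> tags so the example stays dependency-free."""
    return html.replace("<p>", "").replace("</p>", "")


# Processing outside the loop: `doc` is overwritten on every pass,
# so only the last page is left once the loop ends.
for html in pages:
    doc = parse(html)
outside = [doc]

# Processing inside the loop: each page is handled before it is overwritten.
inside = []
for html in pages:
    doc = parse(html)
    inside.append(doc)

print(outside)  # ['third']
print(inside)   # ['first', 'second', 'third']
```

The same applies to `soup`: anything that should happen for every url has to be indented under `for url in urls:`.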

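soup.findAll() returns paragraphs, an iterable (a ResultSet, which behaves like a list) holding every p tag on the page that has the class p1. Before looping over the paragraphs, create an empty string full_text, then append each paragraph's text to it to get the result you want, as shown below.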

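# Uses the imports and the urls list from the question.
for url in urls:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    # Start with an empty string for this page and append each cleaned paragraph.
    full_text = ''
    paragraphs = soup.findAll('p', {"class" : "p1"})
    for p in paragraphs:
        text = p.text.replace(',', '').replace('"', '').replace('.', '').replace("'", "").replace('?', '').replace("\n", "").replace('\r', '')
        full_text += text

    # Print the accumulated string for this page, not just the last paragraph.
    print(full_text)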
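As a possible alternative to the chain of replace() calls and the repeated `+=`, the cleaning and joining step could be sketched with `str.translate` and `str.join`. The names `REMOVE_CHARS`, `STRIP_TABLE` and `clean_and_join` are hypothetical and do not appear in the original code; the function takes any iterable of paragraph strings, so it can be tried without downloading anything.

```python
# The same characters that the replace() chain above strips out.
REMOVE_CHARS = ",\".'?\n\r"

# Translation table that deletes each of those characters.
STRIP_TABLE = str.maketrans("", "", REMOVE_CHARS)


def clean_and_join(paragraph_texts):
    """Remove the unwanted characters from each paragraph and join them into one string."""
    return "".join(text.translate(STRIP_TABLE) for text in paragraph_texts)


sample = ['Hello, "world".\n', "What's this?\r\n", "Done"]
print(clean_and_join(sample))  # Hello worldWhats thisDone
```

Inside the url loop this could be called as `clean_and_join(p.text for p in paragraphs)`. Like the original, it joins paragraphs with no separator; joining with `" "` instead would keep words from adjacent paragraphs apart.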
