BeautifulSoup can't find the text

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

testcount = 0
titles1 = []
bodies1 = []
times1 = []
data = pd.read_csv('URLsALLjun27.csv', header=None)
for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        titlemess = soup.find(id="title").get_text()  # getting the title
        titlestring = str(titlemess)  # make it a string
        title = titlestring.replace("\n", "").replace("\r", "")
        titles1.append(title)
        bodymess = soup.find(class_="article").get_text()  # get the body with markup
        bodystring = str(bodymess)  # make body a string
        body = bodystring.replace("\n", "").replace("\u3000", "")  # scrub markup
        bodies1.append(body)  # add to list for export
        timemess = soup.find('span', {"class": "time"}).get_text()
        timestring = str(timemess)
        time = timestring.replace("\n", "").replace("\r", "").replace("年", "-").replace("月", "-").replace("日", "")
        times1.append(time)
        testcount = testcount + 1  # counter
        print(testcount)
    except Exception as e:
        print(testcount, e)

1 answer

User

#1 · Posted 2024-05-19 01:13:37

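The problem is that pages containing a picture have no class="article" element, and the same goes for {}. So it seems you have to detect whether the page contains a picture and, if it does, look up the date and the text as follows.

For the date, try: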

timemess = soup.find(id="pubtime").get_text()

For the body, the article appears to be nothing more than the picture's caption. So you can try the following:

bodymess = soup.find('img').findNext().get_text()

In short, soup.find('img') finds the image, and findNext() moves on to the next block, which contains the text.
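To make the find-then-step-forward idea concrete, here is a minimal sketch run against a made-up HTML fragment. The markup, the `photo.jpg` file name, and the variable names are assumptions for illustration, not the real page. It uses `find_next`, the current spelling of `findNext`, and passes `True` so that only tags are matched and bare text nodes are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a picture-only page.
html = (
    '<div id="pubtime">2017-06-27</div>'
    '<div><img src="photo.jpg"><p>Caption text under the picture</p></div>'
)

# html.parser ships with Python, so no extra parser is needed here.
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")             # locate the first image tag
caption_tag = img.find_next(True)  # first tag that follows it in document order
caption = caption_tag.get_text()

print(caption_tag.name, "->", caption)
```

If the real page places some other tag between the image and its caption, the tag name to step to would have to be adjusted after inspecting the page.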

So in your code, I would do the following:

try:
    bodymess = soup.find(class_="article").get_text()

except AttributeError:
    bodymess = soup.find('img').findNext().get_text()

try:
    timemess = soup.find('span',{"class":"time"}).get_text()

except AttributeError:
    timemess = soup.find(id="pubtime").get_text()
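The same fallback can also be written without catching `AttributeError`, by checking whether each lookup returned `None`. The sketch below is one possible arrangement, not the only one; the helper names `first_text`, `extract_body`, and `extract_time` and the two sample pages are hypothetical.

```python
from bs4 import BeautifulSoup


def first_text(soup, *finders):
    """Return the text of the first finder that locates a tag, else None."""
    for finder in finders:
        tag = finder(soup)
        if tag is not None:
            return tag.get_text(strip=True)
    return None


def caption_after_image(soup):
    """Return the tag following the first image, or None if there is no image."""
    img = soup.find("img")
    return img.find_next(True) if img is not None else None


def extract_body(soup):
    # Regular article first, picture caption as the fallback.
    return first_text(soup, lambda s: s.find(class_="article"), caption_after_image)


def extract_time(soup):
    # span.time first, the element with id="pubtime" as the fallback.
    return first_text(
        soup,
        lambda s: s.find("span", {"class": "time"}),
        lambda s: s.find(id="pubtime"),
    )


# Two made-up pages: one regular article, one picture-only page.
article_page = '<span class="time">2017年06月27日</span><div class="article">Body text</div>'
picture_page = '<div id="pubtime">2017-06-27</div><div><img src="a.jpg"><p>Caption</p></div>'

for page in (article_page, picture_page):
    page_soup = BeautifulSoup(page, "html.parser")
    print(extract_time(page_soup), extract_body(page_soup))
```

Returning `None` when nothing matches lets the calling loop decide whether to skip the URL or record an empty value, rather than losing the whole row to a single exception.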

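As a general workflow for web scraping, I usually open the site itself in a browser first and locate the elements in the page's underlying HTML there.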
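The browser check can be complemented by a quick script that reports which of the expected elements a page actually contains before the full loop is run. This is only a sketch: the `report_selectors` helper and the `SELECTORS` table are hypothetical, and the CSS selectors simply restate the lookups used above.

```python
from bs4 import BeautifulSoup

# Labels mapped to CSS selectors for the elements the scraper relies on.
SELECTORS = {
    "title": "#title",
    "article body": ".article",
    "time span": "span.time",
    "pubtime": "#pubtime",
    "image": "img",
}


def report_selectors(html):
    """Return a dict saying whether each expected element is present."""
    soup = BeautifulSoup(html, "html.parser")
    return {label: soup.select_one(css) is not None for label, css in SELECTORS.items()}


# Made-up picture-only page used to try the helper out.
sample = '<div id="pubtime">2017-06-27</div><img src="a.jpg"><p>Caption</p>'
for label, present in report_selectors(sample).items():
    print(f"{label}: {'found' if present else 'missing'}")
```

Running this on a handful of URLs that fail in the main loop shows quickly which page layout each one uses.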
