Scraping data without re-scraping what has already been saved

Posted 2024-09-30 01:21:26


I have a simple BeautifulSoup script that periodically scrapes data from a site and saves it as a JSON file. Each time it runs, however, it walks the same set of URLs and scrapes the same data over again, along with whatever new content has been posted. How can I avoid this duplication?

I have tried pickling the URLs that were already scraped, but I can't work out the logic to stop the unnecessary re-scraping on later runs.

    import json
    import pickle
    import time

    import requests
    from bs4 import BeautifulSoup

    datalinks = []   # press-release URLs collected so far
    data = []        # scraped page content, one {url: html} dict per match

    # urlrange (the range of listing pages to walk) is defined earlier in the script
    for i in urlrange:
        urlbase = 'https://www.example.com/press-releases/Pages/default.aspx?page='
        targeturl = urlbase + str(i)
        req = requests.get(targeturl)
        r = req.content
        soup = BeautifulSoup(r, 'lxml')
        # collect every press-release link on this listing page
        for row in soup.find_all('table', class_='t-press'):
            for link in row.find_all('a'):
                link = link.get('href')
                link = 'https://www.example.com' + link
                if link not in datalinks:
                    datalinks.append(link)
                    #print('New link found!')

    # persist the collected links for the next run
    pickling_on = open("links_saved.pkl", "wb")
    pickle.dump(datalinks, pickling_on)
    pickling_on.close()

    # fetch and store the body of each collected press release
    for j in datalinks:
        req = requests.get(j)
        r = req.content
        soup = BeautifulSoup(r, 'lxml')
        for textdata in soup.find_all('div', class_='content-slim'):
            textdata = textdata.prettify()
            data.append({j: textdata})

    json_name = "Press_Data_{}.json".format(time.strftime("%d-%m-%y"))

    with open(json_name, 'w') as outfile:
        json.dump(data, outfile)

I want to scrape the data without having to revisit URLs that the script has already processed.
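One way to wire that together, as far as I read the question: reload links_saved.pkl at the start of each run, keep the already-scraped URLs in a set, and only fetch and parse links that are not in it. The sketch below reuses urlrange and the CSS classes from the question; PICKLE_PATH, seen_links and new_links are names introduced here purely for illustration.

    import json
    import os
    import pickle
    import time

    import requests
    from bs4 import BeautifulSoup

    PICKLE_PATH = "links_saved.pkl"

    # reload links saved by earlier runs; start with an empty set the first time
    if os.path.exists(PICKLE_PATH):
        with open(PICKLE_PATH, "rb") as f:
            seen_links = set(pickle.load(f))
    else:
        seen_links = set()

    new_links = []
    for i in urlrange:  # urlrange: the same listing-page range as in the question
        targeturl = 'https://www.example.com/press-releases/Pages/default.aspx?page=' + str(i)
        soup = BeautifulSoup(requests.get(targeturl).content, 'lxml')
        for row in soup.find_all('table', class_='t-press'):
            for a in row.find_all('a'):
                link = 'https://www.example.com' + a.get('href')
                if link not in seen_links:      # skip anything scraped on a previous run
                    seen_links.add(link)
                    new_links.append(link)

    # scrape only the URLs that were not seen before
    data = []
    for j in new_links:
        soup = BeautifulSoup(requests.get(j).content, 'lxml')
        for textdata in soup.find_all('div', class_='content-slim'):
            data.append({j: textdata.prettify()})

    # persist the full set of seen links for the next run
    with open(PICKLE_PATH, "wb") as f:
        pickle.dump(seen_links, f)

    json_name = "Press_Data_{}.json".format(time.strftime("%d-%m-%y"))
    with open(json_name, 'w') as outfile:
        json.dump(data, outfile)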


2 Answers

Try storing the links in a set:

datalinks = [ ]
unique_links = set(datalinks)

This removes all duplicate links, so only unique links will be processed.
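A tiny self-contained illustration of the effect (the URLs here are made-up placeholders):

links_on_page = ['https://www.example.com/a',
                 'https://www.example.com/b',
                 'https://www.example.com/a']   # hypothetical duplicate link

datalinks = set()
for link in links_on_page:
    datalinks.add(link)   # adding an already-present link is a no-op

print(len(datalinks))     # 2 -- the duplicate was ignored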

Try the following:

listwithdups = ['url1', 'url2', 'url3', 'url2', 'url4', 'url4']
seen = set()
# a comprehension cannot check membership against the list it is still building,
# so a helper set tracks which urls have already been kept
uniqueList = [i for i in listwithdups if not (i in seen or seen.add(i))]

The same logic written out as an explicit loop:

listwithdups = ['url1', 'url2', 'url3', 'url2', 'url4', 'url4']
uniqueList = []  # start with an empty list

for i in listwithdups:
    if i not in uniqueList:
        uniqueList.append(i)
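As a side note, the same order-preserving de-duplication can be written in one line with dict.fromkeys, since dicts keep insertion order from Python 3.7 onward:

listwithdups = ['url1', 'url2', 'url3', 'url2', 'url4', 'url4']

uniqueList = list(dict.fromkeys(listwithdups))  # keeps the first occurrence of each url
print(uniqueList)   # ['url1', 'url2', 'url3', 'url4']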
