Optimize scraping from a URL list and writing to CSV

Published 2024-09-29 20:21:59


I have a CSV with 20k+ URLs that I want to scrape, looking for the HTML element "super-attribute-select". If the element is found, write the URL to column A and the product number (SKU) to column B. If it is not found, write the URL to column C and the SKU to column D. Finally, save the DataFrame to a CSV file.

The code below works when I run it, but the program runs out of memory. I'd like to find a way to optimize it: right now ~1,500 URLs take 5 hours to process, and the full CSV has 20k.

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
from pandas import Series


urlList = pd.read_csv(r"url.csv")
urlList = urlList.url.tolist()
notfound = []
found = []
skulist = []
skumissinglist = []


# Fetch a url, parse it with BeautifulSoup, and sort it into the found /
# notfound lists depending on whether the select element is present
def scrap(url):
    tag = 'select'
    classused = "super-attribute-select"

    try:
        content = urllib.request.urlopen(url)
        soup = BeautifulSoup(content, features="html.parser")
        sku = soup.find("div", {"itemprop": "sku"}).string
        result = soup.find(tag, class_=classused)
        # soup.find returns None if it can't find anything
        if result is None:
            notfound.append(url)
            skumissinglist.append(sku)
        else:
            found.append(url)
            skulist.append(sku)

    except Exception:
        print("Some extraction went wrong")

    # Rebuild the whole DataFrame and rewrite Test.csv on every call
    d = dict(A=found, B=skulist, C=notfound, D=skumissinglist)
    df = pd.DataFrame({k: Series(v) for k, v in d.items()})
    df.to_csv('Test.csv')

for i in urlList:
    scrap(i)

Tags: csv, import, url, np, find, array, pd, soup
2 Answers

If I were doing this, I would try a few things:

(1) Update a dictionary instead of appending to lists. I think dictionaries are faster and more memory-efficient than lists.

(2) Instead of exporting each URL's result to a CSV with the same name, either (a) preferred: wait until you are finished and export all the results as a single CSV, or (b) worse: export them to different filenames using an f-string instead of overwriting "Test.csv" every time. A sketch of both ideas follows.
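A minimal sketch of points (1) and (2a) combined, assuming a single results dict keyed by URL; the record() helper and the key layout are illustrative assumptions, not the original code:

import pandas as pd

# One dict keyed by url (assumed layout) instead of four parallel lists
results = {}

def record(url, sku, has_select):
    results[url] = {"sku": sku, "has_select": has_select}

# ... call record(url, sku, ...) from the scraping loop ...

# Export once at the end (option a) rather than rewriting Test.csv per url.
# Option (b) would instead be df.to_csv(f"Test_{batch}.csv") per batch.
df = pd.DataFrame.from_dict(results, orient="index")
df.to_csv("Test.csv")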

You could use a pool with gevent, or the built-in pooling in urllib3 (or requests). Then you could run 10 or 100 at a time depending on the pool size, and use an async queue to work through the remainder as the pool drains. A thread-pool alternative using requests is sketched after the gevent example below.

from gevent import monkey
monkey.patch_all()
from gevent.pool import Pool as GeventPool
import pandas as pd
from pandas import Series
import requests
from bs4 import BeautifulSoup

urlList = pd.read_csv(r"url.csv")
urlList = urlList.url.tolist()
pool = GeventPool(10)
notfound = []
found = []
skulist = []
skumissinglist = []

# Fetch a url with requests, parse it with BeautifulSoup, and sort it into
# the found / notfound lists depending on whether the select element is present
def scrap(url):
    tag = 'select'
    classused = "super-attribute-select"

    try:
        content = requests.get(url).text
        soup = BeautifulSoup(content, features="html.parser")
        sku = soup.find("div", {"itemprop": "sku"}).string
        result = soup.find(tag, class_=classused)
        # soup.find returns None if it can't find anything
        if result is None:
            notfound.append(url)
            skumissinglist.append(sku)
        else:
            found.append(url)
            skulist.append(sku)

    except Exception:
        print("Some extraction went wrong")

# Run scrap over all urls, 10 greenlets at a time
pool.map(scrap, urlList)

# Export everything once at the end instead of rewriting the CSV per url
d = dict(A=found, B=skulist, C=notfound, D=skumissinglist)
df = pd.DataFrame({k: Series(v) for k, v in d.items()})
df.to_csv('Test.csv')
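For comparison, here is a sketch of the built-in pooling route the answer mentions: a requests Session (which pools connections via urllib3 under the hood) driven by a standard-library thread pool. The pool size, timeout, and fetch() helper are assumptions, not part of the original answer; urlList is the same list built above.

from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # reuses pooled TCP connections via urllib3

def fetch(url):
    # Download one page; return None as the body on any request error
    try:
        return url, session.get(url, timeout=10).text
    except requests.RequestException:
        return url, None

with ThreadPoolExecutor(max_workers=10) as executor:
    for url, html in executor.map(fetch, urlList):
        if html is None:
            continue  # failed request: skip it, or queue it for a retry
        soup = BeautifulSoup(html, features="html.parser")
        # ...same sku / super-attribute-select checks as scrap() above...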
