<p>有一个20k+网址的csv我想刮,找到html元素“超级属性选择”。如果找到,将url写入A列,同时将产品编号(sku)写入B列。如果找不到,将url写入C列,将sku写入D列。最后,将数据帧保存到csv文件。你知道吗</p>
<p>如果我运行下面的代码,它的工作,但我的程序运行的内存不足。它想找到一种优化的方法。现在~1500个URL需要5小时才能处理。而整个csv是20k</p>
<pre><code>import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from pandas import Series
urlList = pd.read_csv(r"url.csv")
urlList = urlList.url.tolist()
notfound = []
found = []
skulist =[]
skumissinglist =[]
# Function scrap, pass url, open with soup, and find class
def scrap(url):
tag ='select'
classused = "super-attribute-select"
d = dict(A=np.array(found), B=np.array(skulist), C=np.array(notfound), D=np.array(skumissinglist))
try:
content = urllib.request.urlopen(url)
soup = BeautifulSoup(content, features="html.parser")
sku= soup.find("div", {"itemprop": "sku"}).string
result = soup.find(tag, class_=classused)
#soup returns None if can't find anything
if result == None:
notfound.append(url)
skumissinglist.append(sku)
else:
found.append(url)
skulist.append(sku)
except:
result = print("Some extraction went wrong")
df = pd.DataFrame(dict([(k, Series(v)) for k, v in d.items()]))
df = df.to_csv('Test.csv')
for i in urlList:
scrap(i)
</code></pre>