Pickling a list scraped with BS4 vs. preset data (why does it behave differently?)

Posted 2024-09-30 02:18:08


I am trying to save this scraped data to a file (pickle it), but I can't figure out why it can't be pickled with the following code:

import pickle
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

url = "https://www.imdb.com/list/ls016522954/?ref_=nv_tvv_dvd"

dvdArray = []

# Download the page and parse it
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
html_soup = BeautifulSoup(webpage, 'html5lib')

# Collect the title text of every list entry
dvdNames = html_soup.find_all("div", class_="lister-item-content")
for dvd in dvdNames:
    dvdArray.append(dvd.a.string)

viewtitles = input("Finished! Do you want to view the DVD titles? (Y/N): ")
if viewtitles.casefold() == "y":
    for num, name in enumerate(dvdArray, start=1):
        print(str(num) + " - " + name)
elif viewtitles.casefold() == "n":
    print("Not showing titles!")
else:
    print("That is not an option!")

saveToFile = input("Do you want to save / update the data? (Y/N): ")
if saveToFile.casefold() == "y":
    with open("IMDBDVDNames.dat", "wb") as f:
        pickle.dump(dvdArray, f)  # this is the call that raises the error
elif saveToFile.casefold() == "n":
    print("Data not saved!")
else:
    print("That's not one of the options!")

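I have tried adding sys.setrecursionlimit(1000000), but it makes no difference (FYI), and I get the error "maximum recursion depth exceeded while pickling an object". However, when I run the following code: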

import pickle

testarray = []

if input("1 or 2?: ") == "1":
    testarray = ['1917', 'Onward', 'The Hunt', 'The Invisible Man', 'Human Capital', 'Dolittle', 'Birds of Prey: And the Fantabulous Emancipation of One Harley Quinn', 'The Gentlemen', 'Bloodshot', 'The Way Back', 'Clemency', 'The Grudge', 'I Still Believe', 'The Song of Names', 'Treadstone', 'Vivarium', 'Star Wars: Episode IX - The Rise of Skywalker', 'The Current War', 'Downhill', 'The Call of the Wild', 'Resistance', 'Banana Split', 'Bad Boys for Life', 'Sonic the Hedgehog', 'Mr. Robot', 'The Purge', 'VFW', 'The Other Lamb', 'Slay the Dragon', 'Clover', 'Lazy Susan', 'Rogue Warfare: The Hunt', 'Like a Boss', 'Little Women', 'Cats', 'Madam Secretary', 'Escape from Pretoria', 'The Cold Blue', 'The Night Clerk', 'Same Boat', 'The 420 Movie: Mary & Jane', 'Manou the Swift', 'Gold Dust', 'Sea Fever', 'Miles Davis: Birth of the Cool', 'The Lost Husband', 'Stray Dolls', 'Mortal Kombat Legends: Scorpions Revenge', 'Just Mercy', 'The Righteous Gemstones', 'Criminal Minds', 'Underwater', 'Final Kill', 'Green Rush', 'Butt Boy', 'The Quarry', 'Abe', 'Bad Therapy', 'Yip Man 4', 'The Last Full Measure', 'Looking for Alaska', 'The Turning', 'True History of the Kelly Gang', 'To the Stars', 'Robert the Bruce', 'Papa, sdokhni', 'The Rhythm Section', 'Arrow', 'The Assistant', 'Guns Akimbo', 'The Dark Red', 'Dreamkatcher', 'Fantasy Island', 'The Etruscan Smile', "A Nun's Curse", 'Allagash']
    with open("test.dat", "wb") as f:
        pickle.dump(testarray, f)
else:
    with open("test.dat", "rb") as f:
        testarray = pickle.load(f)

print(testarray)

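with exactly the same information (at least I hope it is the same; I ran print(dvdArray) and copied the list from that output, FYI), it lets me pickle it without any problem.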

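Can someone tell me why this happens and how to fix it?

I know I am scraping the data from a website and turning it into a list, but I can't work out what causes the error in example 1 and not in example 2.

Any help would be greatly appreciated.

Thanks,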

利特尔基弗


2 Answers

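BeautifulSoup objects are highly recursive, which makes them very hard to pickle. When you call dvdArray.append(dvd.a.string), dvd.a.string is not a Python string but a bs4.element.NavigableString, which is one of those complex objects. By calling strip() you are in fact converting the bs4.element.NavigableString into a plain Python string, which pickles easily. The same would be true if you used dvd.a.getText().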

For future reference: when pickling, always remember to convert BeautifulSoup objects into simpler Python objects where possible.
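To make the difference visible, here is a minimal sketch. The HTML snippet is made up for illustration and only imitates the structure the question's code searches for; it uses the built-in "html.parser" instead of html5lib so that no extra parser needs to be installed. It shows that the scraped values look like ordinary strings when printed but are NavigableString objects tied to the parse tree, while str() and strip() produce plain strings.

```python
import pickle

from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Made-up HTML standing in for the IMDb list page (an assumption for illustration).
html = """
<div class="lister-item-content"><a href="#">1917</a></div>
<div class="lister-item-content"><a href="#"> Onward </a></div>
"""

soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("div", class_="lister-item-content")

raw = [item.a.string for item in items]                 # NavigableString objects
plain = [str(item.a.string).strip() for item in items]  # plain str objects

# Both lists print like lists of strings, but the element types differ.
print(raw, [type(s).__name__ for s in raw])
print(plain, [type(s).__name__ for s in plain])

# Each NavigableString keeps a reference back into the parse tree ...
print(raw[0].parent.name)

# ... whereas the plain strings carry nothing extra and pickle cleanly.
restored = pickle.loads(pickle.dumps(plain))
print(restored)
```

This is also why copying the printed list into a script, as in the second example of the question, works: the copied literals are plain str values with no link to the soup.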

In case anyone is curious: I added strip() when appending to dvdArray, and it worked.

dvdArray.append(dvd.a.string.strip())
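If the list has already been built, the same idea can be applied just before saving. The following is only a sketch under assumptions: the helper names to_plain_strings, save_titles and load_titles are hypothetical, and TreeText is a made-up str subclass standing in for a scraped string so that the example runs with the standard library alone. Note that strip() also removes leading and trailing whitespace, so the saved titles may differ from the scraped ones in that respect.

```python
import os
import pickle
import tempfile


class TreeText(str):
    """Hypothetical stand-in for a scraped string (a str subclass)."""


def to_plain_strings(items):
    """Return a new list in which every element is an exact built-in str."""
    return [str(item).strip() for item in items]


def save_titles(titles, path):
    """Pickle the titles after reducing them to plain strings."""
    with open(path, "wb") as f:
        pickle.dump(to_plain_strings(titles), f)


def load_titles(path):
    """Read the pickled list of titles back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


scraped = [TreeText(" 1917 "), TreeText("Onward")]

# Write to a temporary directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "IMDBDVDNames.dat")
save_titles(scraped, path)
loaded = load_titles(path)
print(loaded)
```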
