类变量字典未与一起保存泡菜。倾倒在Python2.7中

2024-10-01 13:37:23 发布

您现在位置:Python中文网/ 问答频道 /正文

我使用pickle通过转储根来保存对象图。当我加载根时,它有所有的实例变量和连接的对象节点。不过,我将所有节点保存在dictionary类型的类变量中。类变量在保存前已满,但在我取消数据拾取后,它是空的。在

以下是我使用的课程:

class Page():

    __crawled = {}

    def __init__(self, title = '', link = '', relatedURLs = []):
        self.__title = title
        self.__link = link
        self.__relatedURLs = relatedURLs
        self.__related = [] 

    @property
    def relatedURLs(self):
        return self.__relatedURLs

    @property
    def title(self):
        return self.__title

    @property
    def related(self):
        return self.__related

    @property
    def crawled(self):
        return self.__crawled

    def crawl(self,url):
        if url not in self.__crawled:
            webpage = urlopen(url).read()
            patFinderTitle = re.compile('<title>(.*)</title>')
            patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
            patFinderRelated = re.compile('<li><a href="([^"]*)"')

            findPatTitle = re.findall(patFinderTitle, webpage)
            findPatLink = re.findall(patFinderLink, webpage)
            findPatRelated = re.findall(patFinderRelated, webpage)
            newPage = Page(findPatTitle,findPatLink,findPatRelated)
            self.__related.append(newPage)
            self.__crawled[url] = newPage
        else:
            self.__related.append(self.__crawled[url])

    def crawlRelated(self):
        for link in self.__relatedURLs:
            self.crawl(link)

我是这样保存的:

^{pr2}$

我是这样装的:

def loadGraph(filename): #returns root
    with open(filename,'r') as inf:
        return pickle.load(inf)

root = loadGraph('medTwiceGraph.dat')

除了类变量\uuu crawled之外的所有数据加载。在

我做错什么了?在


Tags: selfreurlreturntitledeflinkproperty
3条回答

对于任何感兴趣的人,我所做的就是制作一个包含实例变量的超类图,并将我的爬行函数移动到图中。页面现在只包含描述页面及其相关页面的属性。我pickle我的Graph实例,其中包含我的所有Page实例。这是我的密码。在

from urllib import urlopen
#from bs4 import BeautifulSoup
import re
import pickle

###################CLASS GRAPH####################
class Graph(object):
    def __init__(self,roots = [],crawled = {}):
        self.__roots = roots
        self.__crawled = crawled
    @property
    def roots(self):
        return self.__roots
    @property
    def crawled(self):
        return self.__crawled
    def crawl(self,page,url):
        if url not in self.__crawled:
            webpage = urlopen(url).read()
            patFinderTitle = re.compile('<title>(.*)</title>')
            patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
            patFinderRelated = re.compile('<li><a href="([^"]*)"')

            findPatTitle = re.findall(patFinderTitle, webpage)
            findPatLink = re.findall(patFinderLink, webpage)
            findPatRelated = re.findall(patFinderRelated, webpage)
            newPage = Page(findPatTitle,findPatLink,findPatRelated)
            page.related.append(newPage)
            self.__crawled[url] = newPage
        else:
            page.related.append(self.__crawled[url])

    def crawlRelated(self,page):
        for link in page.relatedURLs:
            self.crawl(page,link)
    def crawlAll(self,obj,limit = 2,i = 0):
        print 'number of crawled pages:', len(self.crawled)
        i += 1
        if i > limit:
            return
        else:
            for rel in obj.related:
                print 'crawling', rel.title
                self.crawlRelated(rel)
            for rel2 in obj.related:
                self.crawlAll(rel2,limit,i)          
    def loadGraph(self,filename):
        with open(filename,'r') as inf:
            return pickle.load(inf)
    def saveGraph(self,obj,filename):
        with open(filename,'w') as outf:
            pickle.dump(obj,outf)
###################CLASS PAGE#####################
class Page(Graph):
    def __init__(self, title = '', link = '', relatedURLs = []):
        self.__title = title
        self.__link = link
        self.__relatedURLs = relatedURLs
        self.__related = []      
    @property
    def relatedURLs(self):
        return self.__relatedURLs 
    @property
    def title(self):
        return self.__title
    @property
    def related(self):
        return self.__related
####################### MAIN ######################
def main(seed):
    print 'doing some work...'
    webpage = urlopen(seed).read()

    patFinderTitle = re.compile('<title>(.*)</title>')
    patFinderLink = re.compile('<link rel="canonical" href="([^"]*)" />')
    patFinderRelated = re.compile('<li><a href="([^"]*)"')

    findPatTitle = re.findall(patFinderTitle, webpage)
    findPatLink = re.findall(patFinderLink, webpage)
    findPatRelated = re.findall(patFinderRelated, webpage)

    print 'found the webpage', findPatTitle

    #root = Page(findPatTitle,findPatLink,findPatRelated)
    G = Graph([Page(findPatTitle,findPatLink,findPatRelated)])
    print 'crawling related...'
    G.crawlRelated(G.roots[0])
    G.crawlAll(G.roots[0])  
    print 'now saving...'
    G.saveGraph(G, 'medTwiceGraph.dat')
    print 'done'
    return G
#####################END MAIN######################

#'http://medtwice.com/am-i-pregnant/'
#'medTwiceGraph.dat'

#G = main('http://medtwice.com/menopause-overview/')
#print G.crawled


def loadGraph(filename):
    with open(filename,'r') as inf:
        return pickle.load(inf)

G = loadGraph('MedTwiceGraph.dat')
print G.roots[0].title
print G.roots[0].related
print G.crawled

for key in G.crawled:
    print G.crawled[key].title

默认情况下,pickle只使用self.__dict__的内容,而不使用您认为想要的self.__class__.__dict__。在

我说“你认为你想要什么”是因为取消一个实例不应该改变类级别的状态。在

如果你想改变这种行为,那么看看__getstate__和{}in the docs

Python并没有真正地pickle类对象。它只是保存它们的名字和在哪里找到它们。根据^{}的文档:

Similarly, classes are pickled by named reference, so the same restrictions in the unpickling environment apply. Note that none of the class’s code or data is pickled, so in the following example the class attribute attr is not restored in the unpickling environment:

class Foo:
    attr = 'a class attr'

picklestring = pickle.dumps(Foo)

These restrictions are why picklable functions and classes must be defined in the top level of a module.

Similarly, when class instances are pickled, their class’s code and data are not pickled along with them. Only the instance data are pickled. This is done on purpose, so you can fix bugs in a class or add methods to the class and still load objects that were created with an earlier version of the class. If you plan to have long-lived objects that will see many versions of a class, it may be worthwhile to put a version number in the objects so that suitable conversions can be made by the class’s __setstate__() method.

在您的示例中,您可以解决将__crawled更改为实例属性或全局变量的问题。在

相关问题 更多 >