短距离负距离？什么？

2024-09-28 23:03:55 发布

男 | 程序猿一只，喜欢编程写python代码。

我有一个输入文件，其中包含小数点后4位的浮点数：

i.e. 13359    0.0000    0.0000    0.0001    0.0001    0.0002`    0.0003    0.0007    ...

（第一个是id）。我的类使用loadVectorsFromFile方法将其乘以10000，然后int()这些数字。除此之外，我还循环遍历每个向量，以确保其中没有负值。然而，当我执行_hclustering时，我不断地看到错误，"LinkageZcontains negative values"。

我真的认为这是个错误，因为：

我检查了我的价值观
这些值的大小不足以接近浮点数和
我用来导出文件中值的公式使用绝对值（我的输入绝对正确）。

有人能告诉我为什么我会看到这个奇怪的错误吗？是什么导致了这个负距离误差？

===

def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distance, we use 4 decimal points, so *10000
    """
    vectors = {}
    self.winfo("Each vector is set to have %d limit in length" % limit)
    with open( loc ) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit+1])
            else:
                scores = map(float, l[1:])

            if inflate:        
                vectors[ l[0]] = map( lambda x: int(x*10000), scores)     #int might save space
            else:
                vectors[ l[0]] = scores                           

    if assertAllPositive:
        #Assert that it has no negative value
        for dirID, l in vectors.iteritems():
            if reduce(operator.or_, map( lambda x: x < 0, l)):
                self.werror( "Vector %s has negative values!" % dirID)
    return vectors

def main( self, inputDir, outputDir, limit=0,
        inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vector from a file and start clustering
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate( pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)

    vectors = self.loadVectorsFromFile( limit, pjoin( inputDir, inFname))
    for threshold in map( lambda x:float(x)/30, range(20,30)):
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w') as outf:
                for clusterNo, cluster in clusters.iteritems():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outline = "%s\t%s\n" % (feature, group)
                        outf.write(outline.encode('utf-8'))
                    outf.write("\n")
        else:
            continue

def _hclustering(self, threshold, vectors):
    """function which you should call to vary the threshold
    vectors:    { featureID:    [ tfidf scores, tfidf score, .. ]
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata( vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False

        for i, featureID in enumerate( vectors.keys()):

Tags： in self map for threshold if outputdir clusters

0条回答

目前没有回答

短距离负距离？什么？

相关问题更多 >

编程相关推荐

热门问题

热门文章

短距离负距离？什么？

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >