Indexing a product list with Django, Haystack and Whoosh takes too long

Posted 2024-10-03 06:24:26


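I am having trouble indexing a product list (about 280k rows) with Haystack and Whoosh. Running the index update appears to take more than 28 hours, which does not seem like a reasonable amount of time to me.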

I have a model:

    from django.db import models

    class SupplierSkus(models.Model):
        sku = models.CharField(max_length=20)
        link = models.CharField(max_length=4096)
        price = models.FloatField()
        last_updated = models.DateTimeField("Date Updated", null=True, auto_now=True)
        status = models.ForeignKey(Status, on_delete=models.PROTECT, default=1)
        category = models.CharField(max_length=1024)
        family = models.CharField(max_length=20)
        family_desc = models.TextField(null=True)
        family_name = models.CharField(max_length=250)
        product_name = models.CharField(max_length=250)
        was_price = models.FloatField(null=True)
        vat_rate = models.FloatField(null=True)
        lead_from = models.IntegerField(null=True)
        lead_to = models.IntegerField(null=True)
        deliv_cost = models.FloatField(null=True)
        prod_desc = models.TextField(null=True)
        attributes = models.TextField(null=True)
        brand = models.TextField(null=True)
        mpn = models.CharField(max_length=50, null=True)
        ean = models.CharField(max_length=15, null=True)
        supplier = models.ForeignKey(Suppliers, on_delete=models.PROTECT)

I have a search_index.py:

    from haystack import indexes
    from products.models import SupplierSkus

    class ProductIndex(indexes.SearchIndex, indexes.Indexable):
        text = indexes.CharField(document=True, use_template=True)
        sku = indexes.CharField(model_attr='sku')
        category = indexes.CharField(model_attr='category')
        product_name = indexes.CharField(model_attr='product_name')
        family_name = indexes.CharField(model_attr='family_name')
        prod_desc = indexes.CharField(model_attr='prod_desc')
        family_desc = indexes.CharField(model_attr='family_desc')
        brand = indexes.CharField(model_attr='brand')
        mpn = indexes.CharField(model_attr='mpn')
        ean = indexes.CharField(model_attr='ean')
        attributes = indexes.CharField(model_attr='attributes')


        def get_model(self):
            return SupplierSkus

        def index_queryset(self, using=None):
            # Restrict the index to rows with status_id = 1
            return SupplierSkus.objects.filter(status_id=1)

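I have noticed that Django versions after 2 show a large performance drop when iterating over big querysets. I am not sure why, but these days I usually have to call .iterator() when working with large datasets, or paginate, or use raw SQL directly, which seems to be the fastest way to handle large datasets.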
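
To make the pagination idea concrete, here is a minimal sketch of walking a large table in primary-key order, one batch at a time, using raw SQL. It uses the standard-library sqlite3 module instead of the Django ORM so that it runs on its own. The table name supplier_skus, its three columns, and the helper names build_demo_db and iter_in_batches are assumptions made up for the example, not the real schema or any Haystack API.

```python
import sqlite3


def build_demo_db(n=2500):
    """Create an in-memory table with n rows; every fifth row gets status_id 2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE supplier_skus ("
        "id INTEGER PRIMARY KEY, sku TEXT, status_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO supplier_skus (id, sku, status_id) VALUES (?, ?, ?)",
        [(i, f"SKU{i:06d}", 2 if i % 5 == 0 else 1) for i in range(1, n + 1)],
    )
    conn.commit()
    return conn


def iter_in_batches(conn, batch_size=1000):
    """Yield lists of (id, sku) rows with status_id = 1, ordered by id.

    Each query resumes after the last id seen (keyset pagination), so only
    one batch is held in memory at a time and no OFFSET scan is needed.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, sku FROM supplier_skus "
            "WHERE status_id = 1 AND id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]


if __name__ == "__main__":
    conn = build_demo_db()
    for batch in iter_in_batches(conn, batch_size=750):
        print(len(batch), "rows, ids", batch[0][0], "to", batch[-1][0])
```

Each batch here is a plain list of tuples, not a QuerySet, so it cannot be returned from index_queryset as-is.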

But I cannot pass a list to Haystack:

class must return a 'QuerySet' in the 'index_queryset' method.

Given that I have to return a QuerySet, how can I get this done in a reasonable amount of time?

