PyLucene: adding field contents to a document doesn't work

Posted 2024-09-30 14:30:08


I am indexing URL pages with PyLucene.

I get an error when I try to add a field to a document, and I don't know why. The error says:

JavaError: <...>, Java stacktrace:
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
        at org.apache.lucene.document.Field.<init>(Field.java:249)

The statement it points at is: doc.add(Field("contents", text, t2))

The Python code I'm using is:

import os
import re
import urllib2
import urlparse
import robotparser

import lucene
from bs4 import BeautifulSoup
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory


def IndexerForUrl(start, number, domain):

    lucene.initVM()
    # join base dir and index dir
    path = os.path.abspath("paths")
    directory = SimpleFSDirectory(Paths.get(path))  # the index

    analyzer = StandardAnalyzer()
    writerConfig = IndexWriterConfig(analyzer)
    writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    writer = IndexWriter(directory, writerConfig)

    print "reading lines from sys.std..."

    # adjacency dictionary: page -> set of outgoing links
    D = {}
    D[start] = [start]

    numVisited = 0
    wordBool = False

    queue = [start]
    visited = set()

    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)

    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)

    while numVisited < number and queue and not wordBool:
        pg = queue.pop(0)

        if pg not in visited:
            visited.add(pg)

            htmlwebpg = urllib2.urlopen(pg).read()

            # robot exclusion standard
            rp = robotparser.RobotFileParser()
            rp.set_url(urlparse.urljoin(pg, "/robots.txt"))
            rp.read()  # read robots.txt and feed it to the parser

            soup = BeautifulSoup(htmlwebpg, 'html.parser')

            # drop script/style elements before extracting text
            for script in soup(["script", "style"]):
                script.extract()
            text = soup.get_text()

            # normalize whitespace: strip lines, break on double spaces,
            # drop empty chunks
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            text = '\n'.join(chunk for chunk in chunks if chunk)

            print text

            doc = Document()
            doc.add(Field("urlpath", pg, t2))
            if len(text) > 0:
                doc.add(Field("contents", text, t2))
            else:
                print "warning: no content in %s " % pg

            writer.addDocument(doc)

            numVisited = numVisited + 1

            # collect outgoing absolute links allowed by robots.txt
            linkset = set()
            for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
                if rp.can_fetch("*", link.get('href')):
                    linkset.add(link.get('href'))

            D[pg] = linkset
            queue.extend(D[pg] - visited)

    writer.commit()
    writer.close()
    directory.close()  # close the index
    return writer
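As an aside, the whitespace-cleanup step in the crawler above can be exercised on its own, independently of Lucene and BeautifulSoup. A minimal stdlib-only sketch (the input string is made up for illustration):

```python
def clean_text(text):
    # strip each line, break lines on double spaces, drop empty chunks
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return '\n'.join(chunk for chunk in chunks if chunk)

cleaned = clean_text("  Hello  World \n\n   Example page   \n")
# cleaned == "Hello\nWorld\nExample page"
```

This matches the three generator expressions in the loop body, so it can be used to check that the "no content" warning branch only fires when the page really yields no visible text.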

1 Answer

Answered 2024-09-30 14:30:08

If a field is neither indexed nor stored, it will not be represented in the index in any way, so such a field is meaningless. I'm guessing you want field type t2 to be indexed. For that you need to set the IndexOptions, something like:

t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
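Putting the answer together, the full t2 configuration would look something like the fragment below. This is a sketch assuming a PyLucene version (6.x or later) where IndexOptions lives in org.apache.lucene.index; it needs a running JVM (lucene.initVM()) to execute:

    from org.apache.lucene.document import FieldType
    from org.apache.lucene.index import IndexOptions

    t2 = FieldType()
    t2.setStored(False)       # don't keep the raw text in the index
    t2.setTokenized(True)     # run the value through the analyzer
    # make the field indexed, with term frequencies and positions
    t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)

With this change, Field("contents", text, t2) is both tokenized and indexed (though not stored), so the IllegalArgumentException no longer fires.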
