PDFQuery: getting the page number an element is on

import pdfquery

pdf = pdfquery.PDFQuery("tests/samples/priceList.pdf")
pdf.load()

code = "92005G"

def exactText():
    element = str(vars(this))
    text = str("u'" + code + "\\n'")
    if text in element:
        return True
    return False

#This should work if i could select the page where the element is located
#page = pdf.pq('LTPage:contains("'+code+'")')
#pageNum = page.attr('pageid')

#Here I would replace the "8" with the page number i get, or remove the LTPage
#selector all together if i need to find the element first and then the page
label = pdf.pq('LTPage[page_index="8"] LTTextLineHorizontal:contains("'+code+'")').filter(exactText)

#Since we could use "JQuery selectors" i tried using ".closest", but it returns nothing
#page = label.closest('LTPage')
#pageNum = page.attr('pageid')

left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))

#Here I would replace the "8" with the page number i get
price = pdf.pq('LTPage[page_index="8"] LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % (left_corner+110, bottom_corner, left_corner+140, bottom_corner+20)).text()
print price

1 answer

User

#1 · Posted 2024-09-28 13:19:19

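There may be a more elegant way, but what I use to find the page an element is on is .iterancestors('LTPage'). The sample code below finds every instance of "My Text" and tells you which page each one is on: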

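# Each item yielded here is an lxml element, so iterancestors() is available
for pq in pdf.pq('LTTextLineHorizontal:contains("My Text")'):
    # Use just the first (nearest) LTPage ancestor; next() works on Python 2.6+ and 3
    page_pq = next(pq.iterancestors('LTPage'))
    print('Found the text "%s" on page %s' % (pq.layout.get_text(), page_pq.layout.pageid))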
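
Applied to the price lookup in the question, the same idea might look roughly like the sketch below. It is not tested against the real priceList.pdf. The helper names find_page and price_for_code are made up for illustration, the 110/140/20 offsets are simply the ones hard-coded in the question, and it assumes each LTPage element carries the page_index attribute that the question's selector already relies on. The exact-match check uses strip(), which is slightly looser than the question's exactText() filter.

```python
def find_page(element):
    """Return the nearest enclosing LTPage element, or None if there is none."""
    # iterancestors() is lxml's ancestor iterator; the tag argument filters it.
    return next(element.iterancestors('LTPage'), None)


def price_for_code(pdf, code, dx=(110, 140), dy=(0, 20)):
    """Return (page_index, price_text) for the line whose text is exactly `code`.

    `pdf` is assumed to be a loaded pdfquery.PDFQuery object. The dx/dy
    offsets are the layout-specific numbers taken from the question.
    """
    selector = 'LTTextLineHorizontal:contains("%s")' % code
    for line in pdf.pq(selector):
        # Skip partial matches such as "92005G-X"; keep only exact text.
        if line.layout.get_text().strip() != code:
            continue
        page = find_page(line)
        if page is None:
            continue
        # Assumption: LTPage elements expose page_index, as in the question.
        page_index = page.get('page_index')
        x0, y0 = float(line.get('x0')), float(line.get('y0'))
        bbox = (x0 + dx[0], y0 + dy[0], x0 + dx[1], y0 + dy[1])
        # Search for the price only on the page where the code was found.
        price = pdf.pq(
            'LTPage[page_index="%s"] LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")'
            % ((page_index,) + bbox)
        ).text()
        return page_index, price
    return None
```

With the question's setup you would call price_for_code(pdf, "92005G") after pdf.load(). Finding the element first and then walking up to its page means the "8" no longer has to be known in advance.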

I hope this helps! :)
