Is it possible to use regular expressions in pdfquery?

import pdfquery

pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()

# Find the "Cash" label, then read the text in a box just below it.
label = pdf.pq('LTTextLineHorizontal:contains("Cash")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
cash = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' %
              (left_corner, bottom_corner - 30,
               left_corner + 150, bottom_corner)).text()
print(cash)  # '179,000.00'

1 answer

User

#1 · Posted 2024-09-30 16:19:36

This is not exactly a regular-expression search, but it lets you format or filter whatever was extracted:

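import re

import pdfquery

def regex_function(pattern, text):
    # Return the first capture group if the pattern matches, otherwise None.
    re_obj = re.search(pattern, text)
    if re_obj is not None and len(re_obj.groups()) > 0:
        return re_obj.group(1)
    return None

pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()

# Replace SOME_PATTERN_HERE with a regex containing at least one capture group.
pdf.extract([
    ('with_parent', 'LTPage[pageid=1]'),
    ('with_formatter', 'text'),
    # A callable formatter receives the PyQuery match, so pass its text to the regex.
    ('year', 'LTTextLineHorizontal:contains("Form 1040A (")',
        lambda match: regex_function(SOME_PATTERN_HERE, match.text())),
])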
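To make the placeholder pattern concrete, here is a minimal sketch of the regex helper on its own, without pdfquery. It assumes the matched line reads something like "Form 1040A (2007)"; that sample string, the `first_group` name, and the `year_pattern` value are illustrative assumptions, not taken from the actual PDF.

```python
import re

def first_group(pattern, text):
    """Return the first capture group of pattern in text, or None."""
    m = re.search(pattern, text)
    if m is not None and m.groups():
        return m.group(1)
    return None

# Hypothetical sample text; the real line in the PDF may read differently.
sample = "Form 1040A (2007)"

# Four digits inside parentheses, captured as group 1.
year_pattern = r"\((\d{4})\)"

print(first_group(year_pattern, sample))
print(first_group(year_pattern, "no year here"))
```

A pattern like this could stand in for the placeholder in the extract call, provided the formatter hands the helper a string rather than the match object.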

I have not tested the next one, but it might also be useful:

[code snippet not preserved in the source]
