Java: removing invisible text from a PDF using PDFBox

Link to pdf

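When I try to extract the text from the PDF above, I get a mix of text that is invisible in the Evince viewer and text that is visible. In addition, some of the desired text is missing characters that are not missing in the viewer, for example the "S" in "FALCONS" and many of the "½" characters. I believe this is caused by interference from the invisible text: when highlighting the PDF in a viewer, you can see the invisible text overlapping the visible text.

Is there a way to remove the invisible text? Or is there another solution?

Code: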

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;


public class App {

    public static String getPdfText(String pdfPath) throws IOException {
        File file = new File(pdfPath);
        PDDocument document = null;
        PDFTextStripper textStripper = null;
        String text = null;

        try {
            document = PDDocument.load(file);
            textStripper = new PDFTextStripper();
            textStripper.setEndPage(1);
            text =  textStripper.getText(document);
        } catch (IOException e) {
            throw new IOException("Could not load file and strip text.", e);
        } finally {
            try {
                if (document != null)
                    document.close();
            } catch (IOException e) {
                System.out.println("Could not close document");
            }
        }

        return text;
    }

    public static void main(String[] args) {
        String filename = "RevTeaser09072016.pdf";
        String text = null;

        try {
            text = getPdfText(filename);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }

        System.out.println(text);
    }
}

Output (the bold text is the desired text):

145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
Selections
Number of Teams
Amount Bet
REVERSE tEaSER caRd
mark box as shown 
 denotes home team
PRO FOOTBALL - THURSDAY,  NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:00 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:00 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:00 8 CARDINALS17– ½ 3+ ½
9 BUCCANEERS  PM1:00 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:00 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:00 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:00 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:00 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEELERS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,000
145
143
159
144
160
141
157155 156154150 153149 152148 151147
142
158
500
146
Selections
Number of Teams
Amount Bet
REVERSE tEaSER caRd
mark box as hown 
 denotes home team
PRO FOOTBALL - THURSDAY,  NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO FOOTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:00 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:00 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:00 8 CARDINALS17– ½ 3+ ½
9 BUCCANEERS  PM1:00 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:00 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:00 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:00 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:00 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEEL RS ★7– ½ 6– ½
PRO FOOTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,000
145
143
159
14
160
41
15715 156154150 153149 152148 51147
142
158
50
146
S lections
Number of Teams
Amount Bet

ark box as sho n 
 denotes home team
PRO F OTBALL - THURSDAY, NOVEMBER 15, 2012
1 BILLS ★ NFL  PM8:25 2 DOLPHINS7– ½ 6– ½
PRO F OTBALL - SUNDAY, NOVEMBER 18, 2012
3 REDSKINS ★  PM1:0 4 EAGLES10– ½ 3– ½
5 PACKERS  PM1:0 6 LIONS ★10– ½ 3– ½
7 FALCONS ★  PM1:0 8 CARDINALS17– ½ 3+ ½
9 BU CANEERS  PM1:0 10 PANTHERS ★7– ½ 6– ½
11 COWBOYS ★  PM1:0 12 BROWNS14– ½ + ½
13 RAMS ★  PM1:0 14 JETS10– ½ 3– ½
15 PATRIOTS ★  PM4:25 16 COLTS17– ½ 3+ ½
17 TEXANS ★  PM1:0 18 JAGUARS23– ½ 9+ ½
19 BENGALS  PM1:0 20 CHIEFS ★10– ½ 3– ½
21 SAINTS  PM4:05 22 RAIDERS ★12– ½ 1– ½
23 BRONCOS ★  PM4:25 24 CHARGERS14– ½ + ½
25 RAVENS NBC  PM8:30 26 STEELERS ★7– ½ 6– ½
PRO F OTBALL - MONDAY, NOVEMBER 19, 2012
27 49ERS ★ ESPN  PM8:40 28 BEARS10– ½ 3– ½
1,0
MARK BOX AS SHOWN 
DENOTES HOME TEAM
PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
 1 PANTHERS    nbc  - 10½ 8:30p 2 BRONCOS   - 3½
 PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
  FALCON      - 9  1:00p 4 BUCCANEERS  - 4½
 5 VIKINGS   - 9½ 1:00p 6 TITANS  - 4½
 7 EAGLES  - 10½ 1:00p 8 BROWNS  - 3½
 9 BENGALS - 9½ 1:00p 10 JETS  - 4½
 11 SAINTS    - 7½ 1:00p 12 RAIDERS   - 6½
 13 CHIEFS  - 14½ 1:00p 14 CHARGERS  + ½
 15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
 17 TEXANS  - 14  1:00p 18 BEARS + ½
 19 PACKERS - 12  1:00p 20 JAGUARS  - 1½
 21 SEAHAWKS    - 17½ 4:05p 22 DOLPHINS + 3½
 23 COWBOYS    - 7½ 4:25p 24 GIANTS - 6½
 25 COLTS     - 10½ 4:25p 26 LIONS - 3½
 27 CARDINALS   nbc  - 14½ 8:30p 28 PATRIOTS + ½
 PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
 29 STEELERS  espn  - 10½ 7:10p 30 REDSKINS  - 3½
 31 RAMS  espn  - 9  10:20p 32 49ERS  - 4½

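1 answer

  1. # Answer 1

    The invisible text in the OP's sample PDF is mostly made invisible in two ways: by defining clip paths (with the text located outside their bounds) and by filling paths (which hides the text underneath). Thus, we have to consider path-related instructions during text extraction in order to ignore the invisible text.

    Unfortunately, the callbacks designed for these instructions are declared neither in PDFTextStripper nor in its parent classes LegacyPDFStreamEngine and PDFStreamEngine.

    They are declared, however, in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and reasonably implemented in PageDrawer.

    To make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, for example: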

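    // Imports assume PDFBox 2.0.x package names.
    import java.awt.geom.Area;
    import java.awt.geom.GeneralPath;
    import java.awt.geom.Point2D;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.pdfbox.contentstream.operator.MissingOperandException;
    import org.apache.pdfbox.contentstream.operator.Operator;
    import org.apache.pdfbox.contentstream.operator.OperatorProcessor;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSNumber;
    import org.apache.pdfbox.pdmodel.graphics.state.PDGraphicsState;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;
    import org.apache.pdfbox.util.Matrix;
    import org.apache.pdfbox.util.Vector;

    public class PDFVisibleTextStripper extends PDFTextStripper {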
        public PDFVisibleTextStripper() throws IOException {
            addOperator(new AppendRectangleToPath());
            addOperator(new ClipEvenOddRule());
            addOperator(new ClipNonZeroRule());
            addOperator(new ClosePath());
            addOperator(new CurveTo());
            addOperator(new CurveToReplicateFinalPoint());
            addOperator(new CurveToReplicateInitialPoint());
            addOperator(new EndPath());
            addOperator(new FillEvenOddAndStrokePath());
            addOperator(new FillEvenOddRule());
            addOperator(new FillNonZeroAndStrokePath());
            addOperator(new FillNonZeroRule());
            addOperator(new LineTo());
            addOperator(new MoveTo());
            addOperator(new StrokePath());
        }
    
        @Override
        protected void processTextPosition(TextPosition text) {
            Matrix textMatrix = text.getTextMatrix();
            Vector start = textMatrix.transform(new Vector(0, 0));
            Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
    
            PDGraphicsState gs = getGraphicsState();
            Area area = gs.getCurrentClippingPath();
            if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY())))
                super.processTextPosition(text);
        }
    
        private GeneralPath linePath = new GeneralPath();
    
        void deleteCharsInPath() {
            for (List<TextPosition> list : charactersByArticle) {
                List<TextPosition> toRemove = new ArrayList<>();
                for (TextPosition text : list) {
                    Matrix textMatrix = text.getTextMatrix();
                    Vector start = textMatrix.transform(new Vector(0, 0));
                    Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                    if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                        toRemove.add(text);
                    }
                }
                if (toRemove.size() != 0) {
                    System.out.println(toRemove.size());
                    list.removeAll(toRemove);
                }
            }
        }
    
        public final class AppendRectangleToPath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 4) {
                    throw new MissingOperandException(operator, operands);
                }
                if (!checkArrayTypesClass(operands, COSNumber.class)) {
                    return;
                }
                COSNumber x = (COSNumber) operands.get(0);
                COSNumber y = (COSNumber) operands.get(1);
                COSNumber w = (COSNumber) operands.get(2);
                COSNumber h = (COSNumber) operands.get(3);
    
                float x1 = x.floatValue();
                float y1 = y.floatValue();
    
                // create a pair of coordinates for the transformation
                float x2 = w.floatValue() + x1;
                float y2 = h.floatValue() + y1;
    
                Point2D p0 = context.transformedPoint(x1, y1);
                Point2D p1 = context.transformedPoint(x2, y1);
                Point2D p2 = context.transformedPoint(x2, y2);
                Point2D p3 = context.transformedPoint(x1, y2);
    
                // to ensure that the path is created in the right direction, we have to create
                // it by combining single lines instead of creating a simple rectangle
                linePath.moveTo((float) p0.getX(), (float) p0.getY());
                linePath.lineTo((float) p1.getX(), (float) p1.getY());
                linePath.lineTo((float) p2.getX(), (float) p2.getY());
                linePath.lineTo((float) p3.getX(), (float) p3.getY());
    
                // close the subpath instead of adding the last line so that a possible set line
                // cap style isn't taken into account at the "beginning" of the rectangle
                linePath.closePath();
            }
    
            @Override
            public String getName() {
                return "re";
            }
        }
    
        public final class StrokePath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "S";
            }
        }
    
        public final class FillEvenOddRule extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
                deleteCharsInPath();
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "f*";
            }
        }
    
        public class FillNonZeroRule extends OperatorProcessor {
            @Override
            public final void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
                deleteCharsInPath();
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "f";
            }
        }
    
        public final class FillEvenOddAndStrokePath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
                deleteCharsInPath();
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "B*";
            }
        }
    
        public class FillNonZeroAndStrokePath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
                deleteCharsInPath();
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "B";
            }
        }
    
        public final class ClipEvenOddRule extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_EVEN_ODD);
                getGraphicsState().intersectClippingPath(linePath);
            }
    
            @Override
            public String getName() {
                return "W*";
            }
        }
    
        public class ClipNonZeroRule extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.setWindingRule(GeneralPath.WIND_NON_ZERO);
                getGraphicsState().intersectClippingPath(linePath);
            }
    
            @Override
            public String getName() {
                return "W";
            }
        }
    
        public final class MoveTo extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 2) {
                    throw new MissingOperandException(operator, operands);
                }
                COSBase base0 = operands.get(0);
                if (!(base0 instanceof COSNumber)) {
                    return;
                }
                COSBase base1 = operands.get(1);
                if (!(base1 instanceof COSNumber)) {
                    return;
                }
                COSNumber x = (COSNumber) base0;
                COSNumber y = (COSNumber) base1;
                Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
                linePath.moveTo(pos.x, pos.y);
            }
    
            @Override
            public String getName() {
                return "m";
            }
        }
    
        public class LineTo extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 2) {
                    throw new MissingOperandException(operator, operands);
                }
                COSBase base0 = operands.get(0);
                if (!(base0 instanceof COSNumber)) {
                    return;
                }
                COSBase base1 = operands.get(1);
                if (!(base1 instanceof COSNumber)) {
                    return;
                }
                // append straight line segment from the current point to the point
                COSNumber x = (COSNumber) base0;
                COSNumber y = (COSNumber) base1;
    
                Point2D.Float pos = context.transformedPoint(x.floatValue(), y.floatValue());
    
                linePath.lineTo(pos.x, pos.y);
            }
    
            @Override
            public String getName() {
                return "l";
            }
        }
    
        public class CurveTo extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 6) {
                    throw new MissingOperandException(operator, operands);
                }
                if (!checkArrayTypesClass(operands, COSNumber.class)) {
                    return;
                }
                COSNumber x1 = (COSNumber) operands.get(0);
                COSNumber y1 = (COSNumber) operands.get(1);
                COSNumber x2 = (COSNumber) operands.get(2);
                COSNumber y2 = (COSNumber) operands.get(3);
                COSNumber x3 = (COSNumber) operands.get(4);
                COSNumber y3 = (COSNumber) operands.get(5);
    
                Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
                Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
                Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
    
                linePath.curveTo(point1.x, point1.y, point2.x, point2.y, point3.x, point3.y);
            }
    
            @Override
            public String getName() {
                return "c";
            }
        }
    
        public final class CurveToReplicateFinalPoint extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 4) {
                    throw new MissingOperandException(operator, operands);
                }
                if (!checkArrayTypesClass(operands, COSNumber.class)) {
                    return;
                }
                COSNumber x1 = (COSNumber) operands.get(0);
                COSNumber y1 = (COSNumber) operands.get(1);
                COSNumber x3 = (COSNumber) operands.get(2);
                COSNumber y3 = (COSNumber) operands.get(3);
    
                Point2D.Float point1 = context.transformedPoint(x1.floatValue(), y1.floatValue());
                Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
    
                linePath.curveTo(point1.x, point1.y, point3.x, point3.y, point3.x, point3.y);
            }
    
            @Override
            public String getName() {
                return "y";
            }
        }
    
        public class CurveToReplicateInitialPoint extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                if (operands.size() < 4) {
                    throw new MissingOperandException(operator, operands);
                }
                if (!checkArrayTypesClass(operands, COSNumber.class)) {
                    return;
                }
                COSNumber x2 = (COSNumber) operands.get(0);
                COSNumber y2 = (COSNumber) operands.get(1);
                COSNumber x3 = (COSNumber) operands.get(2);
                COSNumber y3 = (COSNumber) operands.get(3);
    
                Point2D currentPoint = linePath.getCurrentPoint();
    
                Point2D.Float point2 = context.transformedPoint(x2.floatValue(), y2.floatValue());
                Point2D.Float point3 = context.transformedPoint(x3.floatValue(), y3.floatValue());
    
                linePath.curveTo((float) currentPoint.getX(), (float) currentPoint.getY(), point2.x, point2.y, point3.x, point3.y);
            }
    
            @Override
            public String getName() {
                return "v";
            }
        }
    
        public final class ClosePath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.closePath();
            }
    
            @Override
            public String getName() {
                return "h";
            }
        }
    
        public final class EndPath extends OperatorProcessor {
            @Override
            public void process(Operator operator, List<COSBase> operands) throws IOException {
                linePath.reset();
            }
    
            @Override
            public String getName() {
                return "n";
            }
        }
    }
    

    PDFVisibleTextStripper

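    Make sure that the PDFVisibleTextStripper constructor uses the inner operator classes, not the classes of the same name used by PageDrawer. To be sure, simply follow the link below the code.

    This reduces the output to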

    REVERSE tEaSER caRd
    500
    elections
    er of Teams
    t Bet
    1,000
    MARK BOX AS SHOWN 
    DENOTES HOME TEAM
    PRO FOOTBALL - THURSDAY, SEPTEMBER 8, 2016
     1 PANTHERS    nbc  - 10½ 8:30p 2 BRONCOS   - 3½
     PRO FOOTBALL - SUNDAY, SEPTEMBER 11, 2016
     3 FALCONS     - 9½ 1:00p 4 BUCCANEERS  - 4½
     5 VIKINGS   - 9½ 1:00p 6 TITANS  - 4½
     7 EAGLES  - 10½ 1:00p 8 BROWNS  - 3½
     9 BENGALS - 9½ 1:00p 10 JETS  - 4½
     11 SAINTS    - 7½ 1:00p 12 RAIDERS   - 6½
     13 CHIEFS  - 14½ 1:00p 14 CHARGERS  + ½
     15 RAVENS  - 10½ 1:00p 16 BILLS - 3½
     17 TEXANS  - 14½ 1:00p 18 BEARS + ½
     19 PACKERS - 12½ 1:00p 20 JAGUARS  - 1½
     21 SEAHAWKS    - 17½ 4:05p 22 DOLPHINS + 3½
     23 COWBOYS    - 7½ 4:25p 24 GIANTS - 6½
     25 COLTS     - 10½ 4:25p 26 LIONS - 3½
     27 CARDINALS   nbc  - 14½ 8:30p 28 PATRIOTS + ½
     PRO FOOTBALL - MONDAY, SEPTEMBER 12, 2016
     29 STEELERS  espn  - 10½ 7:10p 30 REDSKINS  - 3½
     31 RAMS  espn  - 9½ 10:20p 32 49ERS  - 4½
    

    This drops most of the unwanted data.
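The two checks the stripper above relies on can be illustrated without PDFBox, using only java.awt.geom. This is a minimal sketch, not the answer's implementation: the class `VisibilitySketch` and its methods are hypothetical names, and glyphs are reduced to their origin points. A glyph is kept only if its origin lies inside the current clip, and it is dropped again if a later fill covers it.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of PDFBox or of the answer's code.
final class VisibilitySketch {
    // Stand-in for the glyph origins collected so far on the page.
    private final List<Point2D> keptOrigins = new ArrayList<>();

    /** Keeps a glyph origin only if it lies inside the current clip (null means no clip). */
    void addGlyph(Area clip, Point2D origin) {
        if (clip == null || clip.contains(origin)) {
            keptOrigins.add(origin);
        }
    }

    /** A later fill paints over everything inside the path, so covered origins are dropped. */
    void fill(GeneralPath path) {
        keptOrigins.removeIf(origin -> path.contains(origin));
    }

    List<Point2D> kept() {
        return keptOrigins;
    }

    public static void main(String[] args) {
        VisibilitySketch page = new VisibilitySketch();
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        page.addGlyph(clip, new Point2D.Double(10, 10));   // inside the clip, stays
        page.addGlyph(clip, new Point2D.Double(150, 10));  // outside the clip, never kept
        page.addGlyph(clip, new Point2D.Double(50, 50));   // inside the clip, covered below

        // A filled square from (40, 40) to (60, 60) hides the third glyph.
        GeneralPath cover = new GeneralPath();
        cover.moveTo(40, 40);
        cover.lineTo(60, 40);
        cover.lineTo(60, 60);
        cover.lineTo(40, 60);
        cover.closePath();
        page.fill(cover);

        System.out.println(page.kept().size() + " glyph origin(s) remain visible");
    }
}
```

Like the stripper above, this sketch ignores fill colour and opacity: any fill is treated as hiding the text beneath it.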


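    In the context of this question, it became apparent that the way processTextPosition and deleteCharsInPath calculate the end of the character baseline implicitly assumes horizontal text and no page rotation. If one relaxes the visibility criterion, however, and treats a character as visible as soon as the start of its baseline is visible, the calculated Vector end is no longer needed and the code also works for rotated pages.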
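The difference between the two criteria can be shown with plain java.awt.geom. This is a sketch under assumptions: `RelaxedVisibilitySketch`, `strictlyVisible`, and `startVisible` are hypothetical names, and the baseline end is computed as in the code above by adding the glyph width to the x coordinate of the start.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical illustration class; not part of PDFBox or of the answer's code.
final class RelaxedVisibilitySketch {
    /** Strict test: the start and a horizontally computed end must both be inside the clip. */
    static boolean strictlyVisible(Area clip, double startX, double startY, double width) {
        return clip == null
                || (clip.contains(startX, startY) && clip.contains(startX + width, startY));
    }

    /** Relaxed test: the glyph counts as visible as soon as its baseline start is inside. */
    static boolean startVisible(Area clip, double startX, double startY) {
        return clip == null || clip.contains(startX, startY);
    }

    public static void main(String[] args) {
        // A tall, narrow clip area, as it might occur for text running upwards.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 10, 100));

        // Glyph origin (5, 5) with width 20: the horizontally assumed end (25, 5) is outside.
        System.out.println(strictlyVisible(clip, 5, 5, 20)); // prints "false"
        System.out.println(startVisible(clip, 5, 5));        // prints "true"
    }
}
```

The relaxed test needs no knowledge of the text direction, which is why it carries over to rotated pages. The price is that a glyph whose start is just inside the clip but whose body is mostly clipped away is still reported.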


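    In the context of this question, it became apparent that, because of floating-point calculation errors, glyph origin coordinates lying exactly on the border of the clip path may drift outside it. Switching to a "fat point coordinate check" turned out to be an acceptable workaround.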
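One way to picture such a fat point check is to test a tiny square around the origin for intersection with the clip area instead of testing the single point for containment. The following is a minimal sketch using java.awt.geom only; the class `FatPointSketch`, the method `containsFat`, and the tolerance value are assumptions for illustration and not necessarily what the linked code uses.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

// Hypothetical illustration class; not part of PDFBox or of the answer's code.
final class FatPointSketch {
    /** Half the side length of the square tested instead of a single point (assumed value). */
    static final double TOLERANCE = 0.001;

    /** Returns true if a small square centred on (x, y) overlaps the area. */
    static boolean containsFat(Area area, double x, double y) {
        return area.intersects(x - TOLERANCE, y - TOLERANCE, 2 * TOLERANCE, 2 * TOLERANCE);
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));

        // An origin meant to sit on the right border, drifted slightly outside.
        double x = 100.00001;
        double y = 50;

        System.out.println(clip.contains(x, y));     // prints "false"
        System.out.println(containsFat(clip, x, y)); // prints "true"
    }
}
```

The tolerance is a trade-off: a larger value tolerates more drift but also lets through glyphs that really do start slightly outside the clip.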