Java: how do I convert a long string into an array or database fields?


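I am working on some extensions for OpenBravoPOS that read the invoices from the company where we order our products.

The invoice is created as a PDF. I use the iText library to read the specific order lines. The problem is that I can only read the page I need as one big string. The string looks like this: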

LEVERINGSBON 30/06/2012 27828/2012/NL/WebShop   Distributeur ID nummer: 15099191 Uw distributeur: Klant Naam: FM Point Marcel Snoeck Adres: Zonnedauw 17 5953MS Reuver Telefoon: +31654317017 E-MAIL: yvonneenmarcel@home.nl Opmerking: -  Lp. Rekening Totaal FV/39525/2012/NL     vd Wal Sandra 72.00 1 3 x 354 - Luxury Collection 50ml NEW! 72.00 FV/39526/2012/NL     Slaats Tim 6.00 2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 FV/39527/2012/NL     Nabben Britt 44.95 3 3 x E013 - Krachtreiniger 1000ml 24.75 4 2 x E016 -Tapijtreiniger 1000ml 9.20 5 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39528/2012/NL     Nabben Lieke 32.00 6 1 x 192 - Luxury Collection 50ml 21.00 7 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39529/2012/NL     Claessens Patrick 12.40 8 1 x P101 - Peeling VERBENA 12.40 FV/39530/2012/NL     Smits Yolanda 56.00 9 1 x E006 - Wasmiddel VIVID COLOURS 1000ml 7.00 10 2 x B023 - Body Lotion 200ml NEW 18.40 11 2 x 023 - Classic Collection 30ml 30.60 FV/39531/2012/NL     van Pol-Thijssen Silvia 34.70 12 1 x 110 - Classic Collection 50ml 15.30 13 1 x N003 - Nagellak HOT RED 7.00 14 1 x P103 - Peeling CHERRY BLOSSOM 12.40 Aantal: 21 Totaal: 258.05 € 1.17.4564.29482 1/1        "

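What I am trying to do is read each line and determine whether it is an order line. If it is, I need to put it into the database.

An order line looks like this: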

2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00

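You can read it like this: order line number 2, product KR01, quantity 1, description Eye Pencil DECADENCE BLACK, price 6.00.

Is there a simple way to read this long string and split it into the correct order lines?

Thanks for your replies.

My code so far is: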

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package part4.chapter15;

import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/small.pdf";
    /** The resulting text file. */
    public static final String RESULT = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/sample-result.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {

        /** Putting result in Array, to be able extract to Table */
        PdfArray array;

        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            String str = strategy.getResultantText();
            CharSequence FindPage = "Lp. Rekening Totaal";
            if (str.contains(FindPage)) {
                out.println(strategy.getResultantText());
            }
        }
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }

}

2 answers

# Answer 1

You can design a regex to solve this in many different ways. Here is one:

    String pdf = "LEVERINGSBON 30/06/2012 27828/2012/NL/WebShop   Distributeur ID nummer: 15099191 Uw distributeur: Klant Naam: FM Point Marcel Snoeck Adres: Zonnedauw 17 5953MS Reuver Telefoon: +31654317017 E-MAIL: yvonneenmarcel@home.nl Opmerking: - Lp. Rekening Totaal FV/39525/2012/NL     vd Wal Sandra 72.00 1 3 x 354 - Luxury Collection 50ml NEW! 72.00 FV/39526/2012/NL     Slaats Tim 6.00 2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 FV/39527/2012/NL     Nabben Britt 44.95 3 3 x E013 - Krachtreiniger 1000ml 24.75 4 2 x E016 -Tapijtreiniger 1000ml 9.20 5 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39528/2012/NL     Nabben Lieke 32.00 6 1 x 192 - Luxury Collection 50ml 21.00 7 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39529/2012/NL     Claessens Patrick 12.40 8 1 x P101 - Peeling VERBENA 12.40 FV/39530/2012/NL     Smits Yolanda 56.00 9 1 x E006 - Wasmiddel VIVID COLOURS 1000ml 7.00 10 2 x B023 - Body Lotion 200ml NEW 18.40 11 2 x 023 - Classic Collection 30ml 30.60 FV/39531/2012/NL     van Pol-Thijssen Silvia 34.70 12 1 x 110 - Classic Collection 50ml 15.30 13 1 x N003 - Nagellak HOT RED 7.00 14 1 x P103 - Peeling CHERRY BLOSSOM 12.40 Aantal: 21 Totaal: 258.05 € 1.17.4564.29482 1/1        ";
    String patternString = "\\d\\s\\d\\sx.*?\\d\\.\\d\\d";
    Matcher matcher = Pattern.compile(patternString).matcher(pdf);
    List<String> dataRows = new ArrayList<String>();
    while (matcher.find()) {
        dataRows.add(matcher.group());
    }
    System.out.println(dataRows);

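Explanation of the regular expression:
\\d\\s\\d\\sx: matches a digit, a whitespace character, a digit, a whitespace character, and a literal 'x'
.*?: matches any number of arbitrary characters, but non-greedily, so the match stops at the first price that follows rather than the last one in the string
\\d\\.\\d\\d: matches the final number with two decimal places
This may need adjusting depending on how your data varies (for example, \\d matches only a single digit, so two-digit line numbers or quantities need \\d+), but it should be a good starting point.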
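To make the effect of the non-greedy quantifier concrete, here is a minimal sketch that runs the pattern above and a greedy variant on a shortened excerpt of the invoice text. The class name `GreedyVsReluctant`, the constant `TEXT`, and the helper `findAll` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    // Shortened excerpt of the invoice text from the question (two order lines).
    static final String TEXT =
            "Slaats Tim 6.00 2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 "
            + "FV/39529/2012/NL     Claessens Patrick 12.40 "
            + "8 1 x P101 - Peeling VERBENA 12.40 Aantal: 21";

    /** Returns every substring of text matched by regex, in order of appearance. */
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Non-greedy: each match ends at the first price after the 'x',
        // so the two order lines come back as two separate matches.
        System.out.println(findAll("\\d\\s\\d\\sx.*?\\d\\.\\d\\d", TEXT));

        // Greedy: .* runs to the end of the text and backtracks to the last price,
        // so both order lines are swallowed into a single match.
        System.out.println(findAll("\\d\\s\\d\\sx.*\\d\\.\\d\\d", TEXT));
    }
}
```

With the greedy form, everything between the first order line and the last price ends up in one string, which is why the non-greedy `.*?` is the better fit for text where many order lines sit on one line.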

If you need a list of custom data structures instead of strings, you can retrieve the individual parts of each match like this:

...  
String patternString = "(\\d)\\s(\\d)\\sx.*?\\d\\.\\d\\d";
...
while (matcher.find()) {
    MyDataObj m = new MyDataObj();
    // group(1) is the order line number, group(2) is the quantity
    m.setSomeField(matcher.group(1));
    m.setAnotherField(matcher.group(2));
}

Just wrap each value you want to keep in parentheses in the pattern and retrieve them with matcher.group(1), matcher.group(2), and so on (matcher.group(0) gives the whole match).
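As a rough sketch of the capturing-group idea, the following turns each order line into a small object with typed fields. Everything named here is an assumption made for illustration: the `OrderLineParser` and `OrderLine` classes, the field names, and the pattern itself, which accepts multi-digit numbers, treats the product code as optional, and allows non-breaking spaces (`\xA0`) in case the extracted PDF text contains them.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OrderLineParser {

    /** Hypothetical value class holding the fields of one order line. */
    static final class OrderLine {
        final int lineNumber;
        final int quantity;
        final String productCode; // null when the line carries no product code
        final String description;
        final BigDecimal price;

        OrderLine(int lineNumber, int quantity, String productCode,
                  String description, BigDecimal price) {
            this.lineNumber = lineNumber;
            this.quantity = quantity;
            this.productCode = productCode;
            this.description = description;
            this.price = price;
        }
    }

    // Ordinary whitespace or a non-breaking space; assumed separators in the extracted text.
    private static final String SP = "[\\s\\xA0]";

    private static final Pattern ORDER_LINE = Pattern.compile(
            "(\\d+)" + SP + "(\\d+)" + SP + "x" + SP   // group 1: line number, group 2: quantity
            + "(?:(\\w+)" + SP + "-" + SP + "?)?"       // group 3: optional product code before the dash
            + "(.*?)" + SP                              // group 4: description, matched non-greedily
            + "(\\d+\\.\\d{2})");                       // group 5: price with two decimals

    /** Extracts all order lines found in the given text, in order of appearance. */
    static List<OrderLine> parse(String text) {
        List<OrderLine> lines = new ArrayList<>();
        Matcher m = ORDER_LINE.matcher(text);
        while (m.find()) {
            lines.add(new OrderLine(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    m.group(3),                 // null if the optional group did not match
                    m.group(4).trim(),
                    new BigDecimal(m.group(5))));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Excerpt of the invoice text from the question.
        String text = "FV/39527/2012/NL     Nabben Britt 44.95 "
                + "3 3 x E013 - Krachtreiniger 1000ml 24.75 "
                + "4 2 x E016 -Tapijtreiniger 1000ml 9.20 "
                + "5 1 x 3 Step Mascara PERFECT BLACK 11.00";
        for (OrderLine line : parse(text)) {
            System.out.println(line.lineNumber + " | " + line.quantity + " | "
                    + line.productCode + " | " + line.description + " | " + line.price);
        }
    }
}
```

Using `BigDecimal` for the price avoids floating-point rounding when the values are later summed or written to a database column. Only the groups that carry data are captured, so there are no empty groups to skip over.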

# Answer 2

The answer worked well. The resulting code looks like this:

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package part4.chapter15;

import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/big.pdf" ;
    /** The resulting text file. */
    public static final String RESULT = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/sample-result.txt" ;

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {

        /** Putting result in Array, to be able extract to Table */
        PdfArray array;

        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            String str = strategy.getResultantText();
            CharSequence FindPage = "Lp. Rekening Totaal";
            if (str.contains(FindPage)) {
                /* Pattern re = Pattern.compile("(\\d+)\\s(\\d+)(\\xA0)x(\\xA0)(.*?)(\\d+\\.\\d{2})"); */
                /* Pattern for order lines of articles with a product code */
                Pattern re2 = Pattern.compile("(\\d+)\\s(\\d+)(\\xA0)x(\\xA0)(\\w+)(\\xA0)-\\s(.*?)(\\d+\\.\\d{2})");
                Matcher m = re2.matcher(str);
                int mIdx = 0;
                while (m.find()) {
                    // group 0 is the whole match; groups 1..groupCount() are the captured parts
                    for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
                        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
                    }
                    mIdx++;
                }

                /* System.out.println(dataRows); */

                out.println(strategy.getResultantText());
            }
        }
        out.flush();
        out.close();
    }


    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }

}

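The output looks like this:

Complete order line [0][0] = 4 3 x 023 - Classic Collection 30ml 45.90

Line number [0][1] = 4

Quantity [0][2] = 3

Empty [0][3] =

Empty [0][4] =

Product code [0][5] = 023

Empty [0][6] =

Product description [0][7] = Classic Collection 30ml

Price [0][8] = 45.90

[1][0] = 5 2 x C052 - Hand and Nail Cream 100ml NEW 15.20

[1][1] = 5

[1][2] = 2

[1][3] =

[1][4] =

[1][5] = C052

[1][6] =

[1][7] = Hand and Nail Cream 100ml NEW

[1][8] = 15.20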

Thanks for the great support.
