Java: optimizing CSV parsing for speed
I am working on a program that reads two large CSV files line by line, compares array elements from the two files, and, when a match is found, writes the data I need to a third file. My only problem is that it is very slow: it reads 1-2 lines per second, which is far too slow because I have millions of records. Is there any way to speed it up? Here is my code:
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class ReadWriteCsv {

    public static void main(String[] args) throws IOException {
        FileInputStream inputStream = null;
        FileInputStream inputStream2 = null;
        Scanner sc = null;
        Scanner sc2 = null;
        String csvSeparator = ",";
        String line;
        String line2;
        String path = "D:/test1.csv";
        String path2 = "D:/test2.csv";
        String path3 = "D:/newResults.csv";
        String[] columns;
        String[] columns2;
        boolean matchFound = false;
        int count = 0;
        StringBuilder builder = new StringBuilder();
        FileWriter writer = new FileWriter(path3);
        try {
            // specifies where to take the files from
            inputStream = new FileInputStream(path);
            // creating a scanner for file1
            sc = new Scanner(inputStream, "UTF-8");
            // while there is another line available do:
            while (sc.hasNextLine()) {
                count++;
                // storing the current line in the temporary variable "line"
                line = sc.nextLine();
                System.out.println("Number of lines read so far: " + count);
                // defines columns[] as the line split on ","
                columns = line.split(",");
                // re-opens file2 for every single line of file1
                inputStream2 = new FileInputStream(path2);
                sc2 = new Scanner(inputStream2, "UTF-8");
                // reads file2 until a match is found or it is exhausted
                while (!matchFound && sc2.hasNextLine()) {
                    line2 = sc2.nextLine();
                    columns2 = line2.split(",");
                    if (columns[3].equals(columns2[1])) {
                        matchFound = true;
                        builder.append(columns[3]).append(csvSeparator);
                        builder.append(columns[1]).append(csvSeparator);
                        builder.append(columns2[2]).append(csvSeparator);
                        builder.append(columns2[3]).append("\n");
                        String result = builder.toString();
                        writer.write(result);
                    }
                }
                builder.setLength(0);
                sc2.close();
                matchFound = false;
            }
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        } finally {
            // then I close my input streams, scanners and writer
            if (sc != null) {
                sc.close();
            }
            writer.close();
        }
    }
}
# Answer 1
Use the univocity-parsers CSV parser; with it, processing two files of a million rows each should take no more than a couple of seconds.
Disclosure: I am the author of this library. It is open source and free (Apache 2.0 license).
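For reference, a minimal sketch of how that parser is typically used (this assumes the univocity-parsers jar is on the classpath; the file path is illustrative, taken from the question):

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.FileReader;
import java.io.IOException;

public class UnivocityExample {
    public static void main(String[] args) throws IOException {
        CsvParserSettings settings = new CsvParserSettings();
        // let the parser detect \n vs \r\n automatically
        settings.setLineSeparatorDetectionEnabled(true);

        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(new FileReader("D:/test1.csv")); // illustrative path

        String[] row;
        while ((row = parser.parseNext()) != null) {
            // row is one parsed line, already split into columns
            // (handles quoted fields, embedded commas, etc., unlike String.split)
        }
        parser.stopParsing();
    }
}
```

Unlike `line.split(",")`, a real CSV parser copes with quoted fields that contain commas, which is one of the robustness points raised in the other answer.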
# Answer 2
Use an existing CSV library rather than rolling your own; it will be far more robust than what you have now.
However, your problem is not CSV parsing speed. The real issue is that your algorithm is O(n^2): for every line of the first file, you scan the entire second file. That kind of algorithm blows up quickly as the data grows, and with millions of rows you run into trouble. You need a better algorithm.
The other problem is that you re-open and re-parse the second file on every scan. At the very least you should read it into memory once (an ArrayList or similar) at the start of the program, so you only load and parse it a single time.
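Putting both points together, a minimal sketch of the idea: read the second file once into a HashMap keyed by its join column (column 1), then stream the first file once, turning each comparison into an O(1) lookup. The class and method names are illustrative, not from the original post, and the naive `split(",")` is kept from the question for brevity:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class HashJoinCsv {

    // Index file2 by column 1, then stream file1 once: O(n + m) instead of O(n * m).
    public static void join(String path1, String path2, String outPath) throws IOException {
        // Pass 1: read file2 ONCE and index its rows by column 1.
        Map<String, String[]> index = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(path2))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] cols = line.split(",");
                if (cols.length > 3) {
                    // keep the first occurrence, like the original inner loop does
                    index.putIfAbsent(cols[1], cols);
                }
            }
        }

        // Pass 2: stream file1 and look each row up in constant time.
        try (BufferedReader r = new BufferedReader(new FileReader(path1));
             BufferedWriter w = new BufferedWriter(new FileWriter(outPath))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] cols = line.split(",");
                if (cols.length > 3) {
                    String[] match = index.get(cols[3]);
                    if (match != null) {
                        // same output columns as the original program
                        w.write(cols[3] + "," + cols[1] + "," + match[2] + "," + match[3] + "\n");
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // same hypothetical paths as the question
        join("D:/test1.csv", "D:/test2.csv", "D:/newResults.csv");
    }
}
```

This keeps the whole second file in memory, so it assumes file2 fits in the heap; for a few million short rows that is usually fine, and it is the single change that removes the quadratic behavior.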