Java string split performance

Here is the current code in my application:

String[] ids = str.split("/");

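When profiling the application, a non-negligible amount of time is spent on string splitting. Also, the split method takes a regular expression, which is superfluous here.

What alternative can I use to optimize the string splitting? Is {} fast?

(I would have tried and tested this myself, but profiling my application takes a lot of time.)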


6 answers

  1. # Answer 1

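    If you can use third-party libraries, Guava's Splitter does not incur the overhead of regular expressions when you don't ask for it, and is very fast as a general rule. (Disclosure: I contribute to Guava.)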

    Iterable<String> split = Splitter.on('/').split(string);
    

    (Also, Splitter is as a rule much more predictable than String.split.)

  2. # Answer 2

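    Since I am working at large scale, I thought it would help to provide some more benchmarks, including a few of my own implementations (I split on spaces, but this should illustrate how long it takes in general):

    I'm working with a 426 MB file with 2622761 lines. The only whitespace characters are normal spaces (" ") and line breaks ("\n").

    First, I replace all line breaks with spaces and benchmark parsing one huge line: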

    .split(" ")
    Cumulative time: 31.431366952 seconds
    
    .split("\\s")
    Cumulative time: 52.948729489 seconds
    
    splitStringChArray()
    Cumulative time: 38.721338004 seconds
    
    splitStringChList()
    Cumulative time: 12.716065893 seconds
    
    splitStringCodes()
    Cumulative time: 1 minutes, 21.349029036000005 seconds
    
    splitStringCharCodes()
    Cumulative time: 23.459840685 seconds
    
    StringTokenizer
    Cumulative time: 1 minutes, 11.501686094999997 seconds
    

    Then I benchmark splitting line by line (meaning that the functions and loops are run many times, instead of all at once):

    .split(" ")
    Cumulative time: 3.809014174 seconds
    
    .split("\\s")
    Cumulative time: 7.906730124 seconds
    
    splitStringChArray()
    Cumulative time: 4.06576739 seconds
    
    splitStringChList()
    Cumulative time: 2.857809996 seconds
    
    Bonus: splitStringChList(), but creating a new StringBuilder every time (the average difference is actually more like .42 seconds):
    Cumulative time: 3.82026621 seconds
    
    splitStringCodes()
    Cumulative time: 11.730249921 seconds
    
    splitStringCharCodes()
    Cumulative time: 6.995555826 seconds
    
    StringTokenizer
    Cumulative time: 4.500008172 seconds
    

    Here's the code:

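    // Use a char array, and count the number of instances first.
    // The caller passes in an empty StringBuilder so it can be reused across calls.
    public static String[] splitStringChArray(String str, StringBuilder sb) {
        char[] strArray = str.toCharArray();
        int count = 0;
        for (char c : strArray) {
            if (c == ' ') {
                count++;
            }
        }
        String[] splitArray = new String[count+1];
        int i=0;
        for (char c : strArray) {
            if (c == ' ') {
                splitArray[i++] = sb.toString(); // store the token and advance to the next slot
                sb.delete(0, sb.length());
            } else {
                sb.append(c);
            }
        }
        splitArray[i] = sb.toString(); // the last token is not followed by a space
        sb.delete(0, sb.length());
        return splitArray;
    }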
    
    // Use a char array but create an ArrayList, and don't count beforehand.
    // Requires java.util.ArrayList.
    public static ArrayList<String> splitStringChList(String str, StringBuilder sb) {
        ArrayList<String> words = new ArrayList<String>();
        words.ensureCapacity(str.length()/5);
        char[] strArray = str.toCharArray();
        for (char c : strArray) {
            if (c == ' ') {
                words.add(sb.toString());
                sb.delete(0, sb.length());
            } else {
                sb.append(c);
            }
        }
        words.add(sb.toString()); // the last token is not followed by a space
        sb.delete(0, sb.length());
        return words;
    }
    
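    // Using an iterator through code points and returning an ArrayList.
    // Requires java.util.ArrayList, java.util.stream.IntStream and java.util.PrimitiveIterator.OfInt.
    public static ArrayList<String> splitStringCodes(String str) {
        ArrayList<String> words = new ArrayList<String>();
        words.ensureCapacity(str.length()/5);
        IntStream is = str.codePoints();
        OfInt it = is.iterator();
        int cp;
        StringBuilder sb = new StringBuilder();
        while (it.hasNext()) {
            cp = it.next();
            if (cp == 32) {
                words.add(sb.toString());
                sb.delete(0, sb.length());
            } else {
                // append(int) would append the decimal digits of the code point
                sb.appendCodePoint(cp);
            }
        }
        words.add(sb.toString()); // the last token is not followed by a space

        return words;
    }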
    
    // This one is for compatibility with supplementary or surrogate characters (by using Character.codePointAt())
    public static ArrayList<String> splitStringCharCodes(String str, StringBuilder sb) {
        char[] strArray = str.toCharArray();
        ArrayList<String> words = new ArrayList<String>();
        words.ensureCapacity(str.length()/5);
        int cp;
        int len = strArray.length;
        // Step by one or two chars so a surrogate pair is read only once.
        for (int i=0; i<len; i += Character.charCount(cp)) {
            cp = Character.codePointAt(strArray, i);
            if (cp == ' ') {
                words.add(sb.toString());
                sb.delete(0, sb.length());
            } else {
                // append(int) would append the decimal digits of the code point
                sb.appendCodePoint(cp);
            }
        }
        words.add(sb.toString()); // the last token is not followed by a space
        sb.delete(0, sb.length());

        return words;
    }
    

    Here's how I use StringTokenizer:

        StringTokenizer tokenizer = new StringTokenizer(file.getCurrentString());
        words = new String[tokenizer.countTokens()];
        int i = 0;
        while (tokenizer.hasMoreTokens()) {
            words[i] = tokenizer.nextToken();
            i++;
        }
    
  3. # Answer 3

    StringTokenizer is much faster for simple parsing like this (I did some benchmarking a while back and you get huge speedups).

    StringTokenizer st = new StringTokenizer("1/2/3","/");
    String[] arr = new String[st.countTokens()];
    // Fill every slot, not just the first one.
    for (int i = 0; st.hasMoreTokens(); i++)
        arr[i] = st.nextToken();
    

    If you want to squeeze out more performance, you can also do it manually:

    String s = "1/2/3";
    char[] c = s.toCharArray();
    LinkedList<String> ll = new LinkedList<String>();
    int index = 0;
    
    for(int i=0;i<c.length;i++) {
        if(c[i] == '/') {
            ll.add(s.substring(index,i));
            index = i+1;
        }
    }
    ll.add(s.substring(index)); // add the token after the last delimiter
    
    String[] arr = new String[ll.size()];
    Iterator<String> iter = ll.iterator();
    
    // Copy the tokens into the array, advancing the index once per token.
    for(index = 0; iter.hasNext(); index++)
        arr[index] = iter.next();
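
    One thing worth checking before swapping String.split for StringTokenizer: the two do not treat consecutive delimiters the same way. The sketch below (the class and method names are made up for illustration) shows that split keeps the empty string between two adjacent slashes while StringTokenizer skips it, so the results only match when no field is empty.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.StringTokenizer;

    // Hypothetical helper class, used only to compare the two approaches.
    class TokenizerVsSplit {
        // Collects every token that StringTokenizer yields for the given delimiter.
        static List<String> tokenize(String s, String delim) {
            StringTokenizer st = new StringTokenizer(s, delim);
            List<String> out = new ArrayList<>(st.countTokens());
            while (st.hasMoreTokens()) {
                out.add(st.nextToken());
            }
            return out;
        }

        public static void main(String[] args) {
            String input = "1//3";
            System.out.println(Arrays.asList(input.split("/"))); // [1, , 3]
            System.out.println(tokenize(input, "/"));            // [1, 3]
        }
    }
    ```

    If empty fields matter in your ids, a hand-rolled loop or split is probably the safer choice.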
    
  4. # Answer 4

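    Guava has a Splitter, which is more flexible than the String.split() method and does not (necessarily) use a regex. OTOH, String.split() was optimized in Java 7 to avoid the regex machinery when the separator is a single char. So the performance should be similar in Java 7.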
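
    When the separator does not qualify for that single-char shortcut (a multi-character separator, for instance), String.split compiles the pattern on every call. A common workaround, sketched here on the assumption that the separator is fixed, is to compile it once and reuse it. The class name and the "::" separator are invented for the example.

    ```java
    import java.util.regex.Pattern;

    // Hypothetical holder for a separator pattern that is compiled only once.
    class PrecompiledSplit {
        // Pattern.quote makes the separator match literally rather than as a regex.
        private static final Pattern SEPARATOR = Pattern.compile(Pattern.quote("::"));

        static String[] split(String s) {
            return SEPARATOR.split(s);
        }

        public static void main(String[] args) {
            System.out.println(String.join(",", split("a::b::c"))); // a,b,c
        }
    }
    ```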

  5. # Answer 5

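    According to this post, it is about twice as fast.

    However, unless your application is of gigantic scale, split should be fine for you (cf. the same post, which cites thousands of strings in a few milliseconds).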

  6. # Answer 6

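    If the pattern is only one character long, String.split(String) won't create a regexp. When splitting by a single character, it uses specialized code that is quite efficient. StringTokenizer is not much faster in this particular case.

    This was introduced in OpenJDK7/OracleJDK7. Here's a bug report and a commit. I've made a simple benchmark here.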


    $ java -version
    java version "1.8.0_20"
    Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
    Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)
    
    $ java Split
    split_banthar: 1231
    split_tskuzzy: 1464
    split_tskuzzy2: 1742
    string.split: 1291
    StringTokenizer: 1517
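
    The single-character case described in this answer boils down to scanning for the delimiter and cutting substrings. Below is a hand-written sketch of that idea using indexOf. It is not the JDK's code, and the class and method names are made up. One difference to be aware of: String.split drops trailing empty strings, while this version keeps them.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper: regex-free split on a single char.
    class IndexOfSplit {
        static List<String> split(String s, char delim) {
            List<String> parts = new ArrayList<>();
            int start = 0;
            int end;
            // Jump from one delimiter to the next instead of visiting every char by hand.
            while ((end = s.indexOf(delim, start)) != -1) {
                parts.add(s.substring(start, end));
                start = end + 1;
            }
            parts.add(s.substring(start)); // remainder after the last delimiter
            return parts;
        }

        public static void main(String[] args) {
            System.out.println(split("1/2/3", '/')); // [1, 2, 3]
        }
    }
    ```

    Whether this beats String.split on a given JVM is something to measure rather than assume, given how close the numbers above are.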