有 Java 编程相关的问题?

你可以在下面搜索框中键入要查询的问题!

java为什么JVM在一次繁忙的旋转暂停后,会对同一代码块显示更多的延迟?

下面的代码明确地说明了问题,即:

The exact same block of code becomes slower after a busy spin pause.

请注意,我当然没有使用Thread.sleep。还请注意,当我使用数学运算而不是IF更改暂停时,没有导致热点/JIT取消优化的条件

  • 有一个数学运算块,我想计时
  • 首先,在开始测量之前,我将块暂停1纳秒。我做了两万次
  • 然后我将暂停时间从1纳秒改为5秒,然后像往常一样继续测量延迟。我做了15次
  • 然后我打印最后30个测量值,这样你可以看到15个测量值,暂停1纳秒,15个测量值,暂停5秒

如下图所示,差异很大,尤其是在暂停更改后的第一次测量中这是为什么

$ java -server -cp . JvmPauseLatency
Sat Apr 29 10:34:28 EDT 2017 => Please wait 75 seconds for the results...
Sat Apr 29 10:35:43 EDT 2017 => Calculation: 4.0042328611017236E11
Results:
215
214
215
214
215
214
217
215
216
214
216
213
215
214
215
2343 <----- FIRST MEASUREMENT AFTER PAUSE CHANGE
795
727
942
778
765
856
762
801
708
692
765
776
780
754

守则:

import java.util.Arrays;
import java.util.Date;
import java.util.Random;

public class JvmPauseLatency {

    private static final int WARMUP = 20000;
    private static final int EXTRA = 15;
    private static final long PAUSE = 5 * 1000000000L; // in nanos

    private final Random rand = new Random();
    private int count;
    private double calculation;
    private final long[] results = new long[WARMUP + EXTRA];
    private long interval = 1; // in nanos

    private long busyPause(long pauseInNanos) {
        final long start = System.nanoTime();
        long until = Long.MAX_VALUE;
        while(System.nanoTime() < until) {
           until = start + pauseInNanos;
        }
        return until;
    }

    public void run() {

        long testDuration = ((WARMUP * 1) + (EXTRA * PAUSE)) / 1000000000L;
        System.out.println(new Date() +" => Please wait " + testDuration + " seconds for the results...");

        while(count < results.length) {

            double x = busyPause(interval);

            long latency = System.nanoTime();

            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);

            latency = System.nanoTime() - latency;

            results[count++] = latency;
            interval = (count / WARMUP * (PAUSE - 1)) + 1; // it will change to PAUSE when it reaches WARMUP
        }

        // now print the last (EXTRA * 2) results so you can compare before and after the pause change (from 1 to PAUSE)
        System.out.println(new Date() + " => Calculation: " + calculation);
        System.out.println("Results:");
        long[] array = Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length);
        for(long t: array) System.out.println(t);
    }

    public static void main(String[] args) {
        new JvmPauseLatency().run();
    }
}

共 (2) 个答案

  1. # 1 楼答案

    TL;博士

    http://www.brendangregg.com/activebenchmarking.html

    casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

    问题N1。暂停更改后的第一次测量

    看起来你面临的是on-stack replacement。当OSR发生时,VM暂停,目标函数的堆栈帧替换为等效帧

    根案例是错误的微基准-它没有正确地预热。在while循环之前,只需在基准测试中插入以下行即可修复它:

    System.out.println("WARMUP = " + busyPause(5000000000L));
    

    如何检查-只需使用-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls运行您的基准测试。我已经修改了您的代码-现在它会在每次调用之前将间隔打印到系统输出中:

    interval = 1
    interval = 1
    interval = 5000000000
        689  145       4       JvmPauseLatency::busyPause (19 bytes)   made not entrant
        689  146       3       JvmPauseLatency::busyPause (19 bytes)
    Installing method (3) JvmPauseLatency.busyPause(J)J 
        698  147 %     4       JvmPauseLatency::busyPause @ 6 (19 bytes)
    Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
        702  148       4       JvmPauseLatency::busyPause (19 bytes)
        705  146       3       JvmPauseLatency::busyPause (19 bytes)   made not entrant
    Installing method (4) JvmPauseLatency.busyPause(J)J 
    interval = 5000000000
    interval = 5000000000
    interval = 5000000000
    interval = 5000000000
    

    OSR通常发生在第4层,因此为了禁用它,您可以使用以下选项:

    • -XX:-TieredCompilation禁用分层编译
    • -XX:-TieredCompilation -XX:TieredStopAtLevel=3禁用分层编译到4级
    • -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement禁用OSR

    问题N2。如何测量

    让我们从这篇文章开始。简言之:

    • JIT只能编译一个方法——在测试中只有一个循环,所以只有OSR可用于测试
    • 您正在尝试测量一些小的东西,可能小于nanoTime()调用(请参见What is the cost of volatile write?
    • 微体系结构级别–缓存、CPU管道暂停非常重要,例如,TLB未命中或分支预测失误所花费的时间超过测试执行时间

    因此,为了避免所有这些陷阱,您可以使用基于JMH的基准测试,如下所示:

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;
    import org.openjdk.jmh.runner.options.VerboseMode;
    
    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    
    @State(Scope.Benchmark)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
    @Fork(value = 2)
    public class LatencyTest {
    
        public static final long LONG_PAUSE = 5000L;
        public static final long SHORT_PAUSE = 1L;
        public Random rand;
    
        @Setup
        public void initI() {
            rand = new Random(0xDEAD_BEEF);
        }
    
        private long busyPause(long pauseInNanos) {
            Blackhole.consumeCPU(pauseInNanos);
            return pauseInNanos;
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long latencyBusyPauseShort() {
            return busyPause(SHORT_PAUSE);
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long latencyBusyPauseLong() {
            return busyPause(LONG_PAUSE);
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long latencyFunc() {
            return doCalculation(1);
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long measureShort() {
            long x = busyPause(SHORT_PAUSE);
            return doCalculation(x);
        }
    
        @Benchmark
        @BenchmarkMode({Mode.AverageTime})
        public long measureLong() {
            long x = busyPause(LONG_PAUSE);
            return doCalculation(x);
        }
    
        private long doCalculation(long x) {
            long calculation = 0;
            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);
            return calculation;
        }
    
        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    .include(LatencyTest.class.getName())
                    .verbosity(VerboseMode.NORMAL)
                    .build();
            new Runner(options).run();
        }
    }
    

    请注意,为了避免操作系统相关的影响,我已将忙循环实现更改为Blackhole#consumeCPU()。因此,我的结果是:

    Benchmark                          Mode  Cnt      Score     Error  Units
    LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
    LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
    LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
    LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
    LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op
    

    请注意,结果几乎是相加的,即latencyFunc+latencyBusyPauseShort=measureShort

    问题N3。差距很大

    你的考试怎么了?它不能正确地预热JVM,即它使用一个参数来预热,另一个参数来测试。为什么这很重要?JVM使用概要文件引导的优化,例如,它统计分支的执行频率,并为特定概要文件生成“最佳”(无分支)代码。因此,我们试图用参数1预热JVM我们的基准测试,JVM生成“最佳代码”,而while循环中的分支从未执行过。以下是来自JIT编译日志(-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation)的事件:

    <branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/> 
    

    属性更改后,JIT使用不常见的陷阱来处理非最佳代码。我已经创建了一个基准,它基于原始基准,并做了一些小改动:

    • busyPause被JMH中的consumeCPU所取代,以使纯java基准测试不与系统交互(实际上nano time使用userland函数vdso clock_gettime,我们无法分析此代码)
    • 所有计算都将被删除

    _

    import java.util.Arrays;
    
    public class JvmPauseLatency {
    
        private static final int WARMUP = 2000 ;
        private static final int EXTRA = 10;
        private static final long PAUSE = 70000L; // in nanos
        private static volatile long consumedCPU = System.nanoTime();
    
        //org.openjdk.jmh.infra.Blackhole.consumeCPU()
        private static void consumeCPU(long tokens) {
            long t = consumedCPU;
            for (long i = tokens; i > 0; i ) {
                t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
            }
            if (t == 42) {
                consumedCPU += t;
            }
        }
    
        public void run(long warmPause) {
            long[] results = new long[WARMUP + EXTRA];
            int count = 0;
            long interval = warmPause;
            while(count < results.length) {
    
                consumeCPU(interval);
    
                long latency = System.nanoTime();
                latency = System.nanoTime() - latency;
    
                results[count++] = latency;
                if (count == WARMUP) {
                    interval = PAUSE;
                }
            }
    
            System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
        }
    
        public static void main(String[] args) {
            int totalCount = 0;
            while (totalCount < 100) {
                new JvmPauseLatency().run(0);
                totalCount ++;
            }
        }
    }
    

    结果是

    Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
    Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
    Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]
    

    为了修复此基准,只需将new JvmPauseLatency().run(0)替换为new JvmPauseLatency().run(PAUSE);,结果如下:

    Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
    Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]
    

    如果您想动态地更改“暂停”,您必须动态地预热JVM,即

        while(count < results.length) {
    
            consumeCPU(interval);
    
            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;
    
            results[count++] = latency;
            if (count >= WARMUP) {
                interval = PAUSE;
            } else {
                interval =  rnd.nextBoolean() ? PAUSE : 0;
            }
        }
    

    问题N4。Xint-口译员呢

    在基于开关的解释器的情况下,我们有很多问题,主要是间接分支指令。我做了3个实验:

    1. 随机预热
    2. 持续预热,0暂停
    3. 整个测试使用暂停0,包括

    每个实验都是通过以下命令sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency启动的,结果如下:

     Performance counter stats for 'java -Xint JvmPauseLatency':
    
       272,822,274,275      cycles                                                      
       723,420,125,590      instructions              #    2.65  insn per cycle         
            26,994,494      cache-references                                            
             8,575,746      cache-misses              #   31.769 % of all cache refs    
         2,060,138,555      bus-cycles                                                  
             2,930,155      branch-misses                                               
    
          86.808481183 seconds time elapsed
    
     Performance counter stats for 'java -Xint JvmPauseLatency':
    
         2,812,949,238      cycles                                                      
         7,267,497,946      instructions              #    2.58  insn per cycle         
             6,936,666      cache-references                                            
             1,107,318      cache-misses              #   15.963 % of all cache refs    
            21,410,797      bus-cycles                                                  
               791,441      branch-misses                                               
    
           0.907758181 seconds time elapsed
    
     Performance counter stats for 'java -Xint JvmPauseLatency':
    
           126,157,793      cycles                                                      
           158,845,300      instructions              #    1.26  insn per cycle         
             6,650,471      cache-references                                            
               909,593      cache-misses              #   13.677 % of all cache refs    
             1,635,548      bus-cycles                                                  
               775,564      branch-misses                                               
    
           0.073511817 seconds time elapsed
    

    在分支未命中的情况下,由于内存占用过大,延迟和占用空间呈非线性增长

  2. # 2 楼答案

    您可能无法依赖任何计时器的精度来获得您想要的精度,https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime 指出

    This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().