Massive: the asm.js benchmark

Run the benchmarks now

Benchmarks will take a while to run, and some browsers may become less responsive for part of that time.



asm.js appears in one test in Octane and in several tests in JetStream, which is great, but those suites do not fully measure several aspects of asm.js performance. In particular, asm.js often appears as large (you might even say massive) source files, which can have different performance characteristics than the typical small programs appearing in most benchmarks. For example, a very large codebase may contain very large functions, which due to their size are difficult to optimize efficiently (they may be left not fully optimized, or optimized so slowly that the pause is noticeable), or the sheer number of functions may be very high and cause the browser to pause as the codebase is parsed and starts to execute.

Such very large codebases can therefore bring new challenges to JavaScript engines (or rather, more extreme versions of familiar challenges), and it is important to measure performance on them because they are showing up with increasing frequency on the web (for example, as native plugins fade out, game engine companies like Unity and Epic are starting to compile their large codebases to asm.js). For these reasons, the Massive benchmark includes several very large codebases (Poppler, SQLite, etc.), and measures throughput as well as responsiveness, variability and startup time (see details below).

The Emscripten benchmark suite evolved over time in order to benchmark Emscripten itself; it therefore mainly focuses on throughput, and is runnable in both shell and browser. Massive, on the other hand, tests not just throughput but also browser responsiveness and other factors that only make sense when running in a browser - things not measured by the Emscripten benchmark suite (or by the main JavaScript benchmarks).

Main Thread Responsiveness measures the user experience as a large codebase is loaded: it tests whether the main thread stalls while the codebase is prepared and briefly executed. The score here can be improved by parsing the code off the main thread, for example. This does not measure how much time is spent, only how responsive or unresponsive the user experience is (how much time is spent is measured by Preparation, and to some extent Throughput). Technically, we measure responsiveness by checking whether events on the main thread fire at the proper interval (when the main thread stalls, it stalls both the user experience and other events).
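The interval-based measurement just described can be sketched roughly like this (this is an illustrative sketch, not Massive's actual code; the function names and the 10ms interval are assumptions):

```javascript
// Given timestamps (ms) of events that were scheduled every
// `intervalMs`, return the worst observed delay beyond the expected
// spacing - a stalled main thread shows up as a large gap.
function worstEventDelay(timestamps, intervalMs) {
  let worst = 0;
  for (let i = 1; i < timestamps.length; i++) {
    const gap = timestamps[i] - timestamps[i - 1];
    worst = Math.max(worst, gap - intervalMs);
  }
  return worst;
}

// In a browser, the timestamps would be collected with setInterval
// while the script tag for the large codebase is being prepared:
//   const stamps = [];
//   const id = setInterval(() => stamps.push(performance.now()), 10);
//   /* add the script tag, wait a while, then clearInterval(id) */
//   worstEventDelay(stamps, 10); // worst stall, in ms
```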

Throughput measures how fast a large computational workload runs. This is what is typically measured by benchmarks. Massive's throughput tests focus on very large real-world codebases.

Preparation measures how much wall-clock time is spent getting a codebase ready to execute, before any of it runs. It measures the time between adding a script tag with that code and being able to call the code (this may or may not cause a user-noticeable pause, depending on whether the code is parsed on or off the main thread; Main Thread Responsiveness tests that aspect). "Preparation" is basically all the time before code is actually able to run; depending on the JS engine, that may include parsing, conversion to bitcode, JIT compilation, etc.
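Outside a browser, the same idea can be sketched by timing how long it takes to turn a source string into something callable. In the browser Massive does the analogous thing with a script tag; `new Function` here is a stand-in assumption, and the helper name is illustrative:

```javascript
// Measure wall-clock time between receiving a source string and having
// a callable function - a rough stand-in for script-tag preparation.
function measurePreparation(source) {
  const start = Date.now();
  const fn = new Function(source); // parse (and possibly compile)
  const elapsed = Date.now() - start;
  return { fn, elapsed };
}

// Usage: a (tiny) stand-in for a large asm.js source string.
const { fn, elapsed } = measurePreparation('return 1 + 1;');
// fn() is now callable; `elapsed` is the preparation wall time in ms.
```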

Variance measures how variable the frame rate is in an application that must do work in every frame (this is important in things like games, which must finish all their per-frame work within 1/60 of a second in order to be smooth). Specifically, we run many frames and then calculate the statistical variance and the worst case. Note that one VM might have a much faster overall frame rate than another, but also more variance: in general, given two VMs with the same average, the one with less variance is "better", since it is smoother. But given different means, things are less clear (perhaps we are happy to accept some average slowdown in order to reduce variance, which can cause rare but noticeable pauses?). Hence we measure variance separately from throughput (which measures total speed, and is proportional to the average).
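The calculation described above - variance and worst case over many frame times - can be sketched as follows (the function name is illustrative, not Massive's internals):

```javascript
// Given per-frame times in ms, compute the mean, the (population)
// statistical variance, and the worst-case frame time.
function frameStats(frameTimes) {
  const n = frameTimes.length;
  const mean = frameTimes.reduce((a, b) => a + b, 0) / n;
  const variance =
    frameTimes.reduce((a, t) => a + (t - mean) * (t - mean), 0) / n;
  const worst = Math.max(...frameTimes);
  return { mean, variance, worst };
}

// Two runs with the same 20ms average: the steadier one is "better".
frameStats([20, 20, 20, 20]); // variance 0, worst 20
frameStats([10, 30, 10, 30]); // variance 100, worst 30
```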

Most of the tests, in particular the throughput ones, are generally very consistent, as we run a deterministic workload in a web worker, which minimizes outside noise. We also run a few repetitions and average the results. However, the Main Thread Responsiveness tests in particular need to run on the main thread, and they involve DOM events like adding a script tag, setInterval, etc., which can be fairly variable. We run a larger number of repetitions on those tests to average out the noise, but even so they appear less consistent between runs on some browsers.

When we see the results of a test are too variable, we mark it with "(±X%!)" next to the score. The cause of such variability might be something else on your machine (perhaps a background indexing service happened to use a CPU core during a test, etc.), or it might be that the browser behaves unpredictably for some reason.
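A minimal sketch of this kind of annotation, assuming repeated samples of the same score and a threshold of 10% (both the threshold and the helper name are assumptions, not Massive's actual values):

```javascript
// Annotate a mean score with "(±X%!)" if the worst deviation across
// repetitions exceeds thresholdPct percent of the mean.
function annotateScore(samples, thresholdPct = 10) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const maxDev = Math.max(...samples.map(s => Math.abs(s - mean)));
  const pct = Math.round((100 * maxDev) / mean);
  return pct > thresholdPct ? `${mean} (±${pct}%!)` : `${mean}`;
}
```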

All the benchmarks here are from real-world C or C++ codebases:

  • Box2D: A 2D physics engine, used in many games, for example Angry Birds. Stresses floating-point processing performance. The workload is based on jgw's bench2D. (~30KLOC)
  • Lua: A scripting language that is used in many games as well as on Wikipedia. Here the entire Lua VM is compiled down to JavaScript, including interpreter, garbage collector, etc. The workloads used are the scimark and binarytrees benchmarks, which test raw computation and garbage collection, respectively. (~16KLOC)
  • Poppler: A PDF rendering engine, used by many applications, for example LibreOffice. Rendering PDFs requires many capabilities (font rendering, graphics, etc.), making this the largest of the codebases tested here, especially since it is built together with the FreeType font rendering library. The workload is Lawrence Lessig's "Free Culture". (~250KLOC)
  • SQLite: A complete transactional SQL database engine. Parsing and executing SQL queries is done using a large interpreter-loop type function, which is challenging to optimize. The workload is the SQLite speedtest1.c benchmark, which SQLite devs constructed to represent real-world usage patterns. (~128KLOC)

All of these codebases are open source, so you can build and inspect them yourself (the build tool, Emscripten, is of course open source as well).

Note that the KLOC numbers mentioned above do not include system libraries like libc and libc++, even though the necessary parts of those libraries are included in the benchmarks.

Generally quite a while, as Massive is designed to execute fixed workloads of sufficient length to measure real-world performance on large applications. How long it takes will depend on the machine and browser, of course, but you can probably expect it to take at least a few minutes on a desktop or laptop machine (a mobile device may take much longer). Massive should not lock up your browser as it runs, however - except for the Main Thread Responsiveness tests, which run first, the benchmarks run in web workers (and even the Main Thread Responsiveness tests should not reduce responsiveness very much). Note that results of individual benchmarks show up as soon as they are ready, so you can view them before all of Massive is complete.

Sure, use a URL like index.html?box2d-throughput,box2d-throughput-f32 to run just those benchmarks.
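The obvious reading of that URL format - a comma-separated list of benchmark names in the query string - can be sketched like this (the exact mechanism in Massive may differ; the function name is an assumption):

```javascript
// Parse a query string like "?box2d-throughput,box2d-throughput-f32"
// into a list of benchmark names; an empty query selects everything.
function parseBenchmarkList(search) {
  const query = search.replace(/^\?/, '');
  return query ? query.split(',') : [];
}
```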

Some calculations have an "absolute optimal" result. For example, Variance measures how variable the frame rate is. If the frame rate is practically steady - no jumping around at all - then the result is the maximum score of 10,000. For practical reasons, there is an absolute threshold: in the case of Variance, anything under 5ms is considered perfect. This avoids large differences between results like 2ms and 4ms (double the variance in the second!), because 5ms is already so small as to be below the threshold of noticeability.
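The thresholding described above can be sketched as a scoring function. The 5ms cutoff and 10,000 maximum come from the text; how the score falls off above the threshold is an illustrative assumption:

```javascript
const PERFECT_SCORE = 10000;
const THRESHOLD_MS = 5;

// Anything under the 5ms threshold is treated as perfect, so tiny
// differences (2ms vs 4ms) do not change the score; above it, the
// score scales down with the measured jitter.
function varianceScore(jitterMs) {
  if (jitterMs < THRESHOLD_MS) return PERFECT_SCORE;
  return Math.round(PERFECT_SCORE * (THRESHOLD_MS / jitterMs));
}
```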