Memory management in LWJGL

Java 1.4 was released in early 2002. LWJGL was created a few months later.

This was not a coincidence. Java 1.4 finally made it possible to access off-heap memory from Java and to interoperate efficiently with native code. Using java.nio and direct ByteBuffer instances, Java developers could:

  • Allocate off-heap memory.
  • Fill buffers with data and pass them down to native code, without expensive copies.
  • Access the contents of buffers managed by external libraries.

LWJGL 3 was born in late 2012. It is a clean rewrite of LWJGL, officially released in June 2016. Like its predecessor, it provides bindings to popular native libraries via JNI and uses NIO buffers to pass data around. The primary goal of the library is to provide the best performance/usability/safety balance possible, at least until Project Panama is released (hopefully) in Java 10.

In the meantime, NIO buffers have been getting a bad rap, which is mostly justified. There are a lot of complaints about how they're designed, how they're used (or not used) in various APIs, and how they perform. The purpose of this post is to explain how LWJGL 3 addresses these issues.

Disclaimer:

The following applies to the platforms LWJGL supports: Linux, macOS and Windows. LWJGL requires Java 8 and we have tested with OpenJDK, Oracle JDK, Azul Zulu and IBM JDK. The decisions made may or may not be optimal on other JDKs/JVMs. LWJGL does not support (or even work on) Android yet, but it's almost certain that different strategies will have to be used there.

Strategy #1

Heap buffers are strictly forbidden. All buffers must be off-heap (DirectByteBuffer and derivatives) to be accepted by the LWJGL API.

Heap buffers might be cheap to allocate and trivial to clean up (via standard GC), but they force data copies when passed to native code.

They needlessly complicate all code paths that accept NIO buffers.

They make a lot of call sites bimorphic or even megamorphic. An application that uses direct buffers exclusively can do anything. An application that allows heap buffers cannot avoid direct buffers (e.g. when dealing with data managed by native libraries). This mix of concrete types means the JVM will have a harder time optimizing code.

All ByteBuffer instances created by LWJGL are in the native byte order. Views are always naturally aligned to the element type. If users are also careful with alignment, all buffers will be instantiated from the same concrete class on all platforms (e.g. DirectIntBufferU for the IntBuffer abstract class). You can easily have an application where all buffer call sites are monomorphic.
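To make this concrete, here is a small sketch in plain java.nio. The names DirectOnly, requireDirect and nativeInts are hypothetical; this is not how LWJGL implements its checks, only an illustration of the rule.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical helper, for illustration only.
final class DirectOnly {
    /** Rejects heap buffers, in the spirit of strategy #1. */
    static IntBuffer requireDirect(IntBuffer buffer) {
        if (!buffer.isDirect()) {
            throw new IllegalArgumentException("heap buffers are not supported");
        }
        return buffer;
    }

    /** Creates a direct, native-order, naturally aligned int view. */
    static IntBuffer nativeInts(int count) {
        return ByteBuffer.allocateDirect(count * Integer.BYTES)
            .order(ByteOrder.nativeOrder()) // must be set before creating the view
            .asIntBuffer();
    }
}
```

Every buffer created by nativeInts() has the same concrete class, no matter its size, which is what keeps call sites monomorphic. A buffer from IntBuffer.allocate() has a different concrete class and is rejected by requireDirect().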

Strategy #2

Buffer instances passed to LWJGL APIs are never mutated.

This applies to the POJO wrappers themselves, not the off-heap memory they point to. In C terms, we treat e.g. an IntBuffer argument as a const pointer to int, not as a pointer to const int.

Standard NIO code typically changes the position of buffers it consumes, up to the point where data reads or writes occurred. This causes redundant (heap) mutations and requires a lot of flip(), reset() and clear() calls in user code.

With LWJGL, the values returned by position() and limit() after a call are exactly the same as before. This leads to cleaner, easier-to-understand code.

For the same reason, the use of absolute gets and puts is encouraged when writing LWJGL code.
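The difference between absolute and relative access is easy to see in plain java.nio. The sketch below uses made-up names (AbsoluteAccess, sumRemaining, sumRemainingRelative) and does not call any LWJGL API.

```java
import java.nio.IntBuffer;

// Hypothetical helper, for illustration only.
final class AbsoluteAccess {
    /** Sums the remaining elements with absolute gets; position and limit stay untouched. */
    static long sumRemaining(IntBuffer buffer) {
        long sum = 0;
        for (int i = buffer.position(); i < buffer.limit(); i++) {
            sum += buffer.get(i); // absolute get: the cursor does not move
        }
        return sum;
    }

    /** Same result with relative gets, but the buffer ends up with position == limit. */
    static long sumRemainingRelative(IntBuffer buffer) {
        long sum = 0;
        while (buffer.hasRemaining()) {
            sum += buffer.get(); // relative get: advances the position
        }
        return sum;
    }
}
```

After sumRemaining() the caller can keep using the buffer as is. After sumRemainingRelative() the caller has to restore the position before the data can be read again.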

Strategy #3

The JNI functions NewDirectByteBuffer, GetDirectBufferAddress and GetDirectBufferCapacity are never used.

In pure Java, you cannot:

  • create direct ByteBuffer instances at arbitrary memory addresses and
  • retrieve the memory address of a direct buffer instance.

The officially supported way to do this is in JNI code, using the above functions. This is a serious problem.

In LWJGL, buffer instances are never passed to or from native code:

  • The address of a buffer at its current position() is extracted using Unsafe and passed to JNI code as a jlong, not as a jobject. Then it is cast to the appropriate pointer type and used directly.
  • The address of buffers managed by native libraries is returned to Java code as jlong values, not as jobject. In Java, it is wrapped into the appropriate buffer instance (using Unsafe) and used directly.

First, let's address the elephant in the room. Yes, we do use Unsafe for this. We're not happy about it either, but there is no alternative. VarHandles in Java 9 will not help, because we're accessing private implementation details. Unless a pure Java alternative is provided in Java 9, this is the best we can do until Java 10 and Project Panama (it looks like we'll stop using NIO buffers then anyway).

So, why go through all this trouble? Well, actually it's no trouble at all, because it greatly reduces the amount of JNI code required. Not only are the above functions never called, but all pointer arithmetic also moves to Java code (GetDirectBufferAddress returns the base buffer address, regardless of the current position() and buffer element type). In addition to smaller JNI functions, LWJGL is able to generate Java method overloads with different buffer types, all reusing the same native method. Fixing bugs is also easier: using standard Java tools and debuggers, you can easily identify issues, since the native code does basically nothing and all the relevant information is on the Java side.
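A rough sketch of what "pointer arithmetic in Java" and "overloads reusing the same native method" can look like follows. Everything here is hypothetical: PointerMath, nUpload and upload are invented names, nUpload is a plain Java stand-in for a native method that takes a jlong, and the base address is passed in as a parameter because reading it from a real buffer requires Unsafe.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

// Hypothetical binding sketch, for illustration only.
final class PointerMath {
    // Stand-in for a JNI method that receives a raw address.
    // A real binding would declare it `native`; this one just echoes the address.
    static long nUpload(long address) {
        return address;
    }

    // baseAddress is assumed to be known already (LWJGL reads it via Unsafe).
    static long upload(long baseAddress, ByteBuffer data) {
        return nUpload(baseAddress + data.position()); // 1 byte per element
    }

    static long upload(long baseAddress, IntBuffer data) {
        return nUpload(baseAddress + ((long) data.position() << 2)); // 4 bytes per element
    }
}
```

Both overloads end up in the same address-taking function. The only per-type difference is the element size used to scale position(), and that logic lives on the Java side where a debugger can see it.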

But the greatest benefit is a different one, and it is what sets LWJGL apart from other libraries. Passing Java objects to and from native code makes them escape, by definition. This means that escape analysis can never eliminate allocations of such objects when dealing with standard JNI code.

LWJGL's approach, on the other hand, lets escape analysis do its job. In the vast majority of cases, where buffers are allocated locally, used in native methods, then discarded, EA works great and eliminates the Java-side allocations. When it doesn't work, the JVM provides sufficient diagnostic flags (e.g. -XX:+PrintEscapeAnalysis) that enable troubleshooting. Usually a little bit of refactoring and writing cleaner code resolves the issue.

Strategy #4

Use of ByteBuffer.allocateDirect() is highly discouraged.

LWJGL versions before 3 relied exclusively on allocateDirect(), via the org.lwjgl.BufferUtils class. This class is still there in 3 for backwards compatibility, but its use is highly discouraged. The reason is simple; allocateDirect() is horrible:

  • It is slow, much slower than a raw malloc() call. It adds a lot of overhead on top of a function that is already slow.
  • It scales badly under contention.
  • It arbitrarily limits the amount of allocated memory (-XX:MaxDirectMemorySize).
  • Like Java arrays, the allocated memory is always zeroed-out. This is not necessarily bad, but having the option would be better.
  • There's no way to deallocate the allocated memory on demand (without JDK-specific reflection hacks). Instead, a reference queue is used that usually requires two GC cycles to free the native memory. This quite often leads to OOM errors under pressure.

All of that contributed to users adopting one or more bad practices when dealing with LWJGL code:

  • Using buffer instance pools.
  • Using global buffer instances, often not in a concurrent-safe way.
  • Allocating big buffers and writing their own "memory allocator" on top, often with far from optimal results with respect to performance, memory utilization & fragmentation, etc.

The recommended way to deal with these issues in LWJGL 3 is via explicit memory management. We have the org.lwjgl.system.MemoryUtil class that uses a configurable memory allocator and exposes a user-friendly API, but other allocators can be used directly, as dictated by application requirements.

We have the org.lwjgl.system.libc.Stdlib class that basically exposes the system default memory allocator. It provides the functions you would expect: malloc, calloc, realloc, free, aligned_alloc, aligned_free.

We also bundle the jemalloc library with LWJGL. It provides the same functions as above and many more for specialized use-cases. This is the default allocator used in MemoryUtil (when available and not overridden), because:

  • It is generally faster than the system default.
  • It scales better under concurrent allocations.
  • It is highly configurable and tunable for specific applications.
  • The build that comes with LWJGL is tuned for performance, but it can easily be replaced with a custom build that enables more features: statistics, debugging, leak detection, etc.

I will not discuss here how explicit memory management is not the "Java way"; that's a topic for another post. Hint: I mostly agree, but not in the context of interoperating with native systems.

Strategy #5

Always prefer stack allocation.

As fast as jemalloc is, stack allocation is always faster. This has always been an area in which Java could not compete with C code. Fortunately, LWJGL offers a solution for that too.

The solution is the org.lwjgl.system.MemoryStack class, which exposes an API for stack allocations. The recommendation is that any small, short-lived buffer/struct allocation should happen via the stack API. This approach shines even in tight loops and shows how important strategy #3 is. Example code (using try-with-resources, one of the possible code styles):

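// OpenGL code
// (assumes static imports of org.lwjgl.system.MemoryStack.*, org.lwjgl.opengl.GL11.* and org.lwjgl.opengl.GL13.*)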
try ( MemoryStack stack = stackPush() ) { // thread-local lookup
    IntBuffer ip = stack.mallocInt(1); // ip instance eliminated
    glGetIntegerv(GL_NUM_COMPRESSED_TEXTURE_FORMATS, ip);

    IntBuffer formats = stack.mallocInt(ip.get(0)); // formats instance eliminated
    glGetIntegerv(GL_COMPRESSED_TEXTURE_FORMATS, formats);

    // ...
} // stack is automatically popped, buffers are freed

Applicable to structs as well:

// Vulkan code
try ( MemoryStack stack = stackPush() ) {
    VkCommandBufferBeginInfo cbi = VkCommandBufferBeginInfo.mallocStack(stack)
        .sType(VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO)
        .pNext(NULL)
        .flags(0)
        .pInheritanceInfo(
            VkCommandBufferInheritanceInfo.mallocStack(stack) // nested struct
                .sType(VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO)
                .pNext(NULL)
                .renderPass(VK_NULL_HANDLE)
                .subpass(0)
                .framebuffer(VK_NULL_HANDLE)
                .occlusionQueryEnable(false)
                .queryFlags(0)
                .pipelineStatistics(0)
        );

    int err = vkBeginCommandBuffer(cmd, cbi);
    check(err);

    // ...
} // structs are automatically freed
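To show why the push/pop discipline used in the examples above is so cheap, here is a minimal bump-pointer stack over a single direct buffer. This is a sketch under stated assumptions, not LWJGL's MemoryStack implementation: the class TinyStack and all of its methods are hypothetical, it is not thread-local, it supports at most 16 nested frames, and it does no overflow checking on the requested size.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Hypothetical bump allocator, for illustration only.
final class TinyStack implements AutoCloseable {
    private final ByteBuffer memory;          // one block, allocated once up front
    private final int[] frames = new int[16]; // saved offsets, one per push()
    private int frameIndex;
    private int offset;                       // bump pointer: next free byte

    TinyStack(int capacity) {
        memory = ByteBuffer.allocateDirect(capacity);
    }

    /** Remembers the current offset so that pop() can return to it. */
    TinyStack push() {
        frames[frameIndex++] = offset;
        return this;
    }

    /** Frees everything allocated since the matching push() by resetting one int. */
    void pop() {
        offset = frames[--frameIndex];
    }

    @Override
    public void close() {
        pop();
    }

    /** Carves `count` ints out of the block; like malloc, the contents are not cleared. */
    IntBuffer mallocInt(int count) {
        int start = (offset + 3) & ~3; // round up to 4-byte alignment
        int end = start + count * Integer.BYTES;
        if (end > memory.capacity()) {
            throw new IllegalStateException("stack exhausted");
        }
        offset = end;
        ByteBuffer window = memory.duplicate(); // leaves `memory` itself unmodified
        window.position(start);
        window.limit(end);
        return window.slice().order(ByteOrder.nativeOrder()).asIntBuffer();
    }

    int used() {
        return offset;
    }
}
```

Allocation is an addition and a bounds check, and freeing a whole frame is a single assignment, which is why this style works well even in tight loops. Usage mirrors the try-with-resources style above: `try (TinyStack frame = stack.push()) { IntBuffer ip = frame.mallocInt(1); }`.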

Strategy #6

Offer tools for debugging.

Having great performance is nice, but bugs related to explicit memory management can be hard to track down. LWJGL helps with:

  • Basic checks. These are enabled by default and ensure basic invariants with respect to API usage. The most important are related to buffers that cannot be NULL and buffer sizes that must be within certain ranges. This applies to both method parameters and struct members.
  • A debug mode. When enabled, LWJGL performs additional checks that may be too expensive to perform by default.
  • A debug memory allocator. When enabled, all allocations are tracked and any leaks are reported when the JVM process ends. Each leak report includes how many bytes were allocated and where exactly (with a full stack trace). Tracking is enabled for allocations via MemoryUtil as well as native-to-Java callback instances.
  • A stack allocation debugger. When enabled, LWJGL reports methods that have a broken symmetry between stack push & pop.
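The idea behind a debug allocator can be sketched in a few lines of plain Java. This is not LWJGL's implementation; LeakTracker and its methods are hypothetical, and the addresses are arbitrary long keys that a real allocator wrapper would supply.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical leak tracker, for illustration only.
final class LeakTracker {
    /** What we remember about each live allocation. */
    private static final class Record {
        final long bytes;
        final Throwable site; // captures the stack trace of the allocating call

        Record(long bytes) {
            this.bytes = bytes;
            this.site = new Throwable("allocated here");
        }
    }

    private final Map<Long, Record> live = new ConcurrentHashMap<>();

    /** Call right after a successful allocation. */
    void onAlloc(long address, long bytes) {
        live.put(address, new Record(bytes));
    }

    /** Call right before freeing; catches double frees and unknown pointers. */
    void onFree(long address) {
        if (live.remove(address) == null) {
            throw new IllegalStateException(
                "free of untracked address 0x" + Long.toHexString(address));
        }
    }

    int liveCount() {
        return live.size();
    }

    long leakedBytes() {
        long total = 0;
        for (Record r : live.values()) {
            total += r.bytes;
        }
        return total;
    }

    /** Prints every allocation that was never freed, with its stack trace. */
    void report() {
        live.forEach((address, r) -> {
            System.err.println(r.bytes + " bytes leaked at 0x" + Long.toHexString(address));
            r.site.printStackTrace();
        });
    }
}
```

Calling report() from a hook registered with Runtime.getRuntime().addShutdownHook() would give the "report when the JVM process ends" behavior. Capturing a Throwable per allocation is expensive, which is a good reason to keep this kind of tracking opt-in.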

Summary

All of the above strategies contribute to the final "look and feel" of writing code for LWJGL, which can be summed up as: clean and efficient.

The greatest insight we gained while working on LWJGL 3 is that escape analysis works fantastically well in Java 8+. After several iterations, the final API does very little to encourage buffer instance reuse; it's simply not needed. Writing clean Java code, short and simple, does wonders not only for pure Java, but also when interacting with native libraries.

There are many very useful native libraries out there. Waiting for a pure Java port or an abstraction in the JDK, just because of how ugly and inefficient JNI code usually is, should be a thing of the past. LWJGL's approach works right now. Soon, Project Panama will eliminate most (if not all) of the remaining overhead and, together with value types and any-generics, will provide an extra level of type safety, bringing a very strong FFI solution to Java.