The first suspicious piece of code that drew our attention was CodeNarc. CodeNarc is a source code quality analysis tool for Groovy, used by a lot of Groovy developers, including in Gradle itself (since Gradle makes intensive use of Groovy). CodeNarc can be seen as the equivalent of FindBugs for Groovy code. The problems seemed to start with an upgrade of the Gradle CodeNarc plugin to use CodeNarc 0.23. We actually saw reports like this one in the forums or this other one on GitHub, but thought that the PermGen error was just a consequence of CodeNarc including more rules: rules are written in Groovy, so they are compiled down to classes, and classes eat PermGen. So increasing the PermGen space was enough, and it did actually solve the error. Problem solved. Or not. The riddle only started for me with a seemingly insignificant question on our internal mailing lists: "Can someone investigate why our build sometimes fails with a PermGen space error?", and I volunteered.
Interestingly, I had just finished pushing an upgrade of Gradle to Groovy 2.4.4 on our master branch, and I had noticed that I had to increase the PermGen space too. At first, I naively thought that it was required because Groovy 2.4 consumed more memory, but I was wrong. I should have known, because before joining Gradle, I had actually worked on Groovy for Android, and a consequence of this work was that Groovy 2.4 had a reduced memory footprint: we generate less bytecode, which directly translates into reduced PermGen usage. So why on earth would Groovy 2.4 require more memory? And what is the relation with the CodeNarc plugin? Actually, this plugin runs in a Gradle version that uses Groovy 2.3.10, so why would there be a relation between the two?
In such a case, your best friend is a profiler. But as I will explain here, it can also lead you down the wrong track. Be careful. The second best friend is the pair of JVM options -XX:+TraceClassLoading and -XX:+TraceClassUnloading. I also used -XX:SoftRefLRUPolicyMSPerMB=0, an option I had no idea existed until my friend David Gageot told me about it. Basically, it forces the garbage collector to aggressively collect all soft references, which is very useful, in combination with the two other options, to understand which classloader we are leaking memory from.
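As a complement to those flags, the JVM also exposes class loading counters programmatically. The snippet below is only a minimal sketch, not something used in the investigation described here: the class name ClassLoadingStats is made up for illustration, while ClassLoadingMXBean is the standard java.lang.management API.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper (name invented for this example).
class ClassLoadingStats {
    /** Formats the JVM-wide class loading counters as a single line. */
    static String snapshot() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return "loaded now=" + bean.getLoadedClassCount()
            + ", total loaded=" + bean.getTotalLoadedClassCount()
            + ", unloaded=" + bean.getUnloadedClassCount();
    }

    public static void main(String[] args) {
        // Printing this before and after a suspicious operation shows
        // whether the "unloaded" counter ever moves.
        System.out.println(snapshot());
    }
}
```

If the unloaded counter stays flat while the loaded count keeps growing across repeated runs of the same task, that points in the same direction as the trace output: classes are being loaded and never released.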
The first wrong track was actually thinking that CodeNarc was the source of the problem. I wrote an email to the gradle-dev list explaining my findings, and I had indeed found that a lot of classes from CodeNarc were not unloaded. Before I go further, let’s explain how the JVM is supposed to behave with regards to classes in Java 7. We all know that objects are garbage collected, but a lot of people believe that classes are not. Classes are stored in the PermGen space (which has disappeared in Java 8, but that’s another story), a segregated area of the JVM memory. And in Java, a class is loaded by a classloader. There is a strong reference between a class and its classloader. But what the JVM is able to do is actually simple: if no instance of the class is strongly reachable, and the classloader is not strongly reachable either, then both the class and the classloader can be unloaded. This means that PermGen can be recovered, and it is pretty useful, especially for a language like Groovy which can generate a lot of classes at runtime.
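The unloading rule described above can be observed with a small experiment. This is a sketch under assumptions, not code from Gradle or Groovy: the class UnloadingSketch and its methods are invented for illustration, and a dynamic proxy is used only as a convenient way to get a class defined by a throwaway loader without needing a jar on disk. Whether and when the loader is actually collected depends on the JVM and its GC settings.

```java
import java.lang.ref.WeakReference;
import java.lang.reflect.Proxy;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo class (all names here are illustrative).
class UnloadingSketch {
    /** Defines a proxy class for Runnable in the given loader and returns it. */
    static Class<?> defineIn(ClassLoader loader) {
        Object proxy = Proxy.newProxyInstance(
            loader, new Class<?>[]{Runnable.class}, (p, m, a) -> null);
        return proxy.getClass();
    }

    /**
     * Defines a class in a fresh isolated loader and returns only a weak
     * reference to that loader, so nothing here keeps it strongly reachable.
     */
    static WeakReference<ClassLoader> loadInIsolation() {
        ClassLoader loader = new URLClassLoader(new URL[0], null);
        defineIn(loader);
        return new WeakReference<>(loader);
    }

    /** Requests a GC a few times and reports whether the reference was cleared. */
    static boolean clearedAfterGc(WeakReference<?> ref, int attempts) {
        for (int i = 0; i < attempts && ref.get() != null; i++) {
            System.gc(); // only a hint; the JVM is free to ignore it
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }

    public static void main(String[] args) {
        WeakReference<ClassLoader> ref = loadInIsolation();
        System.out.println("loader collected: " + clearedAfterGc(ref, 20));
    }
}
```

Keeping a strong reference to the loader, to the proxy class, or to any instance of it anywhere in the program would be enough to prevent the weak reference from being cleared, which is exactly the kind of situation a classloader leak creates.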
In Gradle, and particularly in the Gradle CodeNarc plugin, CodeNarc is executed through an Ant task, which spawns its own isolated classloader containing both the CodeNarc and Groovy classes. So when the plugin execution is finished, if we do not keep a reference to the classloader, its classes should be garbage collected. So a good candidate for the memory leak was actually the IsolatedAntBuilder that Gradle uses to execute the Ant task. And guess what? There is such a leak, because the DefaultIsolatedAntBuilder performs classloader caching! That was also discovered by my colleague Sterling, who immediately spotted it: while we do cache the classloaders, keeping a strong reference to them, we don’t have any code to release a classloader in case of memory pressure. Conclusion: we’ve found the memory leak, hurray! And it has nothing to do with CodeNarc or Apache Groovy, phew!
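To make the caching problem concrete, here is one way a classloader cache could avoid pinning its entries: hold them through soft references, so the garbage collector is allowed to reclaim a loader that nobody else uses. This is a hypothetical sketch, not Gradle's DefaultIsolatedAntBuilder code nor the fix that was applied there; SoftClassLoaderCache and its get method are names invented for this example.

```java
import java.lang.ref.SoftReference;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache, for illustration only.
class SoftClassLoaderCache {
    private final Map<String, SoftReference<ClassLoader>> cache = new HashMap<>();

    /** Returns the cached loader for 'key', recreating it if the GC reclaimed it. */
    synchronized ClassLoader get(String key, URL[] classpath) {
        SoftReference<ClassLoader> ref = cache.get(key);
        ClassLoader loader = (ref == null) ? null : ref.get();
        if (loader == null) {
            // Null parent: only bootstrap classes are visible besides 'classpath'.
            loader = new URLClassLoader(classpath, null);
            cache.put(key, new SoftReference<>(loader));
        }
        return loader;
    }

    public static void main(String[] args) {
        SoftClassLoaderCache cache = new SoftClassLoaderCache();
        ClassLoader first = cache.get("codenarc", new URL[0]);
        ClassLoader second = cache.get("codenarc", new URL[0]);
        System.out.println("same loader reused: " + (first == second));
    }
}
```

A cache like this only helps if the loader is not strongly reachable from anywhere else; a softly referenced entry cannot be reclaimed while something else still holds on to the loader or its classes.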
So I immediately tried to disable the cache, which turned out to be pretty trivial. I ran the build again and… another PermGen space error. No CodeNarc classes unloaded, no Groovy classes unloaded. Wow. So the problem wasn’t solved. First "oh my!" moment of the week: there was another leak.
One test I did then was to completely comment out the code that, in the Ant builder call, defined the CodeNarc task. Eventually, this was the only code left in the CodeNarc plugin:
@TaskAction
void run() {
    logging.captureStandardOutput(LogLevel.INFO)
    def classpath = new DefaultClassPath(getCodenarcClasspath())
    antBuilder.withClasspath(classpath.asFiles).execute {
        // ... thou shalt not leak!
    }
}
I executed the code again, and there was definitely a leak: after several loops, a PermGen error occurred. CodeNarc was ruled out as the source of the leak. After some hours of trials and studying memory snapshots, I eventually came up with a piece of code that reproduces the problem independently of Gradle:
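int i = 0;
try {
    while (true) {
        i++;
        // Isolated loader whose classpath only contains the Groovy jar
        // (GROOVY_JAR is the path to that jar)
        URLClassLoader loader = new URLClassLoader(
            new URL[]{new File(GROOVY_JAR).toURI().toURL()},
            ClassLoader.getSystemClassLoader().getParent());
        // Initialize the Groovy runtime by asking for the metaclass registry
        Class<?> system = loader.loadClass("groovy.lang.GroovySystem");
        system.getDeclaredMethod("getMetaClassRegistry").invoke(null);
        // Nothing from this iteration is kept, so the loader should become collectable
        loader.close();
    }
} catch (OutOfMemoryError e) {
    System.err.println("Failed after " + i + " loadings");
}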
As you can see, the code is very simple: it creates a new isolated classloader, which only contains Groovy on its classpath. Then it triggers the creation of the Groovy runtime by asking for the metaclass registry, and finally it closes the classloader. On my JVM, after about 40 runs, the code fails with a PermGen space error: despite the fact that no class and no object is kept outside of the classloader, the Groovy runtime is not unloaded, leading to a memory leak. The key point here is that I had noticed some oddities during my hours of debugging: a class named ClassInfo was at the center of those oddities.
In particular, although YourKit (the profiler I was using) was telling me that all classes and all classloaders were only weakly or softly reachable (by the way, it’s really a pity that YourKit doesn’t distinguish weakly referenced objects from softly referenced ones), the classes were not garbage collected. Also strangely, some classes appeared as GC roots, meaning they could be collected, but they weren’t! And when I navigated through some of the duplicate classes I was seeing, ClassInfo was present, as the value of a map entry of a class value. Here we are: I had found the real source of the leak. Something had changed in Groovy. And the fact that CodeNarc had been leaking since 0.23 was just a side effect of upgrading its dependency to Groovy 2.4! So although Gradle was using Groovy 2.3.10, CodeNarc, by default, was using a more recent version of Groovy. That doesn’t explain why it leaks yet, but we now know who is responsible: Groovy.
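For readers unfamiliar with the term, the "class value" mentioned above presumably refers to java.lang.ClassValue, the JDK mechanism for attaching a lazily computed value to a class (this reading is an assumption; the text does not name the type explicitly). The sketch below only shows how that API behaves in general. It is not Groovy's code, and the Info type is a made-up stand-in for per-class metadata such as ClassInfo.

```java
// Hypothetical demo of java.lang.ClassValue (names invented for illustration).
class ClassValueSketch {
    /** Stand-in for per-class metadata. */
    static final class Info {
        final String name;
        Info(String name) { this.name = name; }
    }

    /** Associates one lazily computed Info with each class it is asked about. */
    static final ClassValue<Info> INFO = new ClassValue<Info>() {
        @Override
        protected Info computeValue(Class<?> type) {
            return new Info(type.getName());
        }
    };

    public static void main(String[] args) {
        Info first = INFO.get(String.class);
        Info second = INFO.get(String.class); // served from the per-class entry
        System.out.println(first == second);
        System.out.println(first.name);
    }
}
```

The relevant point for a leak investigation is that the computed value is stored in an entry tied to the class itself, so whatever that value references stays reachable for as long as the entry does.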