29 August 2015

Tags: jvm groovy gradle permgen yourkit

A challenge at Gradle

A new job

This is my first blog post after 4 months being a full time employee of Gradle Inc.! I am pretty excited by this job, even though so far it didn’t give me much time to contribute to Apache Groovy (but we did manage to release 2.4.4 though). One of the reasons I joined the company was because I love technical challenges. And Gradle has a lot of them, with an incredible team of smart people working together to make software automation better. This week, I worked on my first true challenge, and I must confess that I miserably failed :-) This is a long post, and anyone who ever fought against the infamous "PermGen space error" in their application is going to understand why…​

CodeNarc as the source of a leak?

The first suspicious piece of code that draw our attention was CodeNarc. CodeNarc is a source code quality analysis tool for Groovy, which is used by a lot of Groovy developers, including in Gradle itself (since Gradle intensively makes use of Groovy). CodeNarc can be seen as the equivalent of FindBugs for Groovy code. And problems seemed to start with an upgrade of the Gradle CodeNarc plugin to use CodeNarc 0.23. We actually saw reports like this one in the forums or this other one in GitHub but thought that the PermGen error was just a consequence of CodeNarc including more rules: rules are written in Groovy, so compiled down to classes and classes eat PermGen. So increasing the PermGen space was enough, and it did actually solve the error. Problem solved. Or not. The riddle only started for me with a seemingly insignificant question on our internal mailing lists: "Can some investigate why our build sometimes fails with a PermGen space error?", and I volunteered.

Interestingly, I had just finished pushing an upgrade of Gradle to Groovy 2.4.4 on our master branch, and I had noticed that I had to increase the PermGen space too. At first, I naively thought that it was also required because Groovy 2.4 consumed more memory, but I was wrong. I should have known, because before joining Gradle, I had actually worked on Groovy for Android, and a consequence of this work was that Groovy 2.4 had a reduced memory footprint: we generate less bytecode, which directly relates to a reduced PermGen space usage. So why on earth would Groovy 2.4 require more memory? And what is the relation with the CodeNarc plugin? Actually this plugin works in a Gradle version that uses Groovy 2.3.10, so why would there be a relation between the two?

In such a case, your best friend is a profiler. But as I will explain here, it can also lead you to wrong tracks. Be careful. The second best friend is the JVM options -XX:+TraceClassLoading and +XX:+TraceClassUnloading. I also used -XX:SoftRefLRUPolicyMSPerMB=0, an option that I had no idea it existed before my friend David Gageot told me. Basically, it will force the garbage collector to agressively collect all soft references, which is very useful to understand, in combination with the 2 other options, from which classloader we are leaking memory.

The first wrong track was actually thinking that CodeNarc was the source of the problem. I wrote an email on the gradle-dev list, explaining my findings, and I had indeed found a lot of classes from CodeNarc were not unloaded. Before I go further, let’s explain how the JVM is supposed to behave with regards to classes in Java 7. We all know that objects are garbage collected, but for a lot of people, classes are not. That’s why we have the PermGen space (which has disappeared in Java 8 but that’s another story): this segregated space of the JVM memory is used to store classes. And in Java, a class is loaded by a classloader. There is a strong reference between a class and its classloader. But what the JVM is able to do is actually simple: if there’s no instance of the class which is strongly reachable and that the classloader is neither strongly reachable, then both the class and the classloader can be unloaded. This means that PermGen can be recovered, and it is pretty useful, especially for a language like Groovy which can generate a lot of classes at runtime.

In Gradle, and particularily in the Gradle CodeNarc plugin, CodeNarc is executed through an Ant task, which spawns its own isolated classloader, containing both the CodeNarc and Groovy classes. So when the plugin execution is finished, if we do not keep track of the classloader, classes should be garbage collected. So a good candidate for the memory leak was actually the IsolatedAntBuilder that Gradle uses to execute the Ant task. And guess what? There is such a leak, because the DefaultIsolatedAntBuilder performs classloader caching! That was also discovered by my colleague Sterling, who immediately spot that: while we do cache the classloaders, keeping a strong reference on them, we don’t have any code to release the classloader in case of memory pressure. Conclusion, we’ve found the memory leak, hurray! And it has nothing to do with CodeNarc or Apache Groovy, pfiew!

So I immediately tried to disable the cache, which turned to be pretty trivial. Run the build again and…​ another PermGen space error. No CodeNarc classes unloaded, no Groovy classes unloaded. Wow. So the problem wasn’t solved, first "oh my!" moment of the week: there was another leak.

One test I did, then, is to totally comment out the code that, in the ant builder code, performed the definition of the CodeNarc task. Eventually, there was only that code left in the CodeNarc plugin:

    @TaskAction
    void run() {
        logging.captureStandardOutput(LogLevel.INFO)
        def classpath = new DefaultClassPath(getCodenarcClasspath())
        antBuilder.withClasspath(classpath.asFiles).execute {
            // ... thou shalt not leak!
        }
    }

I executed the code again, and there was definitely a leak: after several loops, a PermGen error occurred. CodeNarc was ruled out as the source of the leak. After some hours of trials, study of memory snapshots, I eventually came out with a piece of code that reproduces the problem independently of Gradle:

int i = 0;
try {
    while (true) {
        i++;
	URLClassLoader loader = new URLClassLoader(
            new URL[]{new File(GROOVY_JAR).toURI().toURL()},
            ClassLoader.getSystemClassLoader().getParent());
        Class system = loader.loadClass("groovy.lang.GroovySystem");
        system.getDeclaredMethod("getMetaClassRegistry").invoke(null);
        loader.close();
    }
} catch (OutOfMemoryError e) {
    System.err.println("Failed after " + i + " loadings");
}

As you can see, the code is very simple: it creates a new isolated classloader, which only contains Groovy on classpath. Then it invokes the creation of the Groovy runtime, by asking the metaclass registry, then it closes the classloader. On my JVM, after about 40 runs, the code fails with a PermGen space error: despite the fact that no class, no object is kept out of the classloader, the Groovy runtime is not unloaded, leading to a memory leak. The key point here is that I had noticed some oddities during my hours of debugging: a class, named ClassInfo, was at the center of those oddities.

ClassInfo leak

In particular, although YourKit (the profiler I was using) was telling me that all classes, all classloaders were weakly or softly reachable (btw, it’s really a pity that YourKit doesn’t show them separately, that is, weakly referenced objects from softly referenced ones), the classes were not garbage collected. Also strangely, some classes appeared as GC roots, meaning they could be collected, but they weren’t! And when I navigated through some of the duplicate classes I was seeing, ClassInfo was present, as a value of a map entry of class value. Here we are, I had found the real source of the leak. Something had changed in Groovy. And the fact that CodeNarc was leaking since 0.23 was just a side effect of upgrading its dependency to Groovy 2.4! So despite Gradle was using Groovy 2.3.10, CodeNarc, by default, was using a more recent version of Groovy. That doesn’t explain why it leaks yet, but we now know who is responsible: Groovy.

ClassValue, friend or foe?

Since Groovy 2.4, Groovy uses a new mechanism for storing its runtime metadata in case you run on JDK 7 or later: ClassValue. ClassValue allows storing information on the class level. Typically, a language like Groovy would use it to store the metaclass of a class. In practice, Groovy doesn’t directly store the metaclass here, but a higher level concept called ClassInfo, which in turns gives access to the metaclass of a class.

Before Groovy 2.4, all that information was stored directly in the ClassInfo class itself, through a private static field. ClassInfo is therefore the global entry point for accessing runtime information about a class. While Groovy 2.4 still uses ClassInfo as an entry point, there are 2 possible storage mechanisms, based on the underlying JDK. If ClassValue is available; which is the case for any JDK 7+, then it is used, otherwise we fallback on the old mechanism. ClassValue is supposed to be more efficient and more direct. Of course, in any case (old and new storage mechanism), it is memory sensitive: in case a class is not available anymore, its ClassInfo is removed. What ClassValue storage provides is basically the same as ThreadLocal, but at the Class level instead of the Thread level. Users are allowed to store information here, but one should be aware that as thread locals, if you start using it, you may face memory leaks if you don’t use it properly.

That’s for the theory, let’s see how in practice this change led to a giant memory leak in the Gradle build.

The theory is that ClassValue should behave like ThreadLocal. That is, the entries stored in the internal map of the Class class, should be garbage collected when the referent is not strongly referenced anymore. This behavior is however not what the JVM does. It was confirmed to me by Charles Nutter (JRuby) a few hours later: although we all expected the JVM to collect the unreachable, it does not.

Uh. Second "oh my!" moment of the week. I had now a candidate (Groovy) and a reason (ClassInfo leaking). However, it doesn’t explain by itself why the class loader is not garbage collected: if Groovy stores information on classes, it’s ok, as long as the classes to which it writes some metadata are from the classloader, or any child classloader, of the Groovy runtime itself. Everything would be self-contained, meaning we would have a graph of objects that do not leak outside of the isolated classloader. However…​ Groovy uses Strings, integers, List, …​ all coming from the system classloader. And that is the main difference with the old metadata storage mechanism: the old one only referenced classes from the system classloader. With ClassValue, we are modifying classes from the system classloader too! That is, the String class, for example, contains in its class value map, information from the Groovy runtime! The famous ClassInfo instance is present there! There we are! We leaked a ClassInfo instance into the system classloader! So what happens is that when we are done with our "isolated" Groovy runtime, we think it should unload because nothing references any object or class from that classloader. However, the Groovy runtime did update classes from the system classloader, and it started leaking into it! The ghost in the shell! Groovy is spoiling everywhere…​

So far so good, I had the explanation, I could write a workaround: let’s iterate over all those classes that Groovy updated, remove the ClassInfo, and we’re done. I wrote that code, and it turned out to be a bit ugly, but…​ it worked! Here is, for information, the cleanup code:

    static void removeClassFromGlobalClassSet(Class<?> classInfoClass) throws Exception {
        Field globalClassValueField = classInfoClass.getDeclaredField("globalClassValue");
        globalClassValueField.setAccessible(true);
        Object globalClassValue = globalClassValueField.get(null);
        Method removeFromGlobalClassValue = globalClassValueField.getType().getDeclaredMethod("remove", Class.class);
        removeFromGlobalClassValue.setAccessible(true);

        Field globalClassSetField = classInfoClass.getDeclaredField("globalClassSet");
        globalClassSetField.setAccessible(true);
        Object globalClassSet = globalClassSetField.get(null);
        globalClassSetField = globalClassSet.getClass().getDeclaredField("items");
        globalClassSetField.setAccessible(true);
        Object globalClassSetItems = globalClassSetField.get(globalClassSet);

        Field clazzField = classInfoClass.getDeclaredField("klazz");
        clazzField.setAccessible(true);


        Iterator it = (Iterator) globalClassSetItems.getClass().getDeclaredMethod("iterator").invoke(globalClassSetItems);

        while (it.hasNext()) {
            Object classInfo = it.next();
            Object clazz = clazzField.get(`ClassInfo`);
            removeFromGlobalClassValue.invoke(globalClassValue, clazz);
        }

    }

After executing that code, no ClassInfo instance was leaking anymore into the system classloader, and the runtime could be shutdown properly. The garbage collector did its job, and yay! I’m so happy, I’ll be able to sleep soon! That was tuesday night. And that night, I thought I had found the solution.

Memory sensitive classloader caching

So wednesday, I spent the day trying to implement the same strategy inside Gradle. More precisely, inside the IsolatedAntBuilder thing I told you. I implemented the code, launched my test again and hurray! it worked! My test passed! No more PermGen space error! So all I had to do, now, was to reactivate classloader caching, otherwise, we would loose a feature that is important performance wise.

So I reactivated the cache, and boom! That time, the Gradle build did not fail with a PermGen error, but with very strange errors like this one:

groovy.lang.MissingMethodException: No signature of method: java.lang.Integer.plus() is applicable for argument types: (java.lang.Integer) values: [0]
>  Possible solutions: plus(java.lang.String), plus(java.lang.Character), abs(), use([Ljava.lang.Object;), split(groovy.lang.Closure), minus(java.lang.Character)

Mmmmmm…​ 3rd "oh my!" moment of the week. I understood what I had just done. By clearing the ClassInfo stuff from the classloader, I had effectively shutdown the Groovy runtime that was initiated in that cached classloader. So when some code was trying to reuse the runtime from that cached classloader, since I had disabled it, it was failing! And there’s no option to reinitialize the Groovy runtime. It’s just not doable, because everything happens in static initializers (private final fields, …​). So unless the JVM had an option to allow to re-execute the static initializers of a class (and who knows what oddities it would lead to), I had no luck.

That’s about when I told my mates at Gradle "I think we have to choose between caching and leaking memory". But the night came, and I actually had an idea. I could implement a memory sensitive cache: by writing a smart cache structure with appropriate SoftReferences and reference queues, I would be able to execute the shutdown code only when I know that the GC is trying to reclaim memory. The idea is simple: we have a map, which key is a SoftRefence<String>, and the value is our cached classloader. The String represents the classpath that we are caching for the classloader.

Now imagine that the GC is out of memory. The semantics of SoftReference are clear: before throwing an OutOfMemoryError, the JVM will do its best and clear all soft references. Doing so, using a custom reference queue, we can be notified that the reference is cleared. Then, we can execute the Groovy runtime shutdown code, which will in turn make the ClassLoader collectible.

Honestly I was pretty happy with my implementation. I executed the code and it worked! Caching was working until the GC tried to reclaim memory, then I saw my shutdow code executed, memory reclaimed and green tests. Woooo!!! I had eventually knocked that memory leak down! Ha ha!

Then I remembered that my colleague Sterling had a test which involved a loop in an integration test. To make sure I had really fixed the leak, I asked him to tell me how he did that. The code was very simple, just involving a loop thanks to @Unroll in a Spock specification. I did it and…​ PermGen error showed up again. WAT?!

That was the fourth "oh my!" moment. And not the last one. I really then spent hours in modifying my caching code, refactored my code to add more complicated memory leak strategies, seeing that there were still thread locals, clearing them explicitly, adding a memory leak strategy for Ant itself, for the Java Beans introspector, …​ None of my attempts worked. In the end, it always failed. But there was always one mystery: I saw that the Groovy classes were unloading. But the Ant classes were not…​ And the rest, I should have discovered that much sooner. But when you have so many potential source leaks, that are much more evident, it’s so hard to figure out.

In particular, one thing would have made things much easier to discover. In YourKit, you can see that there are duplicate classes. Classes that have the same name, but come from different class loaders. However, there’s nothing that will show you those duplicates. You have to find them yourself. And in the end, when in the dump you see an instance of that class, all you can see is that it is an instance of ClassInfo. Nothing, visually, tells you that the instance of ClassInfo that you are seeing actually comes from a different classloader from the one you are seeing just next to it. A bit of color, for example, would help…​. And it would have helped me seeing that some ClassInfo elements that I was seeing in the classes from Ant didn’t come from the "disposable" Groovy runtime…​ No. They were coming from…​ the Gradle runtime itself!

Where it all ends

Ok, That was the last "oh my!" moment of the week. The one that killed all my hopes. And to understand the problem, I now have to explain to you how IsolatedAntBuilder works. It’s a very small, yet very smart and practical piece of code. Maybe too smart.

Gradle, as a core feature, lets you execute Ant tasks thanks to code inherited from the Groovy codebase itself: AntBuilder. It’s a very elegant piece of code, that lets you write things like:

task check << {
    ant.taskdef(name: 'pmd',
                classname: 'net.sourceforge.pmd.ant.PMDTask',
                classpath: configurations.pmd.asPath)
    ant.pmd(shortFilenames: 'true',
            failonruleviolation: 'true',
            rulesetfiles: file('pmd-rules.xml').toURI().toString()) {
        formatter(type: 'text', toConsole: 'true')
        fileset(dir: 'src')
    }
}

While this works, there’s actually a lot involved behind that. Including classloader magic. In particular, in the example above, we create a task definition in Ant, which uses a classpath defined in Gradle. "ant" here is a global object which is shared accross the build, but it is possible to avoid the classes from the Ant tasks to be mixed with the Gradle classpath itself by using antBuilder instead. That’s what the CodeNarc plugin does:

antBuilder.withClasspath(classpath.asFiles).execute {
}

means "Gradle, please, create an isolated classloader for me, that will contain the classpath only necessary for CodeNarc, and execute that Ant task with it". It seems very trivial, but there is a problem. The code that you see here is found in a Gradle script. It means that the "antBuilder" object that you are seeing here comes from Gradle. It is our IsolatedAntBuilder instance. When we call "withClasspath", a new instance of IsolatedAntBuilder will be created, with an isolated classloader corresponding to the supplied classpath. Then calling execute with a closure that lets you configure the ant task using the Groovy AntBuilder syntax.

So the "Closure" class that we are seeing here comes from Gradle. Then, we have a classloader which contains the Ant runtime, and a "bridge" class, written in Groovy, called "AntBuilderDelegate", which has one responsibility: when the code of the Ant builder is going to be executed, it is likely that the version of Groovy which will be found on classpath will be different from the one that Gradle uses. That is exactly what happens with CodeNarc: Gradle 2.6 uses Groovy 2.3.10, but the CodeNarc plugin executes with Groovy 2.4.1, so the Ant task works with a different "Closure" class than the one that Gradle has. We will really have two distinct "Closure" classes here, and "AntBuilderDelegate" is responsible for filling the gap: when the Ant configuration code, which will use AntBuilder from the Ant classpath, is going to be executed, it will be calling AntBuilderDelegate instead of directly the Closure code. And that code will intercept the missing methods in order to "reroute" them to the builder. You don’t have to understand that in detail, it’s not really the point here, but it is important to understand that this "AntBuilderDelegate" class is instantiated…​ in Gradle, using the Gradle classloader.

Now you may see it coming. I told you I had upgraded Gradle to use Groovy 2.4.4. So what does it mean? Gradle now uses Groovy with ClassValue. And what is the problem with ClassValue? All classes "touched" by Groovy will have them "polluted" with necessary metadata information. So when we create an instance of "AntBuilderDelegate", we’re doing that using the Groovy 2.4 runtime from Gradle, which comes with its own ClassInfo. And that delegate references and AntBuilder which is instantiated using the Ant classloader, with a different Groovy runtime, having its own ClassInfo. So what I had found earlier was that the ClassInfo from the Groovy "Ant" runtime was leaking into Gradle. But I hadn’t realized that the opposite was also true! By bridging the runtimes, we were leaking Groovy "Gradle" into the isolated classloader, through ClassValue!

So, what happened, is that the Groovy classes from the isolated classpath were garbage collected, because no ClassInfo from Gradle leaked into them. However, the Ant classes were touched. And they were NOT collectible then.

And this is were I stopped. Because if I found a way to "unload" ClassInfo from the isolated classpath and the touched classes from the system classloader, I haven’t found a way to do the same for the ClassInfo instances that leak into the Ant runtime…​ Of course I tried a "brute force" thing, by removing all ClassInfo from those classes, but as you understood, it’s a desperate attempt: it’s equivalent to shutting down the runtime. And then, it totally breaks subsequent calls in the Gradle build, we’ve just broken the Groovy runtime from Gradle…​

To add some confusion to the problem, I think it’s now a good time to explain that I actually simplified the isolated ant builder classloader hierarchy. There are actually (at least) 3 classloaders involved:

  • The classloader from Gradle, which loads the Gradle runtime, the IsolatedAntBuilder instance and also the AntBuilderDelegate instance

  • A classloader for the Ant runtime, which is isolated from Gradle, apart from logging classes necessary for Gradle to be able to capture the output. This classloader is per classpath, and is the one which is cached.

  • A classloader that is filtering some classes from the Gradle runtime classloader to make them available. This is what the bridging Ant builder uses. This classloader is shared among all isolated ant builder instances.

So when I say that something leaks, it can leak to any of those classloaders, and any of the parent loaders…​

That’s were I felt desperate. After a week fighting those memory leaks, and so many "ah ah, got you!" moments, I was in front of a wall. Basically, while in the first case (isolated ClassInfo leaking into Gradle), I know I can totally clean all the ClassInfo references because I know I can shutdown the Groovy runtime, in the second case (Gradle ClassInfo leaking), I basically have no idea that a class comes from the isolated classloader or not. So it’s hard to say if you should remove the ClassInfo or not. I am currently experimenting a brute force "try to determine if a class belongs in the class loader hierarchy", but it is weak (ah ah!) because I need to know about several potential class loader types.

What’s next?

So, what can we do next? One has to remember that fixing Groovy is not the ultimate solution because Gradle uses Groovy internally, but the various tasks can very well use a different Groovy version which is beyond our control.

  1. rollback Groovy in Gradle to use Groovy 2.3.10. It would avoid the Groovy classes from Gradle to leak into the Isolated classloaders, but is also unfortunate given the improvements that this version provides. Also, those who write Groovy applications for Android use Groovy 2.4+…​

  2. CodeNarc would still use Groovy 2.4+, we could downgrade it too. However, if people rely on features of Groovy 2.4+, they just have no choice, so we would still have the problem.

  3. Use smarter techniques like instrumentation to track the leakages of ClassInfo, record them, and revert when we’re done. It’s doable, but it’s a huge amount of work, and relying on instrumentation for Gradle would be very bad for performance.

  4. Update Gradle to use FilteringClassLoader everywhere, including in its main process, to prevent ClassValue to be found. This would work because without that class, Groovy wouldn’t use ClassValue to store the metadata and fall back to the old mechanism.

  5. Wait for a fix in Groovy. Jochen is already working on that, but we know that the old mechanism isn’t perfect either, and has memory leaks too. That was one of the reasons to migrate to ClassValue.

  6. Wait for a fix of the JVM. That’s beyond our control.

  7. Increase the PermGen space for builds that use the code quality plugins, which internally use AntBuilder. It’s what we do today. It works, but it’s just hiding the problem. And we have to explain to our users to do it too.

  8. Some smart people come with a smart solution, that’s why I wrote this too. During that week, I got help from lots of people, including Jochen "Groovy" Theodorou, Henri "EasyMock" Tremblay, David Gageot or Nikita Salnikov from Plumbr, thank you guys!

By the way, if you wonder, the same problem exist in JDK 8 too, it’s just not visible immediately because of the metaspace that appeared to replace the PermGen space.

Now, it was fun writing this "post-mortem", I hope it wasn’t too obscure, it helped me a lot because I had so many "got it" and "oh noes!" moments that I felt it was very interesting to share this story with you. And if you like technical challenges, do not forget that Gradle is hiring!

Update

After writing this post, I made great progress, and I did manage to get rid of the leak, but I also discovered that the leak happens with the various GroovyCompile tasks…​ Anyway you can follow my progress on this branch.