To OOME, or not to OOME

“Hello? Contegix?” “Good morning. We’ve observed an OutOfMemoryError on stylefeeder01. We’ve restarted the web application and it seems to be running fine.” “Thanks for taking care of this for us,” I’d say. At this point, I’d start digging through log files to find out what had happened.

We’ve certainly had our share of Java Garbage Collection fun here at StyleFeeder, from heap and PermGen exhaustion to memory leaks in libraries we’ve used; we’ve even been smacked over the head with the poor choice of NewRatio default for the x86-64 platform. But these new OOMEs we were seeing seemed especially odd. The (HotSpot 1.5.0_09) JVM would log a PermGen OOME (“java.lang.OutOfMemoryError: PermGen space”), then proceed like normal, running a Full GC shortly thereafter. The GC log would indicate that, yes, PermGen was temporarily full, but, no, there was no real lack of space: the Full GC had no trouble clearing up over half of the available PermGen space. It was almost as if the JVM was yelling for help before it even tried to clean up its own mess.

At this point, we were using the default server GC, the throughput parallel collector. The parallel collector has two types of cleanups: scavenges and Full GCs. When the young generation fills up, a scavenge is fired off, cleaning up the young generation of the heap. When the old generation or the permanent generation fills up, a Full GC is run, which cleans up all memory areas under GC control. Or, at least, that’s my understanding of what’s supposed to happen. In our case, when the permanent generation filled up, the GC would first fire off an OOME, then run a Full GC.
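
If you want to see these generations for yourself, the JVM exposes them through the standard java.lang.management API. Here is a minimal sketch that lists each memory pool and the pools each collector manages. The class name GcPoolReport is made up for illustration, and the code assumes Java 8 or later, where the permanent generation has been replaced by Metaspace, so the pool names will not match what a 1.5 JVM reports.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original setup.
public class GcPoolReport {

    // One line per memory pool: name, type, bytes used, and max bytes
    // (max is -1 when the JVM does not define a limit for the pool).
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            lines.add(String.format("%s (%s): used=%d max=%d",
                    pool.getName(), pool.getType(), usage.getUsed(), usage.getMax()));
        }
        return lines;
    }

    // One line per collector: name, collections so far, and the pools it manages.
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: collections=%d pools=[%s]",
                    gc.getName(), gc.getCollectionCount(),
                    String.join(", ", gc.getMemoryPoolNames())));
        }
        return lines;
    }

    public static void main(String[] args) {
        describePools().forEach(System.out::println);
        describeCollectors().forEach(System.out::println);
    }
}
```

The exact pool and collector names depend on the JVM version and the collector in use, so treat the output as a way to explore your own setup rather than something to match against ours.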

This smelled like a bug in the JVM. We weren’t using the latest version of HotSpot and were a little hesitant to upgrade, so we tried something different: the concurrent mark-sweep (CMS) garbage collector. We had been thinking about using CMS to avoid the occasional long pauses that are characteristic of the parallel collector. So, we gave it a whirl, making sure to turn on the options that would do as much permanent generation cleaning as possible:

-XX:+UseConcMarkSweepGC -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled
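
One easy mistake is to add flags like these to a startup script and never confirm that the running JVM actually received them. A rough sketch of such a check, using the standard RuntimeMXBean, might look like the following. The class name GcFlagCheck and the hasFlag helper are hypothetical. Note too that CMS was removed in JDK 14, so these flags only make sense on older JVMs like the one we were running.

```java
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original setup.
public class GcFlagCheck {

    // The three options from the command line above.
    static final List<String> EXPECTED = Arrays.asList(
            "-XX:+UseConcMarkSweepGC",
            "-XX:+CMSPermGenSweepingEnabled",
            "-XX:+CMSClassUnloadingEnabled");

    // True if the given JVM argument list contains the flag exactly as written.
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        for (String flag : EXPECTED) {
            System.out.println(flag + " -> " + (hasFlag(jvmArgs, flag) ? "present" : "missing"));
        }
    }
}
```

This only confirms that the flag was passed on the command line; it does not prove the JVM honored it, so the GC log is still the place to verify the collector's behavior.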

We waited… and waited… and waited… At this point, it’s been three weeks without an OOME, so I think we can safely declare the change a success.