Channel: Tech – James' World

Solving Java GC Pause Outages in Production


Just thinking about how to configure HAProxy with two backend Java servers to be HA.

Java programs pause for garbage collection; these are known as “GC pauses.”

The description “Stop the World” (STW) illustrates their true severity – they are a slow-motion train wreck for incoming requests.

If you’re new to this topic, please read:

Willy: “I work with people who use a lot of Java applications, and I’ve seen them spend as much time on tuning the JVM as they spend writing the code, and the result is really worth it.” Anybody have some extra time? 😐

My operational requirements for Java in production are:

  1. understand GC pause activity for my application servers
  2. control GC pause activity to a reasonable and bounded extent
  3. configure HAProxy load balancer to not send requests to servers undergoing GC pauses (i.e. don’t lose requests)
  4. use an affordable amount of RAM to accomplish the above, preferably 8 or 16 GB in a shared VM environment.

1. Understand GC pause activity for my application servers

Detailed GC logging can be enabled with the following flags (Java 8 and earlier; Java 9 replaced them with unified logging, -Xlog:gc*):

-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

and you can specify a separate GC log with:

-verbose:gc -Xloggc:/tmp/gc.log

See “Understanding Garbage Collection Logs.”
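Besides reading the log, the same counters can be polled from inside the JVM. The following is a minimal sketch, assuming the standard GarbageCollectorMXBean API; the class name GcStats and its methods are made up for illustration. Note that the reported time is accumulated collection time, which for concurrent collectors is not necessarily the same as stop-the-world pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper; the name GcStats is not from any library.
class GcStats {
    // Sums the approximate accumulated collection time (ms) over all collectors.
    // getCollectionTime() returns -1 when undefined, so non-positive values are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) total += t;
        }
        return total;
    }

    // Sums the number of collections over all collectors (-1 means undefined).
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount();
            if (c > 0) total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // One line per collector, e.g. a young and an old generation collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.printf("total: count=%d time=%dms%n", totalGcCount(), totalGcMillis());
    }
}
```

Sampling these totals on a schedule and diffing successive values gives a rough picture of how often and how long the collectors run.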

2. Control GC pause activity to a reasonable and bounded extent

One of the biggest challenges is to control the frequency and duration of GC pauses …

Some configuration approaches:

  • set heap size and compaction percent only somewhat above what the application needs. That makes GCs more frequent but also faster, or the opposite …
  • set heap size to a large amount and compaction to 100%, then trigger GC after hours
  • investigate alternate JVMs.

An example of some of the tuning options:

java -Xms512m -Xmx1152m -XX:MaxPermSize=256m -XX:MaxNewSize=256m MyClass

JRockit JVM: Tuning For a Small Memory Footprint
Tuning Java Virtual Machines (JVMs)
Weblogic Tuning JVM Garbage Collection for Production Deployments

Some programming approaches:

  • use streaming file IO with Files.lines() instead of reading into a String or HashMap, or use memory-mapped files
  • rewrite portions of your application to build strings with StringBuilder (or StringBuffer when thread safety is needed) instead of repeated String concatenation
  • reduce object copies – if you do not have a problem with thread safety, then you don’t need immutable objects
  • call the dispose() method when available, such as on the SWT Image class
  • for HashMaps, call clear() to re-use the map later, but set the reference to null to make it eligible for GC
  • split the Java server into real-time and batch servers where possible, with appropriate heap sizes.
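As an illustration of the first point above, here is a minimal sketch of streaming a file with Files.lines() so that only one line at a time needs to be live on the heap. The class name StreamingCount, the method names, and the sample data are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

// Hypothetical example class; not from the original post.
class StreamingCount {
    // Counts lines containing `needle` without loading the whole file into memory.
    static long countMatching(Path file, String needle) throws IOException {
        // try-with-resources closes the underlying file handle when the stream is done.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.filter(line -> line.contains(needle)).count();
        }
    }

    // Small self-contained demo: writes four lines to a temp file and counts the errors.
    static long demo() {
        try {
            Path tmp = Files.createTempFile("demo", ".log");
            try {
                Files.write(tmp, Arrays.asList("ok", "ERROR a", "ok", "ERROR b"));
                return countMatching(tmp, "ERROR");
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // two of the four sample lines contain "ERROR"
    }
}
```

The alternative, Files.readAllLines() or reading into one String, keeps the entire file reachable until processing finishes, which is exactly the kind of large short-lived allocation that drives GC activity.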

3. Configure HAProxy load balancer to not send requests to servers undergoing GC pauses

This is tricky for several reasons:

  • health checks can be passive or active. Both have check gaps that won’t notice a GC starting before a request is sent
  • even if GC notifications are enabled and the server health check is red, HAProxy will not know (see above)
  • even if GC notifications are enabled and the server health check is now green, HAProxy will not know (see above) :)
  • the HAProxy options log-health-checks and redispatch may be helpful

a) I think the only 100% reliable way is to coordinate from the HAProxy side:

  1. understand your GC pattern
  2. use HAProxy socket interface to drain, then disable one backend
  3. wait for zero connections
  4. force a GC (easier said than done in Oracle Java since System.gc() is only a request for GC), or restart the Java server
  5. use HAProxy socket interface to enable the Java server.

This method would be risky with two Java servers, since during maintenance on one server, the other could GC pause. (facepalm)
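Steps 2 and 5 of the list above could be sketched roughly as follows. This assumes the HAProxy stats socket is exposed over TCP at admin level (for example bound to a loopback address) and that your HAProxy version supports the "set server <backend>/<server> state ..." runtime command. The class name HaproxyAdmin, the backend name be_java, the server name app1, and the port are placeholders, not values from the original post.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for talking to the HAProxy stats socket over TCP.
class HaproxyAdmin {
    // Builds a runtime command such as "set server be_java/app1 state drain".
    // `state` is expected to be one of: ready, drain, maint.
    static String stateCommand(String backend, String server, String state) {
        return "set server " + backend + "/" + server + " state " + state + "\n";
    }

    // Sends one command and returns whatever HAProxy answers before closing the connection.
    static String send(String host, int port, String command) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(2000); // do not hang forever if the socket stays silent
            OutputStream out = socket.getOutputStream();
            out.write(command.getBytes(StandardCharsets.US_ASCII));
            out.flush();

            ByteArrayOutputStream response = new ByteArrayOutputStream();
            InputStream in = socket.getInputStream();
            byte[] chunk = new byte[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                response.write(chunk, 0, n);
            }
            return response.toString("US-ASCII");
        }
    }
}
```

A maintenance script would then call something like HaproxyAdmin.send("127.0.0.1", 9999, HaproxyAdmin.stateCommand("be_java", "app1", "drain")), wait for connections to reach zero, force the GC or restart, and finish with the same call using "ready".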

b) Another possible approach would be to handle MEMORY_THRESHOLD_EXCEEDED notifications, which are tied to MemoryPoolMXBean usage thresholds. If they reliably gave advance notice, they could be used to update the health check on the server side, send a drain request to the HAProxy socket, and then force a GC right away, perhaps by trying the JVM Tool Interface (JVMTI) ForceGarbageCollection() function?
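A minimal sketch of idea (b), assuming the standard java.lang.management API: arm a usage threshold on each heap pool that supports one, then listen on the MemoryMXBean for the threshold notification. The class name HeapThresholdWatcher, the 0.9 fraction, and the action passed in are illustrative assumptions; whether the notification arrives early enough to drain before a pause is exactly the open question raised above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Hypothetical helper; not from the original post.
class HeapThresholdWatcher {
    // Sets a usage threshold at `fraction` of max on every heap pool that supports it.
    // Returns the number of pools that were armed.
    static int armThresholds(double fraction) {
        if (fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1]");
        }
        int armed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP || !pool.isUsageThresholdSupported()) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null || usage.getMax() <= 0) {
                continue; // pool invalid or max undefined (-1)
            }
            pool.setUsageThreshold((long) (usage.getMax() * fraction));
            armed++;
        }
        return armed;
    }

    // Runs `action` whenever any armed pool crosses its usage threshold.
    static void onThresholdExceeded(Runnable action) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        NotificationListener listener = (notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notification.getType())) {
                action.run(); // e.g. flip the health check to red and ask HAProxy to drain
            }
        };
        ((NotificationEmitter) memory).addNotificationListener(listener, null, null);
    }

    public static void main(String[] args) {
        System.out.println("pools armed: " + armThresholds(0.9));
        onThresholdExceeded(() -> System.out.println("heap usage threshold exceeded"));
    }
}
```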

c) And another idea is to write a sentinel file every 250 ms, and if its age reaches 750 ms, assume a GC is happening and drain the server in HAProxy. Unfortunately the JVMTI events GarbageCollectionStart and GarbageCollectionFinish are sent while the VM is stopped, so you’re limited in what you can do when you need the most flexibility.
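Idea (c) might look something like the sketch below. The writer thread runs inside the Java server, so it stalls whenever the JVM is paused; the staleness check is meant to run in a separate watcher process. The class name Sentinel and the choice to store the timestamp as the file content (rather than relying on file modification time) are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sentinel-file heartbeat; not from the original post.
class Sentinel {
    static final long BEAT_MILLIS = 250;  // how often the server writes the file
    static final long STALE_MILLIS = 750; // age at which a GC pause is assumed

    // True when the last heartbeat is at least STALE_MILLIS old.
    static boolean isStale(long lastBeatMillis, long nowMillis) {
        return nowMillis - lastBeatMillis >= STALE_MILLIS;
    }

    // Server side: overwrite the file with the current wall-clock time.
    // Note: the write is not atomic, so a reader could catch a partially written file.
    static void beat(Path file) throws IOException {
        byte[] now = Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.US_ASCII);
        Files.write(file, now);
    }

    // Watcher side: read the timestamp written by the server.
    static long lastBeat(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.US_ASCII);
        return Long.parseLong(text.trim());
    }

    // Server side: daemon thread that beats every BEAT_MILLIS until interrupted.
    static Thread startWriter(Path file) {
        Thread writer = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    beat(file);
                    Thread.sleep(BEAT_MILLIS);
                }
            } catch (IOException | InterruptedException e) {
                // Stop beating; the watcher will then see the file go stale.
            }
        }, "gc-sentinel");
        writer.setDaemon(true);
        writer.start();
        return writer;
    }
}
```

The watcher would poll lastBeat(), call isStale(lastBeat, System.currentTimeMillis()), and issue the drain when it returns true. By then up to 750 ms of requests may already have been routed to the paused server, which is the inherent weakness of this approach.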

Some Java 8 Classes related to GC notifications:

  1. MemoryPoolMXBean – “The memory usage monitoring mechanism is intended for load-balancing or workload distribution use. For example, an application would stop receiving any new workload when its memory usage exceeds a certain threshold. It is not intended for an application to detect and recover from a low memory condition.”
  2. GarbageCollectionNotificationInfo
  3. GarbageCollectorMXBean
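To show how the last two classes fit together, here is a minimal sketch that subscribes to GC notifications. It assumes a HotSpot-based JVM where com.sun.management.GarbageCollectionNotificationInfo is available; the class name GcListener is hypothetical. The notification is delivered after a collection has completed, so it reports on a pause rather than warning of one.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

// Hypothetical example class; not from the original post.
class GcListener {
    // Registers one listener on every collector that can emit notifications.
    // Returns the number of collectors the listener was attached to.
    static int register() {
        NotificationListener listener = (notification, handback) -> {
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return;
            }
            GarbageCollectionNotificationInfo info =
                    GarbageCollectionNotificationInfo.from((CompositeData) notification.getUserData());
            // Collector name, cause (e.g. an allocation failure or System.gc()), and duration in ms.
            System.out.printf("%s (%s): %d ms%n",
                    info.getGcName(), info.getGcCause(), info.getGcInfo().getDuration());
        };

        int registered = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (gc instanceof NotificationEmitter) {
                ((NotificationEmitter) gc).addNotificationListener(listener, null, null);
                registered++;
            }
        }
        return registered;
    }

    public static void main(String[] args) {
        System.out.println("listening on " + register() + " collector(s)");
        System.gc(); // only a request, but usually enough to trigger a notification for a demo
    }
}
```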

Also, investigate mod_jk and AJP. Tomcat uses the same heap as your application, so tuning is very important here too.

4. Use an affordable amount of RAM to accomplish the above, preferably 8 or 16 GB in a shared VM environment

If you work in a VM consolidation environment, it’s important to minimize the footprint of your applications. Requesting an entire server to run a bloated app just isn’t going to cut it. See above for rewriting applications to minimize heap and GCs.

Garbage Collection JMX Notifications Example Code
Blade: A Data Center Garbage Collector
How to Tame Java GC Pauses? Surviving 16GiB Heap and Greater
SO: Garbage Collection Notifications
Letting the Garbage Collector Do Callbacks
HAProxyController.java
How to force garbage collection in Java?
SSL Termination, Load Balancers & Java
Github: Measuring Java Memory Consumption – sample code
Java is not “angry” with you.
Set State to DRAIN vs set weight 0
Scalable web applications [with Java]
Examples of forcing freeing of native memory direct ByteBuffer has allocated, using sun.misc.Unsafe?
Lucene ByteBuffer sample code
Improve availability in Java enterprise applications
The Four Month Bug: JVM statistics cause garbage collection pauses
Memory management when failure is not an option

Cassandra-related

CASSANDRA-5345: Potential problem with GarbageCollectorMXBean

