Java Performance Optimization
Performance optimization strategies for Java (Java 25 era) including profiling, JVM tuning, memory management, and caching.
Overview
Performance optimization requires measuring first, then targeting actual bottlenecks rather than premature optimization. This guide covers Java-specific techniques including profiling tools, memory management, JVM tuning, caching strategies, and virtual threads for I/O-bound workloads. Focus optimization efforts on the critical path - the code that executes most frequently or takes the longest time.
The cardinal rule of performance optimization is: measure first, optimize second. Developers often have incorrect intuitions about where performance bottlenecks exist. Profiling reveals the truth - what code actually consumes CPU time and allocates memory. Without profiling data, optimization is guesswork that often makes code more complex without improving performance.
Performance has multiple dimensions: throughput (requests per second), latency (response time), and resource utilization (CPU, memory, I/O). Optimizing for one may hurt others - reducing latency might decrease throughput due to synchronization overhead. Understand your performance goals before optimizing. For user-facing services, P95 and P99 latency matter more than average latency, because tail latency determines user experience.
Always measure before optimizing. Use profilers to identify actual bottlenecks, not perceived ones. Target P95 and P99 latency, not just averages. See Performance Testing for load testing strategies.
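A quick sketch of why tail percentiles tell a different story than the mean (class name and numbers are illustrative):

```java
import java.util.Arrays;

// Nearest-rank percentile over recorded latencies: a handful of slow requests
// barely move the average but dominate what users at the tail experience.
public class LatencyPercentiles {
    // pct in (0, 100]; latencies in milliseconds
    static long percentile(long[] latencies, double pct) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // 95 fast requests (10 ms) and 5 slow outliers (500 ms)
        long[] latencies = new long[100];
        Arrays.fill(latencies, 0, 95, 10L);
        Arrays.fill(latencies, 95, 100, 500L);
        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.println("avg=" + avg);                        // 34.5 - looks acceptable
        System.out.println("p99=" + percentile(latencies, 99));  // 500 - what 1% of users see
    }
}
```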
Core Principles
- Measure first: Profile before optimizing
- Optimize the critical path: Focus on hot code paths
- Memory efficiency: Minimize allocations and GC pressure
- Caching: Cache expensive computations and data lookups
- Connection pooling: Reuse expensive resources
- Async processing: Use virtual threads or CompletableFuture for I/O
- JVM tuning: Configure GC and memory settings appropriately
Profiling Tools
Profiling identifies where your application spends time and allocates memory. Use profilers to find hot spots (frequently executed code) and memory leaks before attempting optimization. Profiling in production-like environments reveals real-world performance characteristics.
JFR (Java Flight Recorder)
JFR is a built-in profiling tool with minimal overhead (<1%), making it safe for production use. It records events (method calls, allocations, GC) over time, providing insights into application behavior. JDK Mission Control visualizes JFR data.
# Enable JFR recording for 60 seconds
java -XX:StartFlightRecording=duration=60s,filename=recording.jfr \
-jar payment-service.jar
# Analyze with JDK Mission Control GUI
jmc recording.jfr
JFR Configuration:
# application.yml for Spring Boot
management:
  jfr:
    enabled: true
  endpoints:
    web:
      exposure:
        include: jfr
Async Profiler
# Download from the async-profiler releases page
# (release assets are versioned, e.g. async-profiler-<version>-linux-x64.tar.gz;
# check the releases page for the exact asset name)
wget https://github.com/async-profiler/async-profiler/releases/latest/download/async-profiler-linux-x64.tar.gz
tar -xzf async-profiler-linux-x64.tar.gz
# Profile running application
./profiler.sh -d 60 -f flamegraph.html <pid>
VisualVM
# Launch VisualVM (standalone download since JDK 9 - no longer bundled with the JDK)
visualvm
# Connect to a running application to monitor CPU, memory, and threads, and take heap dumps
Memory Optimization
Object allocation is cheap in Java, but excessive allocation increases GC pressure. The JVM must pause to reclaim memory during garbage collection. Minimizing unnecessary allocations reduces GC frequency and pause times, improving throughput and latency.
Minimize Object Allocation
// OK as a single expression: since Java 9 (JEP 280) javac compiles this to an
// efficient invokedynamic call; the wasteful case is += concatenation in loops
public String formatPaymentDetails(Payment payment) {
    return "Payment: " + payment.getId() +
        ", Amount: " + payment.getAmount() +
        ", Status: " + payment.getStatus();
}
// GOOD: StringBuilder when building a string incrementally, e.g. across loop
// iterations, where repeated + would copy the accumulated string each time
public String formatPaymentDetails(Payment payment) {
    return new StringBuilder(100) // Pre-sized to avoid internal resizing
        .append("Payment: ").append(payment.getId())
        .append(", Amount: ").append(payment.getAmount())
        .append(", Status: ").append(payment.getStatus())
        .toString();
}
// BETTER: String.formatted (Java 15+) is the most readable option
public String formatPaymentDetails(Payment payment) {
    return "Payment: %s, Amount: %s, Status: %s"
        .formatted(payment.getId(), payment.getAmount(), payment.getStatus());
}
Reuse Objects
Object allocation in Java is fast (thanks to generational garbage collection and thread-local allocation buffers), but creating millions of short-lived objects puts pressure on the garbage collector. The GC must pause application threads to reclaim memory, affecting latency. Reusing immutable objects eliminates this overhead.
Many Java classes are designed to be reused. DateTimeFormatter and Pattern are immutable and thread-safe, so cache them as static final fields. MessageDigest, by contrast, is stateful and not thread-safe - reuse it per thread (for example via ThreadLocal) rather than sharing one instance. Creating these objects is expensive (parsing patterns, initializing state), so reuse provides significant performance benefits.
// BAD: Creates new DateTimeFormatter on every call - expensive parsing
public String formatTimestamp(LocalDateTime timestamp) {
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
return timestamp.format(formatter);
}
// GOOD: Reuse immutable formatter
public class PaymentFormatter {
private static final DateTimeFormatter TIMESTAMP_FORMATTER =
DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
public String formatTimestamp(LocalDateTime timestamp) {
return timestamp.format(TIMESTAMP_FORMATTER);
}
}
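The same principle applies to regex: Pattern.compile parses the expression into an internal matcher program, which is expensive, while the resulting Pattern is immutable and thread-safe. A small sketch (class name and regex are illustrative):

```java
import java.util.regex.Pattern;

public class PaymentIdValidator {
    // Compile once as a constant instead of calling Pattern.compile per request
    private static final Pattern PAYMENT_ID = Pattern.compile("PAY-\\d+");

    public static boolean isValidPaymentId(String id) {
        return PAYMENT_ID.matcher(id).matches();
    }
}
```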
Avoid Autoboxing in Loops
Autoboxing (converting primitives to wrapper objects) creates unnecessary objects. In tight loops, this generates significant garbage. Use primitive types directly to avoid allocation overhead.
// BAD: Autoboxing creates Integer objects on every iteration
public long sumPaymentAmounts(List<Payment> payments) {
Long total = 0L; // Wrapper type
for (Payment payment : payments) {
total += payment.getAmount().longValue(); // Unboxes Long, adds, boxes result
}
return total; // Creates millions of Long objects for large lists
}
// GOOD: Use primitive types to avoid boxing
public long sumPaymentAmounts(List<Payment> payments) {
long total = 0L; // Primitive type - no allocation
for (Payment payment : payments) {
total += payment.getAmount().longValue(); // No boxing
}
return total;
}
// BETTER: Use streams with primitive specializations
public long sumPaymentAmounts(List<Payment> payments) {
return payments.stream()
.mapToLong(p -> p.getAmount().longValue())
.sum();
}
Collection Performance
Choose the Right Collection
Different collection implementations have different performance characteristics. Choosing the wrong collection can degrade performance by orders of magnitude. Understand your access patterns - random access vs sequential, frequent insertions vs reads, need for ordering, and thread safety requirements.
Performance characteristics:
- ArrayList: O(1) random access, O(1) amortized append, O(n) insert/delete in middle
- LinkedList: O(1) insert/delete at ends, O(n) random access
- HashMap: O(1) average-case lookup/insert/delete
- TreeMap: O(log n) lookup/insert/delete, maintains sorted order
- ConcurrentHashMap: O(1) average case with lock-free reads, thread-safe
// GOOD: ArrayList for random access and iteration - most common choice
List<Payment> payments = new ArrayList<>();
// GOOD: ArrayDeque for queue operations - generally faster than LinkedList
Deque<Payment> paymentQueue = new ArrayDeque<>();
// GOOD: HashMap for key-value lookups - O(1) average case
Map<String, Payment> paymentMap = new HashMap<>();
// GOOD: HashSet for uniqueness checks - O(1) contains()
Set<String> processedIds = new HashSet<>();
// GOOD: TreeMap for sorted keys - O(log n) but maintains order
NavigableMap<LocalDateTime, Payment> paymentsByTime = new TreeMap<>();
// GOOD: ConcurrentHashMap for thread-safe access without external synchronization
Map<String, Payment> sharedCache = new ConcurrentHashMap<>();
Pre-size Collections
Collections dynamically resize when capacity is exceeded. ArrayList grows its backing array by about 50% each time, creating a new array and copying all elements - an O(n) operation. If you know the approximate size upfront, pre-sizing eliminates these expensive resize operations.
HashMap and HashSet resize when load factor (size / capacity) exceeds 0.75. Resizing involves creating a new internal array and rehashing all entries. Pre-sizing with capacity/0.75 ensures no resizing occurs during population. This is critical for large maps - resizing a million-entry map is noticeable.
// BAD: Default size (10), multiple resizes as it grows
List<Payment> payments = new ArrayList<>(); // Default capacity 10; grows ~1.5x: 10, 15, 22, 33, 49, ...
for (int i = 0; i < 1000; i++) {
payments.add(createPayment()); // Expensive copying on resize
}
// GOOD: Pre-size to avoid resizes - single allocation
List<Payment> payments = new ArrayList<>(1000);
for (int i = 0; i < 1000; i++) {
payments.add(createPayment()); // No resizing needed
}
// GOOD: Map pre-sizing (accounting for the 0.75 load factor threshold)
Map<String, Payment> payments = new HashMap<>((int) (1000 / 0.75) + 1);
// Holds 1000 entries without resizing; Java 19+ offers HashMap.newHashMap(1000)
Stream Performance
Streams provide elegant functional-style data processing, but they aren't always faster than loops. Streams have overhead (creating iterator, boxing primitives, lambda invocation) that can hurt performance on small datasets. Parallel streams add further overhead (splitting work, merging results, thread coordination) that only pays off for large datasets or CPU-intensive operations.
When to use parallel streams: CPU-bound operations on collections with 10,000+ elements where the operation on each element is non-trivial (e.g., complex calculations). When to avoid: Small collections, I/O-bound operations (use virtual threads instead), or when operations have side effects (parallel execution makes side effects unpredictable).
// GOOD: Parallel streams for CPU-intensive operations on large datasets
List<Payment> highValuePayments = payments.parallelStream()
    .filter(p -> p.getAmount().compareTo(new BigDecimal("10000")) > 0)
    .toList();
// Don't use parallel streams for small collections (coordination overhead > benefit)
List<Payment> result = smallList.stream() // Sequential is usually faster for small inputs
    .filter(predicate)
    .toList();
// Use primitive streams to avoid boxing
long totalAmount = payments.stream()
.mapToLong(p -> p.getAmount().longValue())
.sum();
// BAD: Inefficient collection with counting
long count = payments.stream()
.filter(p -> p.getStatus() == PaymentStatus.COMPLETED)
.collect(Collectors.toList())
.size();
// GOOD: Use count() directly
long count = payments.stream()
.filter(p -> p.getStatus() == PaymentStatus.COMPLETED)
.count();
Caching Strategies
Caching is one of the most effective performance optimizations for Java applications, storing frequently accessed data in memory to avoid expensive recomputation or database queries.
For comprehensive Java caching guidance including Caffeine configuration, Spring Cache abstraction (@Cacheable, @CacheEvict), Redis distributed caching, and cache invalidation patterns, see the Caching Guide.
Java-specific caching considerations:
- Caffeine is the recommended in-process cache for Java (superior to Guava Cache)
- Use Spring Cache abstraction for declarative caching with @Cacheable annotations
- Monitor cache hit rates via recordStats() to tune size limits and TTL values
- Consider memory impact - L1 caches consume JVM heap space
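Before reaching for Caffeine, the core pattern can be sketched with only the JDK: ConcurrentHashMap.computeIfAbsent runs the loader at most once per key even under concurrent access. (Class and loader are hypothetical; unlike Caffeine there is no eviction or TTL, so this only suits small, stable key sets.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ExchangeRateCache {
    private final Map<String, Double> cache = new ConcurrentHashMap<>();
    private final Function<String, Double> loader; // e.g., a remote rate lookup
    private int loads = 0; // crude statistic, analogous to Caffeine's recordStats()

    public ExchangeRateCache(Function<String, Double> loader) {
        this.loader = loader;
    }

    public double get(String currencyPair) {
        // The loader runs only on a cache miss; hits return the stored value
        return cache.computeIfAbsent(currencyPair, key -> {
            loads++;
            return loader.apply(key);
        });
    }

    public int loadCount() { return loads; }
}
```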
Database Performance
Connection Pooling (HikariCP)
spring:
  datasource:
    hikari:
      # Pool sizing
      maximum-pool-size: 20
      minimum-idle: 5
      # Timeouts
      connection-timeout: 30000        # 30 seconds
      idle-timeout: 600000             # 10 minutes
      max-lifetime: 1800000            # 30 minutes
      # Performance
      auto-commit: false
      read-only: false
      # Leak detection
      leak-detection-threshold: 60000  # 60 seconds
      # Health check - only set this for legacy drivers; JDBC4 drivers are
      # validated via Connection.isValid() and should omit it
      connection-test-query: SELECT 1
Query Optimization
// BAD: N+1 query problem
public List<Payment> getPaymentsWithCustomers() {
List<Payment> payments = paymentRepository.findAll();
// Each payment.getCustomer() triggers a separate query
payments.forEach(p -> System.out.println(p.getCustomer().getName()));
return payments;
}
// GOOD: Use JOIN FETCH to load in single query
@Query("""
SELECT p FROM Payment p
JOIN FETCH p.customer
WHERE p.status = :status
""")
List<Payment> findPaymentsWithCustomers(@Param("status") PaymentStatus status);
// GOOD: EntityGraph for complex fetching
@EntityGraph(attributePaths = {"customer", "transactions"})
List<Payment> findByStatus(PaymentStatus status);
Batch Operations
// BAD: Individual inserts
public void savePayments(List<Payment> payments) {
for (Payment payment : payments) {
repository.save(payment); // N queries
}
}
// GOOD: Batch insert
@Transactional
public void savePayments(List<Payment> payments) {
repository.saveAll(payments); // Batched
repository.flush();
}
Configure the Hibernate JDBC batch size in application.yml:
spring:
  jpa:
    properties:
      hibernate:
        jdbc:
          batch_size: 50
        order_inserts: true
        order_updates: true
Virtual Threads (Java 21+; prefer Java 25)
Virtual threads enable massive concurrency for I/O-bound workloads without the complexity of async programming. Unlike platform threads (expensive, limited to thousands), virtual threads are lightweight (millions possible) and managed by the JVM. When a virtual thread blocks on I/O, the carrier platform thread is freed to run other virtual threads. See Java Concurrency for comprehensive coverage.
Configuration
spring:
  threads:
    virtual:
      enabled: true  # Tomcat serves requests on virtual threads

server:
  tomcat:
    threads:
      max: 200  # Platform-thread limit still applies, but matters less with virtual threads
Usage for Blocking I/O
@Service
public class PaymentService {
// Virtual threads make blocking I/O cheap
public Payment processPayment(PaymentRequest request) {
// Call external payment gateway (blocking I/O)
var gatewayResult = paymentGateway.process(request);
// Call fraud service (blocking I/O) - doesn't block platform thread
var fraudCheck = fraudService.check(request);
// Save to database (blocking I/O)
return repository.save(Payment.fromGatewayResult(gatewayResult));
}
// Parallel execution with virtual threads
public PaymentResult processWithParallelChecks(PaymentRequest request) {
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
var gatewayFuture = executor.submit(() ->
paymentGateway.process(request));
var fraudFuture = executor.submit(() ->
fraudService.check(request));
var complianceFuture = executor.submit(() ->
complianceService.verify(request));
var gatewayResult = gatewayFuture.get();
var fraudResult = fraudFuture.get();
var complianceResult = complianceFuture.get();
return PaymentResult.from(gatewayResult, fraudResult, complianceResult);
} catch (Exception e) {
throw new PaymentProcessingException("Failed to process payment", e);
}
}
}
JVM Tuning
JVM performance depends heavily on heap size and garbage collector selection. The heap stores all Java objects; when full, GC pauses execution to reclaim memory. Proper tuning balances throughput (minimize GC overhead) against latency (minimize pause times).
Garbage collection is the biggest source of performance variability in Java applications. Different GC algorithms make different tradeoffs: throughput-oriented GCs (Parallel GC) maximize application time but have longer pauses, while latency-oriented GCs (G1, ZGC) minimize pause times at the cost of some throughput. Choose based on your latency requirements.
The heap is divided into young generation (short-lived objects) and old generation (long-lived objects). Most objects die young, so the GC focuses on the young generation with minor collections (fast, frequent). Objects that survive multiple minor collections are promoted to old generation. Major collections (collecting old generation) are expensive and should be infrequent.
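The generational split is visible at runtime through the standard management API - with G1 you typically see one bean for young collections and one for old ("G1 Young Generation", "G1 Old Generation"). A minimal sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Print each collector's cumulative collection count and total pause time
    public static void printGcStats() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Polling these counters (or exporting them via Micrometer) is a cheap way to confirm that major collections really are infrequent in production.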
Heap Size Configuration
Set -Xms (initial heap) equal to -Xmx (maximum heap) to avoid dynamic resizing overhead. Size heap based on application memory requirements plus headroom for allocation spikes.
# Production JVM settings: fixed heap size avoids resizing; G1 with a 200 ms
# pause target; string deduplication and parallel reference processing
# (comments must not follow a trailing backslash, or the continuation breaks)
java -Xms2G -Xmx2G \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+UseStringDeduplication \
  -XX:+ParallelRefProcEnabled \
  -jar payment-service.jar
G1GC Configuration (Recommended for Most Applications)
G1 (Garbage First) is a low-latency collector that divides the heap into regions and collects the regions with the most garbage first. It targets pause time goals while maintaining good throughput.
# G1 is the default collector since Java 9. MaxGCPauseMillis is a target, not
# a guarantee. Region size is normally auto-tuned; the concurrent marking cycle
# starts at 45% heap occupancy, with 10% reserved against promotion failures
# and the young generation bounded between 30% and 40% of the heap.
java -Xms4G -Xmx4G \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:G1HeapRegionSize=16M \
  -XX:InitiatingHeapOccupancyPercent=45 \
  -XX:G1ReservePercent=10 \
  -XX:G1NewSizePercent=30 \
  -XX:G1MaxNewSizePercent=40 \
  -jar payment-service.jar
ZGC Configuration (For Low-Latency Requirements)
ZGC (Z Garbage Collector) is an ultra-low-latency collector targeting sub-millisecond pause times regardless of heap size. It is generational by default since JDK 23 (the non-generational mode was removed in JDK 24) and performs most work concurrently with application threads. Use ZGC when pause time matters more than raw throughput.
# ZGC schedules its own collections, and compressed oops are not supported
# with ZGC, so the main tuning knobs are simply -Xms/-Xmx
java -Xms8G -Xmx8G \
  -XX:+UseZGC \
  -jar payment-service.jar
GC Logging
java -Xms2G -Xmx2G \
-XX:+UseG1GC \
-Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=10,filesize=100M \
-jar payment-service.jar
Lazy Initialization
Lazy initialization defers expensive object creation until the object is actually needed. This improves startup time and reduces memory usage when the object might not be used in every code path. However, it adds complexity (thread safety, null checks) and can make performance unpredictable (first call is slow).
Use lazy initialization for truly expensive resources that aren't always needed. For resources used in every code path, eager initialization is simpler and has predictable performance characteristics.
Lazy Loading with Supplier
The Supplier functional interface (Java 8) provides a clean way to implement lazy initialization. Suppliers.memoize() from Guava wraps a supplier to cache the result after first invocation, making subsequent calls fast. This is thread-safe and handles the common "compute once, use many times" pattern.
public class PaymentProcessor {
    // BAD: Eager initialization of an expensive resource that might never be used
    private final ExpensiveResource eagerResource = new ExpensiveResource();

    // GOOD: Lazy initialization - Suppliers.memoize (from Guava) caches the
    // first result thread-safely, so the resource is created at most once
    private final Supplier<ExpensiveResource> resourceSupplier =
        Suppliers.memoize(ExpensiveResource::new);

    public void process(Payment payment) {
        if (payment.requiresSpecialProcessing()) {
            ExpensiveResource resource = resourceSupplier.get();
            resource.process(payment);
        }
    }
}
Lazy Singleton Pattern
The "initialization-on-demand holder" pattern provides thread-safe lazy initialization without synchronization overhead. It exploits the Java class loading mechanism - the inner Holder class isn't loaded until getInstance() is first called. Class loading is inherently thread-safe (guaranteed by the JVM), so no explicit synchronization is needed.
This pattern is superior to double-checked locking (which is subtle and error-prone despite being "fixed" in Java 5) and eager initialization (which initializes even if never used). It's the recommended pattern for lazy singletons in Java.
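For contrast, the double-checked locking idiom mentioned above looks like this - correct only because of the volatile keyword, which is easy to forget (a sketch for comparison, not a recommendation):

```java
public class DoubleCheckedService {
    // volatile is required: without it, another thread may observe a
    // partially constructed instance due to instruction reordering
    private static volatile DoubleCheckedService instance;

    private DoubleCheckedService() { }

    public static DoubleCheckedService getInstance() {
        DoubleCheckedService result = instance;
        if (result == null) {                      // first check, no lock
            synchronized (DoubleCheckedService.class) {
                result = instance;
                if (result == null) {              // second check, under lock
                    instance = result = new DoubleCheckedService();
                }
            }
        }
        return result;
    }
}
```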
public class ExpensiveService {
private ExpensiveService() {
// Expensive initialization (e.g., loading large lookup tables)
}
// Thread-safe lazy initialization via class loading semantics
private static class Holder {
// JVM guarantees this is initialized exactly once, thread-safely
static final ExpensiveService INSTANCE = new ExpensiveService();
}
public static ExpensiveService getInstance() {
// First call loads Holder class, initializing INSTANCE
// Subsequent calls just return the cached instance
return Holder.INSTANCE;
}
}
Performance Monitoring
Micrometer Metrics
@Service
public class PaymentService {
    private final MeterRegistry meterRegistry;
    private final Timer paymentTimer;

    public PaymentService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
        this.paymentTimer = Timer.builder("payment.processing.time")
            .description("Payment processing duration")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(meterRegistry);
    }

    public Payment createPayment(PaymentRequest request) {
        return paymentTimer.record(() -> {
            var payment = processPayment(request);
            meterRegistry.counter("payment.created",
                "status", payment.getStatus().name()).increment();
            // Use a distribution summary for per-event values; a gauge tracks a
            // single current value and is the wrong tool for per-payment amounts
            meterRegistry.summary("payment.amount",
                    "currency", payment.getCurrency())
                .record(payment.getAmount().doubleValue());
            return payment;
        });
    }
}
Custom Performance Annotations
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface Measured {
String value();
}
@Aspect
@Component
@RequiredArgsConstructor  // Lombok generates the MeterRegistry constructor
public class PerformanceMeasurementAspect {
    private final MeterRegistry meterRegistry;

    @Around("@annotation(measured)")
    public Object measurePerformance(ProceedingJoinPoint joinPoint, Measured measured)
            throws Throwable {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            return joinPoint.proceed();
        } finally {
            sample.stop(Timer.builder(measured.value())
                .tag("class", joinPoint.getTarget().getClass().getSimpleName())
                .tag("method", joinPoint.getSignature().getName())
                .register(meterRegistry));
        }
    }
}
// Usage
@Service
public class PaymentService {
    @Measured("payment.create")
    public Payment createPayment(PaymentRequest request) {
        return processPayment(request); // Timed automatically by the aspect
    }
}
Performance Testing
JMH Benchmarking
build.gradle:
plugins {
    id 'me.champeau.jmh' version '0.7.2'
}

dependencies {
    jmh 'org.openjdk.jmh:jmh-core:1.37'
    jmh 'org.openjdk.jmh:jmh-generator-annprocess:1.37'
}
Benchmark:
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(1)
public class PaymentFormatterBenchmark {
private Payment payment;
@Setup
public void setup() {
payment = Payment.builder()
.id("PAY-123")
.amount(new BigDecimal("100.00"))
.currency("USD")
.build();
}
@Benchmark
public String formatWithConcatenation() {
return "Payment: " + payment.getId() +
", Amount: " + payment.getAmount();
}
@Benchmark
public String formatWithStringBuilder() {
return new StringBuilder()
.append("Payment: ").append(payment.getId())
.append(", Amount: ").append(payment.getAmount())
.toString();
}
@Benchmark
public String formatWithStringFormat() {
return String.format("Payment: %s, Amount: %s",
payment.getId(), payment.getAmount());
}
}
Run with:
./gradlew jmh
Further Reading
General Performance Concepts:
- Performance Overview - Performance strategy and principles
- Performance Optimization - Cross-language optimization techniques
- Performance Testing - Load testing strategies
Java-Specific:
- Java General - Java language best practices
- Java Concurrency - Virtual threads and concurrency patterns
- Spring Boot Observability - Monitoring and metrics
Summary
Key Takeaways:
- Profile first: Use JFR, Async Profiler, or VisualVM to identify bottlenecks
- Minimize allocations: Reuse objects, avoid autoboxing, pre-size collections
- Choose right collections: ArrayList for lists, HashMap for maps, HashSet for sets
- Cache strategically: Use Caffeine for local caching with TTL and size limits
- Connection pooling: Configure HikariCP with appropriate pool size
- Virtual threads: Use for blocking I/O workloads (Java 21+; prefer Java 25)
- JVM tuning: Configure heap size and GC appropriately (G1GC or ZGC)
- Batch operations: Use batch inserts/updates to reduce database round trips
- Monitor metrics: Track P95/P99 latency, throughput, and cache hit rates
- Benchmark changes: Use JMH to validate performance improvements