How to save memory occupied by Java strings

So far we have discussed the memory usage of Java strings, including the memory usage of StringBuffer (and StringBuilder). We also mentioned that in many applications, Strings can account for a significant proportion of a program's memory usage. On this page, we'll look at some ways to reduce that memory usage. Before we do so, it's worth saying from the outset:

The techniques shown here are generally recommended only when you have determined that memory usage of strings is causing you a problem. It usually isn't worth prematurely optimising memory usage of Strings.

That said, if you decide you do need to optimise your program's use of strings, then which method is more effective depends to some extent on whether you actually need an instance of String. If you only need the operations defined by the CharSequence interface (length(), charAt(), subSequence() and toString()), then any old CharSequence will do and you don't actually need a String.

We'll see below that if you can write your operations to work with any CharSequence rather than specifically a String, then you can potentially save memory.
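As a minimal sketch of what "working with any CharSequence" can look like, the helper below counts occurrences of a character using only length() and charAt(). The class and method names are purely illustrative and not part of any library.

```java
// Hypothetical helper: accepts String, StringBuilder, CharBuffer or any other CharSequence.
final class CharSequenceUtils {
    private CharSequenceUtils() {}

    /** Counts how often c occurs in text, using only length() and charAt(). */
    static int countOccurrences(CharSequence text, char c) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == c) {
                count++;
            }
        }
        return count;
    }
}
```

Because the parameter type is CharSequence, a caller can pass a more compact representation without first building a String.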

If you don't need a String

If you don't actually need to store your string as a Java String object, then other alternatives may have a lower memory footprint.

See our example of a one-byte-per-character CharSequence implementation on the next page.
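To give a rough idea of the technique, here is a sketch of a byte-backed CharSequence. It is not the implementation from the next page: the class name Latin1CharSequence is hypothetical, and the sketch assumes every character fits in ISO-8859-1 (code points up to 0xFF); anything else would be replaced by '?' when encoded.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: stores one byte per character, assuming ISO-8859-1 content only.
final class Latin1CharSequence implements CharSequence {
    private final byte[] data;
    private final int offset;
    private final int length;

    Latin1CharSequence(String s) {
        // Characters outside ISO-8859-1 are encoded as '?' by getBytes()
        this.data = s.getBytes(StandardCharsets.ISO_8859_1);
        this.offset = 0;
        this.length = data.length;
    }

    private Latin1CharSequence(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new StringIndexOutOfBoundsException(index);
        }
        // Mask to treat the byte as unsigned before widening to char
        return (char) (data[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new StringIndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        // Shares the underlying byte array rather than copying it
        return new Latin1CharSequence(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }
}
```

Note that on Java 9 and later, the JVM's compact strings feature already stores Latin-1-only Strings at one byte per character, so a class like this is mainly of interest on older runtimes.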

If you really need instances of String

The first thing to remember is that if you can find a way to store your string in another object then you can always convert to a String "on the fly" before calling a method that needs it. Any CharSequence, including our CompactCharSequence class, provides a toString() method that performs this conversion.
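As a small illustration of converting "on the fly", the sketch below keeps the value in whatever CharSequence the caller stores and only creates a String at the moment an API demands one. The class and method names are hypothetical; Integer.parseInt(String) simply stands in for any method that insists on a String.

```java
// Hypothetical example: the stored form can be any CharSequence.
final class OnTheFlyConversion {
    private OnTheFlyConversion() {}

    static int parsePort(CharSequence stored) {
        // The temporary String becomes garbage as soon as parseInt() returns
        return Integer.parseInt(stored.toString());
    }
}
```

The temporary String is short-lived, so the long-term footprint is that of the compact representation rather than of the String.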

If you really need your strings as Strings, then you may still be able to save some memory by canonicalising them.

String Canonicalisation

Canonicalisation is the technique of ensuring that there is only one actual object with a given content. Every time we need an object with that content, we create a reference to the single object. For example, if we have several thousand database entities in memory, each with a string containing either "FULL_TIME_EMPLOYEE" or "PART_TIME_EMPLOYEE", we'd like to hold several thousand references to one of two String objects, rather than several thousand String objects.

In other words, we'd like to keep a pool of strings. Every time we need to store a string with a given content, we first check to see if there's already a string with that content in our pool. If there isn't, we add it. Otherwise, we use a reference to the string from the pool instead of holding on to the new one. A simple (but not necessarily flexible) way to do this is via the String.intern() method. Another is to explicitly use something such as a HashMap.

Canonicalisation with String.intern()

The running JVM actually already contains a string pool, which it uses for any literal string hardcoded in classes and other special purposes such as class and method names etc. You can ask the VM to pool any given string by calling the intern() method on that string. So, say, when reading the EmployeeStatus field from our database, we can do something such as the following:

ResultSet rs = retrieveEmployeesFromDatabase();
while (rs.next()) {
  int employeeId = rs.getInt(1);
  String employeeStatus = rs.getString(2);
  employeeStatus = employeeStatus.intern();
  ... construct entity ...
}

Using the intern() method gives us the slight performance advantage that we can compare interned strings using the == operator. For example, the following would produce the expected results:

if (employee.status == "FULL_TIME_EMPLOYEE") {
  ...
}
if (employee1.status == employee2.status) {
  ...
}

In actual fact, I would still recommend using equals(). That way, you're less likely to get a bug from forgetting to intern(). And since the first operation of String.equals() is to compare the actual object references for equality anyway, the performance gain from avoiding the method call (bearing in mind that the JIT compiler could inline it) will be negligible.
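The sketch below shows the kind of bug that forgetting to intern() can cause. The class and method names are illustrative only; new String(...) is used to stand in for a value that arrived from the database as a fresh object.

```java
final class InternComparisonDemo {
    /** Returns a new String object with the same content, never the pooled instance. */
    static String freshCopy(String s) {
        return new String(s);
    }

    public static void main(String[] args) {
        // Stands in for a status read from the database without calling intern()
        String status = freshCopy("FULL_TIME_EMPLOYEE");

        System.out.println(status == "FULL_TIME_EMPLOYEE");          // false: different objects
        System.out.println(status.equals("FULL_TIME_EMPLOYEE"));     // true: same content
        System.out.println(status.intern() == "FULL_TIME_EMPLOYEE"); // true: both are the pooled instance
    }
}
```

The == comparison only works if every string on both sides has been interned, whereas equals() gives the right answer either way.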

The principal problem with this approach is that there is no way to de-intern a String once it has been interned. Thus, using intern() is liable to cause a memory leak if you are not careful. We also have no real way to query the current status of the JVM's internal string pool and find out how full it is— the first we'll know is probably when we get an OutOfMemoryError calling intern() (or worse, at some random point thereafter...).

In general, you should never call intern() on user-generated strings, as your application will be susceptible to an attack whereby the user deliberately sends a large number of different strings to your application and fills up the memory with pooled strings that will never be removed.

So another option is to "roll our own" string pool.

Canonicalisation with an explicit collection (HashMap etc)

A more controllable approach is to use an explicit Java collection to implement your own "string pool". A typical way to do this is to create a map that maps each string to itself. In concurrent environments, a good choice is ConcurrentHashMap. The essential idea is that we create a string pool as follows:

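import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StringPool {
  // Maps each pooled string to itself, so a lookup returns the canonical instance
  private final ConcurrentMap<String,String> map =
    new ConcurrentHashMap<String,String>(1000);

  public String getCanonicalVersion(String str) {
    // Crude size limit: empty the pool rather than let it grow without bound
    if (map.size() > 10000) {
      map.clear();
    }
    // putIfAbsent() returns null if str was newly added, else the existing entry
    String canon = map.putIfAbsent(str, str);
    return (canon == null) ? str : canon;
  }
}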

Recall that the putIfAbsent() method returns null if the mapping was added (because it wasn't there before), in which case we return str as it was the first instance of that string. Otherwise, putIfAbsent() returns the previous value associated with the given string— the "canonicalised" version added on a previous call to getCanonicalVersion()— and we return that.

Now, with an instance of StringPool in hand, we replace our intern() call from the code above with:

employeeStatus = pool.getCanonicalVersion(employeeStatus);

Coming back to getCanonicalVersion(), notice that the first thing we do is check whether our pool has got too big. If it has, we simply clear it out in this example. Arguably, there are more sophisticated things we could do, but the point is that we're able to do something to reduce the chances of a memory leak. A downside of naively clearing out the string pool is that we might now create more than one object with the same content. So the technique of using == to compare canonicalised strings (which, as I say, I generally don't recommend anyway) now won't be reliable.
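As one possible example of a "more sophisticated" policy, the sketch below evicts the least recently used string instead of clearing the whole pool. It is an assumption-laden alternative, not a replacement for the class above: the name LruStringPool is hypothetical, and it trades the ConcurrentHashMap for a synchronized LinkedHashMap, so it will not scale as well under heavy contention.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a bounded pool that drops the least recently used entry when full.
final class LruStringPool {
    private final Map<String, String> map;

    LruStringPool(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used
        this.map = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized String getCanonicalVersion(String str) {
        String canon = map.get(str); // a hit also marks the entry as recently used
        if (canon == null) {
            map.put(str, str);       // may evict the least recently used entry
            canon = str;
        }
        return canon;
    }

    synchronized int size() {
        return map.size();
    }
}
```

As with clearing the pool, an evicted string may later be pooled again as a different object, so == comparison remains unreliable with this variant too.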

WeakHashMap

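Another option for string canonicalisation (and object canonicalisation in general) is WeakHashMap. A WeakHashMap automatically removes a mapping once its key is no longer strongly referenced from elsewhere. Note that only the keys are held weakly: values are held strongly, so in a pool that maps each string to itself, the value must be wrapped in a WeakReference or the entry will never be cleared. WeakHashMap also has the disadvantage of requiring explicit synchronization, so it may not be a good choice for highly concurrent access by multiple threads.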
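A minimal sketch of a WeakHashMap-based pool might look as follows. The class name WeakStringPool is hypothetical. The value is wrapped in a WeakReference because WeakHashMap holds its values strongly, and a value that strongly referenced its own key would keep the entry alive indefinitely. Synchronization here is a single lock on the pool, which is simple but not suited to heavy contention.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch: entries disappear once nothing else references the pooled string.
final class WeakStringPool {
    private final Map<String, WeakReference<String>> map = new WeakHashMap<>();

    synchronized String getCanonicalVersion(String str) {
        WeakReference<String> ref = map.get(str);
        // The referent can be null if the string was collected in the meantime
        String canon = (ref == null) ? null : ref.get();
        if (canon == null) {
            map.put(str, new WeakReference<>(str));
            canon = str;
        }
        return canon;
    }
}
```

Strings that are string literals or have been interned may stay reachable through the JVM's own pool, so this approach is most useful for strings created at runtime.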

A concurrent map with similar functionality to WeakHashMap was apparently planned for inclusion in Java 7.



Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.