How to save memory occupied by Java strings
So far we have discussed the memory usage of Java strings,
including the memory usage of StringBuffer (and StringBuilder).
We also mentioned that in many applications, Strings can account for a significant proportion of
a program's memory usage. On this page, we'll look at some ways to reduce that memory usage. Before we do
so, it's worth saying from the outset:
The techniques shown here are generally recommended only when you have determined that
memory usage of strings is causing you a problem. It usually isn't worth
prematurely optimising memory usage of Strings.
That said, if you decide you do need to optimise your program's use of strings,
then which method is more effective depends to some extent on whether you actually need an instance
of String. If you're only going to perform the following operations, then any old
CharSequence will do; you don't actually need a String:
- storing the string in memory;
- printing out the string or writing it to a stream;
- retrieving individual characters/substrings from the string;
- performing matches against regular expressions.
We'll see below that if you can write your operations to work with any CharSequence rather
than specifically a String, then you can potentially save memory.
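As a rough sketch of what "working with any CharSequence" can look like, the hypothetical helper class below (the class and method names are ours, purely for illustration) covers regular expression matching, retrieving characters and sub-sequences, and writing to an output, all without requiring a String:

```java
import java.io.IOException;
import java.util.regex.Pattern;

// Hypothetical helper class: names are illustrative, not part of any library.
class CharSequenceOps {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Regular expression matching: Pattern.matcher() accepts any CharSequence.
    static boolean containsDigits(CharSequence text) {
        return DIGITS.matcher(text).find();
    }

    // Retrieving characters and a sub-sequence without needing a String.
    static CharSequence firstWord(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == ' ') {
                return text.subSequence(0, i);
            }
        }
        return text;
    }

    // Writing to a stream or buffer: Appendable.append() takes a CharSequence.
    static void writeTo(Appendable out, CharSequence text) throws IOException {
        out.append(text).append('\n');
    }

    public static void main(String[] args) throws IOException {
        // The "string" is held in a StringBuilder rather than a String.
        CharSequence stored = new StringBuilder("order 66 received");
        System.out.println(containsDigits(stored));
        System.out.println(firstWord(stored));
        writeTo(System.out, stored);
    }
}
```

Because the methods are declared against CharSequence, callers can pass a String, a StringBuilder or a custom implementation interchangeably.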
If you don't need a String
If you don't actually need to store your string as a Java String object, then other
alternatives may have a lower memory footprint. For example:
- a StringBuilder or StringBuffer truncated to the exact length of the string (via the
trimToSize() method) uses 8 bytes less than a String with the same content and is
almost as convenient to use in many cases;
- if you only need one byte per character (e.g. because you're only using ASCII or ISO-8859-1 character
encoding), then you can store a string in a byte array, and convert to a String
"at the last minute" whenever you need to pass it to a method;
- if more convenient, you can create a CharSequence implementation that reads the
characters out of the byte array on the fly.
See our example of a one-byte-per-character CharSequence
implementation on the next page.
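To illustrate the byte array idea, here is a minimal sketch, assuming all of the text fits in ISO-8859-1. The CompactName class is a hypothetical name of ours, and it is deliberately simpler than a full CharSequence implementation. The main method also shows the trimToSize() call mentioned above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical holder class; the name and methods are illustrative only.
class CompactName {
    // One byte per character. Characters outside ISO-8859-1 cannot be
    // represented and would be replaced by getBytes().
    private final byte[] bytes;

    CompactName(String s) {
        this.bytes = s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Convert back to a String "at the last minute".
    String asString() {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    int length() {
        return bytes.length;
    }

    public static void main(String[] args) {
        CompactName name = new CompactName("Ren\u00e9e");
        System.out.println(name.asString() + " stored in " + name.length() + " bytes");

        // The StringBuilder alternative: release spare capacity once built.
        StringBuilder sb = new StringBuilder(64).append("PART_TIME_EMPLOYEE");
        sb.trimToSize();
        System.out.println(sb.length() + " chars, capacity now " + sb.capacity());
    }
}
```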
If you really need instances of String
The first thing to remember is that if you can find a way to store your string in another object
then you can always convert to a String "on the fly" before calling a method that needs it.
Any CharSequence, including our CompactCharSequence class,
provides a toString() method that performs this conversion.
If you really need your strings as Strings, then you may still be able to save some
memory:
- when creating substrings of strings, remember that on older JVMs (before Java 7 update 6),
where a substring shares its parent's character array, you'll waste memory
if you throw away the parent string: in that case, consider constructing a new string
around the substring;
- if you have many strings with the same content (e.g. because you have Java objects
that encapsulate rows from a database where certain columns tend to have one of a fixed set of values),
consider:
- "casting" the string to an enum and back again;
- canonicalising the strings.
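The suggestions in the list above might be sketched as follows. The EmployeeStatus enum and the helper methods are hypothetical names chosen for illustration; the enum values simply mirror the status strings used as an example later on this page.

```java
// Hypothetical enum mirroring the employee status values used in this article.
enum EmployeeStatus {
    FULL_TIME_EMPLOYEE, PART_TIME_EMPLOYEE
}

class EnumStatusDemo {
    // String -> enum: throws IllegalArgumentException for an unknown value.
    static EmployeeStatus parse(String fromDatabase) {
        return EmployeeStatus.valueOf(fromDatabase);
    }

    // enum -> String: name() returns the same String object on every call.
    static String format(EmployeeStatus status) {
        return status.name();
    }

    // Construct a new string around a substring. This matters on JVMs where
    // substring() shares the parent's character array.
    static String detachedSubstring(String parent, int begin, int end) {
        return new String(parent.substring(begin, end));
    }

    public static void main(String[] args) {
        // Simulate a freshly allocated string read from a database row.
        String raw = new String("PART_TIME_EMPLOYEE");
        EmployeeStatus status = parse(raw);

        // Store the enum in the entity; convert back only when a String is needed.
        System.out.println(format(status));
        System.out.println(detachedSubstring("PART_TIME_EMPLOYEE", 0, 4));
    }
}
```

Each entity then holds a reference to one of two enum constants rather than its own copy of the status text.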
String Canonicalisation
Canonicalisation is the technique of ensuring that there is only one actual object
with a given content. Every time we need an object with that content, we create a reference
to that single object. For example, if we have a thousand database entities in memory,
each with a string containing either "FULL_TIME_EMPLOYEE" or "PART_TIME_EMPLOYEE",
we'd like to create a thousand references to one of two String objects, rather
than a thousand String objects.
In other words, we'd like to keep a pool
of strings. Every time we need to store a string with a given content, we first check to see if
there's already a string with that content in our pool. If there isn't, we add it. Otherwise,
we use a reference to the string from the pool instead of holding on to the new one.
A simple (but not necessarily flexible) way to do this is via the String.intern() method.
Another is to explicitly use something such as a HashMap.
Canonicalisation with String.intern()
The running JVM actually already contains a string pool, which it uses for any
literal string hardcoded in classes and for other special purposes such as class and method names.
You can ask the VM to pool any given string by calling the intern() method on that string.
So, say, when reading the EmployeeStatus field from our database, we can do something
such as the following:
    ResultSet rs = retrieveEmployeesFromDatabase();
    while (rs.next()) {
        int employeeId = rs.getInt(1);
        String employeeStatus = rs.getString(2);
        // Replace the freshly read string with the pooled instance
        employeeStatus = employeeStatus.intern();
        // ... construct entity ...
    }
Using the intern() method gives us the slight performance
advantage that we can compare interned
strings using the == operator. For example, the following would produce expected results:
    if (employee.status == "FULL_TIME_EMPLOYEE") {
        // ...
    }
    if (employee1.status == employee2.status) {
        // ...
    }
In actual fact, I would recommend still using equals(). That way, you're less likely
to get a bug from forgetting to intern(). And since the first operation of String.equals()
is to compare the actual object references for equality anyway, the performance gain of avoiding the method
call (bearing in mind the JIT compiler could even inline it anyway) will be negligible.
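The kind of bug we have in mind can be seen in a small sketch. Here, new String(...) stands in for a string that did not come from a literal, such as one read from a database. The class and method names are hypothetical.

```java
class InternDemo {
    // Simulates a string arriving from outside the program (not a literal),
    // so it is a distinct object from the pooled literal with the same content.
    static String simulateDatabaseRead() {
        return new String("FULL_TIME_EMPLOYEE");
    }

    public static void main(String[] args) {
        String status = simulateDatabaseRead();

        // We forgot to intern: == compares references, so this is false.
        System.out.println(status == "FULL_TIME_EMPLOYEE");

        // After interning, we get the pooled literal back, so this is true.
        System.out.println(status.intern() == "FULL_TIME_EMPLOYEE");

        // equals() gives the right answer whether or not we interned: true.
        System.out.println(status.equals("FULL_TIME_EMPLOYEE"));
    }
}
```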
The principal problem with this approach is that there is no way to
de-intern a String once it has been interned. Thus, using intern()
is liable to cause a memory leak if you are not careful. We also have no
real way to query the current status of the JVM's internal string pool and find out how full
it is: the first we'll know is probably
when we get an OutOfMemoryError calling intern() (or worse, at some random
point thereafter...).
In general, you should never call intern() on user-generated strings,
as your application will be susceptible to an attack whereby the user deliberately
sends a large number of different strings to your application and fills up the memory with
pooled strings that will never be removed.
So another option is to "roll our own" string pool.
Canonicalisation with an explicit collection (HashMap etc)
A more controllable approach is to use an explicit Java collection to implement your own
"string pool". A typical way to do this is to create a map that maps each string to
itself. In concurrent environments, a good choice is ConcurrentHashMap. The essential idea is that we create a string pool as
follows:
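    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class StringPool {
        private final ConcurrentMap<String,String> map =
            new ConcurrentHashMap<String,String>(1000);

        public String getCanonicalVersion(String str) {
            // Crude guard against unbounded growth: empty the pool once it gets too big
            if (map.size() > 10000) {
                map.clear();
            }
            // putIfAbsent() returns null if str was newly added, else the existing entry
            String canon = map.putIfAbsent(str, str);
            return (canon == null) ? str : canon;
        }
    }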
Recall that the putIfAbsent() method returns null if the mapping was
added (because it wasn't there before), in which case we return str as it was the
first instance of that string. Otherwise, putIfAbsent() returns the previous
value associated with the given string (the "canonicalised" version added on a previous
call to getCanonicalVersion()) and we return that.
Now, with an instance of StringPool in hand, we replace our intern()
call from the code above with:
    employeeStatus = pool.getCanonicalVersion(employeeStatus);
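To see the effect, here is a self-contained sketch. DemoStringPool is a cut-down, hypothetical copy of the pool idea above, with the size check left out for brevity and a size() method added purely for the demonstration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Cut-down copy of the pool idea (no size limit) so the example stands alone.
class DemoStringPool {
    private final ConcurrentMap<String, String> map = new ConcurrentHashMap<>();

    String getCanonicalVersion(String str) {
        String canon = map.putIfAbsent(str, str);
        return (canon == null) ? str : canon;
    }

    // Added for the demonstration only.
    int size() {
        return map.size();
    }
}

class PoolUsageDemo {
    public static void main(String[] args) {
        DemoStringPool pool = new DemoStringPool();

        // Two distinct String objects with the same content...
        String a = pool.getCanonicalVersion(new String("PART_TIME_EMPLOYEE"));
        String b = pool.getCanonicalVersion(new String("PART_TIME_EMPLOYEE"));

        // ...end up as two references to the first instance seen by the pool.
        System.out.println(a == b);      // true
        System.out.println(pool.size()); // 1
    }
}
```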
Coming back to the getCanonicalVersion() method, notice that
the first thing we do is check if our pool has got too big. If it has, we simply clear it
out in this example. Arguably, there are more sophisticated things we could do, but the point
is that we're able to do something to reduce the chances of a memory leak.
A downside of naively clearing out the string pool is that we now might create
more than one object with the same content. So the technique of using == to compare
canonicalised strings (which, as I say, I generally don't recommend anyway) now won't be
reliable.
WeakHashMap
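Another option for string canonicalisation (and object canonicalisation in general)
is WeakHashMap. A WeakHashMap automatically clears out mappings when
its keys no longer have other references. Note that only the keys are held weakly:
values are held strongly, so when mapping a string to itself, the value needs to be
wrapped in a WeakReference or the entry will never be cleared. WeakHashMap also has the
disadvantage of requiring explicit synchronization, so may not be a good
choice for highly concurrent access by multiple threads.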
A concurrent map with similar functionality to
WeakHashMap was at one point apparently planned for inclusion in Java 7, though the standard library does not currently provide one.
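A WeakHashMap-based pool could be sketched roughly as follows. This is only one possible arrangement under our own assumptions: WeakStringPool is a hypothetical class name, the values are wrapped in WeakReference because WeakHashMap holds its values strongly, and a synchronized method supplies the explicit synchronization.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical class name; a sketch rather than a definitive implementation.
class WeakStringPool {
    // Values are wrapped in WeakReference: mapping a string directly to
    // itself would keep a strong reference to the key and so stop the
    // entry from ever being cleared.
    private final Map<String, WeakReference<String>> map = new WeakHashMap<>();

    // WeakHashMap is not thread-safe, so access is synchronized here.
    synchronized String getCanonicalVersion(String str) {
        WeakReference<String> ref = map.get(str);
        String canon = (ref == null) ? null : ref.get();
        if (canon == null) {
            // Not pooled (or already collected): str becomes the canonical copy.
            map.put(str, new WeakReference<>(str));
            canon = str;
        }
        return canon;
    }

    public static void main(String[] args) {
        WeakStringPool pool = new WeakStringPool();
        String a = pool.getCanonicalVersion(new String("FULL_TIME_EMPLOYEE"));
        String b = pool.getCanonicalVersion(new String("FULL_TIME_EMPLOYEE"));
        // While 'a' is still strongly referenced, both calls return it.
        System.out.println(a == b);
    }
}
```

Entries become eligible for removal once nothing outside the pool refers to the canonical string, so there is no need for the explicit size check. As with the naive clear(), comparing canonicalised strings with == across long periods is not something we would rely on.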
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.