Saving memory used by Java strings: a one-byte-per-character CharSequence implementation
As mentioned in our discussion of how to save memory occupied by Java
Strings, many methods and operations will actually work on any old CharSequence: it
doesn't specifically have to be a String. In case you're not familiar with it,
CharSequence is an interface that requires us to provide methods to:
- retrieve the length of the sequence;
- retrieve the character at a given position;
- create a CharSequence instance representing a given subsequence;
- create a String from the CharSequence.
Library classes that store sequences of characters (String, StringBuffer,
StringBuilder, CharBuffer) already implement CharSequence.
But significantly, if we define some object holding our string that has just these methods, then
we are free to store the characters more compactly if we can.
And the resulting object can be used whenever a method requires only a CharSequence:
i.e. as a drop-in replacement for String when storing, printing, writing to file—
even running regular expressions on the string.
To implement our character sequence, we can fairly easily do
the following:
- store the characters of our string in a byte array, with one byte
per character (i.e. we only allow ASCII characters or some range of 256 characters, not full Unicode!);
- provide the above wrapper methods to pull appropriate bytes out of the array and convert
to characters.
We'll call our class the CompactCharSequence and it will start as follows:
public class CompactCharSequence implements CharSequence, Serializable {
static final long serialVersionUID = 1L;
private static final String ENCODING = "ISO-8859-1";
private final int offset;
private final int end;
private final byte[] data;
public CompactCharSequence(String str) {
try {
data = str.getBytes(ENCODING);
offset = 0;
end = data.length;
} catch (UnsupportedEncodingException e) {
throw new RuntimeException("Unexpected: " + ENCODING + " not supported!");
}
}
public char charAt(int index) {
int ix = index+offset;
if (ix >= end) {
throw new StringIndexOutOfBoundsException("Invalid index " +
index + " length " + length());
}
return (char) (data[ix] & 0xff);
}
public int length() {
return end - offset;
}
}
This part should hopefully be fairly straightforward. We provide a constructor with which
we can create an instance of CompactCharSequence, passing in the String
containing the characters we want to store. We could of course allow something else such
as a char array to be passed in. When passing in a String, we use the
getBytes() method to retrieve the contents of the string as a byte array, asking
for a 1-byte-per-character encoding. In this case, ISO-8859-1 is the standard 1-byte-per-character encoding
used for many European languages (notably except Russian). Notice that we store also an offset
and end, because in our implementation we choose to share the array when
subsequences are created. That's just a design decision: we could instead copy a section of the
array each time and dispense with the two extra fields.
The charAt() and length()
methods are then fairly trivial: we just need to be careful to note that the first character
of the string isn't necessarily the first character of the array in our implementation, and similarly that
the last character of the sequence may actually not be the last character of the array.
To convert a byte to a char, we AND with 255 (0xff hex), which effectively removes
the sign.
Creating subsequences
We now need to provide a method to create a subsequence of a given sequence. We said that
we're going to share the array when we do so. So firstly, we provide an extra private
constructor, which takes not only the array of chars but also a start offset and end point in the array:
private CompactCharSequence(byte[] data, int offset, int end) {
this.data = data;
this.offset = offset;
this.end = end;
}
Now, our public subSequence() method is trivial:
public CharSequence subSequence(int start, int end) {
if (start < 0 || end >= (this.end-offset)) {
throw new IllegalArgumentException("Illegal range " +
start + "-" + end + " for sequence of length " + length());
}
return new CompactCharSequence(data, start + offset, end + offset);
}
Implementing toString()
Finally, we need to override toString() to provide an actual String
representation of the CharSequence. For this, we can just pass the char array into
a String constructor, remembering to specify the encoding in which the bytes are
stored, and the start offset in the array plus the length of the string:
public String toString() {
try {
return new String(data, offset, end-offset, ENCODING);
} catch (UnsupportedEncodingException e) {
throw new RuntimeException("Unexpected: " + ENCODING + " not supported");
}
}
And voilà! We have a class that will store a string in roughly half the
memory of a String, and is almost as convenient to use. We can pass instances of
CompactCharSequence to any method that will take a CharSequence,
notably PrintWriter.println() and the regular expression API. When implementing
our own methods that work on strings, we should generally consider making them accept
any CharSequence rather than specifically a String. And in the worst
case, we can convert a CompactCharSequence to a String "on the fly"
if we need to pass it to some method that requires a String.
Other methods to implement
As it stands, our CompactCharSequence wouldn't be suitable for use as a key to
a hash map (a typical use of String). I actually recommend CompactCharSequence
primarily for cases where we need to cache in memory the content of something such as a web
page or file in string form and treat it as content, rather than where we are using the
character sequence "funcionally" for something such as a hash map key.
But if we did require the latter functionality, then we would need to
override hashCode() and equals().
If you enjoy this Java programming article, please share with friends and colleagues. Follow the author on Twitter for the latest news and rants.
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.