InputStream buffering

On the previous pages, we've seen how to read bytes from an InputStream and to correctly handle I/O errors while reading. In our examples, so far, we have been reading byte by byte from the stream. This turns out to be inefficient for many types of InputStream.

Calling read() for every single byte on a FileInputStream (and many other types of input stream) means that for every single byte, Java will call a native operating system method. And calling native methods from Java is often a relatively expensive operation [1]. (Most operating systems do some amount of underlying buffering, so there wouldn't be, for example, a separate disk read for every single byte read. In many cases, it is essentially the cost of the OS call itself that is the big problem.)

You'll recall that InputStream also provides versions of read() for reading multiple bytes into an array. Provided that the subclass in question actually optimises these methods (and, for example, FileInputStream does), then we can make a single OS call to read multiple bytes. So in our example of checking if a file is a JPEG file, we could create a four-element byte array and fill the array in a single call, then read the bytes from the array to check if they match the signature of a JPEG file.

In practice, using the multi-byte read() calls in this way isn't always convenient. For one thing, these calls aren't guaranteed to read the requested number of bytes, even if available. So we still need to sit in a loop, reading until we have filled the array (even though in practice, we probably would read the first four bytes in one go).
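The loop described above might look something like the following minimal sketch. The class name ReadFullyExample, the helpers readFully and startsWith, and the placeholder signature bytes are all invented for illustration; they are not part of the Java API or of the earlier examples.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class ReadFullyExample {

  // Keeps calling the multi-byte read() until 'buf' is full or the stream ends.
  // Returns the number of bytes actually placed in 'buf'.
  static int readFully(InputStream in, byte[] buf) throws IOException {
    int total = 0;
    while (total < buf.length) {
      int n = in.read(buf, total, buf.length - total);
      if (n < 0) {
        break; // end of stream reached before the array was full
      }
      total += n;
    }
    return total;
  }

  // True if the stream begins with exactly the bytes in 'signature'.
  static boolean startsWith(InputStream in, byte[] signature) throws IOException {
    byte[] head = new byte[signature.length];
    return readFully(in, head) == signature.length
        && Arrays.equals(head, signature);
  }

  public static void main(String[] args) throws IOException {
    byte[] signature = {1, 2, 3, 4};   // placeholder values, not a real file signature
    byte[] data = {1, 2, 3, 4, 5, 6};  // stands in for the contents of a file
    System.out.println(startsWith(new ByteArrayInputStream(data), signature));
  }
}
```

From Java 9 onwards, InputStream.readNBytes(byte[], int, int) performs essentially this loop for you, so a hand-written version is mainly of interest on older Java versions.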

Buffering with BufferedInputStream

To make life a bit easier, Java provides a "wrapper" input stream called BufferedInputStream. This is constructed around another base input stream such as FileInputStream, but buffers reads in the background. That is, we can call the single-byte version of read(), and BufferedInputStream will behind the scenes read multiple bytes from the file into a buffer and then serve us bytes from the buffer. This "wrapper" model means that we can add buffering with a single line of code:

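public boolean isJpegFile(File f) throws IOException {
  InputStream in = new FileInputStream(f);
  // The one extra line that adds buffering:
  in = new BufferedInputStream(in);
  try {
    // A JFIF-format JPEG file starts with the bytes FF D8 FF E0; the
    // text "JFIF" only appears later, at offset 6. (Exif-style JPEGs
    // have E1 as the fourth byte and are not matched by this check.)
    return (in.read() == 0xFF &&
            in.read() == 0xD8 &&
            in.read() == 0xFF &&
            in.read() == 0xE0);
  } finally {
    // An error while closing is deliberately ignored here.
    try { in.close(); } catch (IOException ignore) {}
  }
}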

In practice, it's common to chain the constructors together to make the code a bit more elegant:

in = new BufferedInputStream(new FileInputStream(f));

Should the buffer construction go inside or outside the try/catch block?

Note that the above example assumes that no error will occur while constructing the BufferedInputStream after we've created the FileInputStream. If one did occur, the finally clause wouldn't be executed with the above structure, because the try block would not yet have been entered. However, this course of events is so unlikely that we'd probably live with it. The file would still eventually get closed: FileInputStream has traditionally had an implementation of finalize which "in emergencies" performs the close prior to garbage collection (more recent JDKs use an equivalent cleanup mechanism instead). The most likely cause of being unable to create a BufferedInputStream is probably an OutOfMemoryError, and if one of those occurs, you've got much more to worry about than timely closure of a file...
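If you are on Java 7 or later, one way to sidestep the question is the try-with-resources statement. The following is a minimal sketch rather than a drop-in replacement for the example above: the class name TryWithResourcesExample and the method firstByte are invented for illustration. Because each stream is declared as a separate resource, the FileInputStream is closed even if constructing the BufferedInputStream fails.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class TryWithResourcesExample {

  // Returns the first byte of the file (0-255), or -1 if the file is empty.
  static int firstByte(File f) throws IOException {
    // Resources are closed automatically in reverse order of declaration.
    // 'raw' is already registered for closing by the time the
    // BufferedInputStream constructor runs.
    try (InputStream raw = new FileInputStream(f);
         InputStream in = new BufferedInputStream(raw)) {
      return in.read();
    }
  }

  public static void main(String[] args) throws IOException {
    // A throwaway temporary file, created only for this demonstration.
    Path tmp = Files.createTempFile("buffering-demo", ".bin");
    try {
      Files.write(tmp, new byte[] {0x41, 0x42, 0x43});
      System.out.println(firstByte(tmp.toFile()));
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```

One difference in behaviour to be aware of: here an IOException thrown by close() propagates to the caller, whereas the try/finally version above silently ignores it.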

When should you use BufferedInputStream?

As a general rule of thumb, you should always wrap an input stream in a BufferedInputStream except where you know there's other buffering going on. For example, if you are calling the multi-byte reads on a FileInputStream and reading a reasonably large number of bytes (say, a few K) at a time, then adding an extra BufferedInputStream probably won't give you much benefit and could even slow down your I/O slightly due to the extra buffer copying. But if you're not sure, the penalty for not buffering is generally much greater than the penalty of an unnecessary extra layer of buffering.
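To get a feel for the difference buffering makes, the sketch below counts how many read calls actually reach the underlying stream with and without a BufferedInputStream in the chain. It uses an in-memory ByteArrayInputStream as a stand-in for a file, so it counts Java method calls rather than real OS calls; CountingInputStream, drain and countCalls are hypothetical helpers written for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter {

  // Hypothetical helper: counts every read call that reaches the wrapped stream.
  static final class CountingInputStream extends FilterInputStream {
    int calls;

    CountingInputStream(InputStream in) {
      super(in);
    }

    @Override
    public int read() throws IOException {
      calls++;
      return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      calls++;
      return super.read(b, off, len);
    }
  }

  // Reads the stream to the end, one byte at a time.
  static void drain(InputStream in) throws IOException {
    while (in.read() != -1) {
      // nothing to do with the byte; we only care about the call count
    }
  }

  // Number of read calls seen by the underlying stream while draining 'data'.
  static int countCalls(byte[] data, boolean buffered) throws IOException {
    CountingInputStream counter =
        new CountingInputStream(new ByteArrayInputStream(data));
    InputStream in = buffered ? new BufferedInputStream(counter) : counter;
    drain(in);
    return counter.calls;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = new byte[10_000];
    System.out.println("unbuffered calls: " + countCalls(data, false));
    System.out.println("buffered calls:   " + countCalls(data, true));
  }
}
```

Without buffering, the underlying stream sees one call per byte plus a final call that reports end of stream. With buffering, it sees only a handful of multi-byte calls, however many times the caller invokes the single-byte read().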

Java's model of implementing I/O buffering as an extra InputStream in the chain is generally quite neat (as you saw above, it means we can usually add I/O buffering to arbitrary code with a single line). A disadvantage is that library methods that say they take an InputStream for input don't consistently state whether they then buffer the input or whether they expect the input to be pre-buffered. Similarly, it's not always clear whether some unspecified flavour of InputStream returned by a method already implements buffering. If in doubt, buffer.
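One way of applying the "if in doubt, buffer" rule to a stream of unknown origin is sketched below. BufferingHelper and ensureBuffered are hypothetical names, and the instanceof test is only a rough heuristic: it spots a stream that is itself a BufferedInputStream, but cannot detect other classes that happen to do their own buffering internally. In those cases the stream simply gets the (usually harmless) extra layer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

final class BufferingHelper {

  // Returns 'in' unchanged if it is already a BufferedInputStream;
  // otherwise wraps it in a new BufferedInputStream.
  static InputStream ensureBuffered(InputStream in) {
    return (in instanceof BufferedInputStream) ? in : new BufferedInputStream(in);
  }

  public static void main(String[] args) {
    InputStream unknown = new ByteArrayInputStream(new byte[] {1, 2, 3});
    InputStream once = ensureBuffered(unknown);
    InputStream twice = ensureBuffered(once);
    // The second call does not add another layer.
    System.out.println(once == twice);
  }
}
```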

Configuring the buffer size

When creating a BufferedInputStream, it is possible to specify the buffer size in bytes. The buffer size can affect both the overall read time (requesting a larger number of bytes at a time can involve fewer "round trips" to a hard disk or network to request data) and CPU time (the more data asked for at a time, the less time proportionally is likely to be spent inside the "housekeeping code" around each data request). It turns out that the default buffer size is generally a good choice; the separate page on choosing an input buffer size with BufferedInputStream goes into more detail.
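As a minimal sketch of specifying the size, the two-argument constructor BufferedInputStream(InputStream, int) takes the buffer size in bytes. The 64K figure below is an arbitrary value chosen for illustration, not a recommendation, and BufferSizeExample and countBytes are hypothetical names. For comparison, the default in common JDK implementations is 8K.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class BufferSizeExample {

  // Counts the bytes in a file, reading one byte at a time through a
  // BufferedInputStream with the given buffer size (must be greater than 0).
  static long countBytes(File f, int bufferSize) throws IOException {
    try (InputStream raw = new FileInputStream(f);
         InputStream in = new BufferedInputStream(raw, bufferSize)) {
      long count = 0;
      while (in.read() != -1) {
        count++;
      }
      return count;
    }
  }

  public static void main(String[] args) throws IOException {
    // A throwaway temporary file, created only for this demonstration.
    Path tmp = Files.createTempFile("buffer-size-demo", ".bin");
    try {
      Files.write(tmp, new byte[100_000]);
      System.out.println(countBytes(tmp.toFile(), 64 * 1024));
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```

The result is the same whatever buffer size is chosen; only the number and size of the underlying reads change.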


Notes:
1. There are some exceptions to this: some very basic native methods such as the various functions in java.lang.Math actually get converted "directly" into machine instructions by modern VMs.


Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.