Build Adapters for Reuse
You can save space and even time by compressing your persistent data store. The interesting trick is to provide this functionality by adapting your current algorithms, rather than completely replacing them.
Technology Toolbox: C#
Our industry has slowly but surely begun to embrace XML as the lingua franca for all file-based persistent storage.
In some ways, that's good. It means that your persistent storage is in a format that is reasonably open and accessible, even when the originating program is not available. It also simplifies creating multiple programs that can access the same data store. There is a downside, however. It's hard to imagine a more verbose format for data than an XML file. Meaningful tags, whitespace, and other attribute information all add size to the persistent storage when using XML for persistence. Yes, it's true that disk space is cheap, and there's little need to find a more compact format as we move data to disk. But today's applications often need to move data from one machine to another across the network. And that's where the size of the transmission still matters and where compression can really pay off.
Fortunately, you can use compression streams in C# in a way that minimizes their impact on other code. You can also apply these same techniques for minimizing code impact in other, similar situations.
The .NET class library supports compression through the classes in the System.IO.Compression namespace. You can choose between two stream types: Deflate and GZip. Both classes (DeflateStream and GZipStream) derive from System.IO.Stream, so using them is straightforward: you create a compression (or decompression) stream over an underlying stream, and then read or write as you would to any other stream. It requires a grand total of 20 lines of code to read and write an object to a compressed file stream (Listing 1).
Begin by creating the target stream—a file, in this case. Your target stream could also be a memory stream, or any other class derived from System.IO.Stream, such as a network stream. Next, you write to or read from the compressed stream.
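Listing 1 itself isn't reproduced here, but the direct approach it describes looks roughly like this sketch; the SampleData payload and the CompressedIo class name are stand-ins of my own:

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Serialization;

public class SampleData
{
    public int Value;   // placeholder payload; the article uses its own SampleData type
}

public static class CompressedIo
{
    public static void Save(string pathName, SampleData data)
    {
        // Create the target stream, wrap it in a compression stream, then serialize.
        using (FileStream file = new FileStream(pathName, FileMode.Create, FileAccess.Write))
        using (GZipStream compressed = new GZipStream(file, CompressionMode.Compress))
        {
            XmlSerializer ser = new XmlSerializer(typeof(SampleData));
            ser.Serialize(compressed, data);
        }
    }

    public static SampleData Load(string pathName)
    {
        // Mirror image: open the file, wrap it in a decompression stream, deserialize.
        using (FileStream file = new FileStream(pathName, FileMode.Open, FileAccess.Read))
        using (GZipStream decompressed = new GZipStream(file, CompressionMode.Decompress))
        {
            XmlSerializer ser = new XmlSerializer(typeof(SampleData));
            return (SampleData)ser.Deserialize(decompressed);
        }
    }
}
```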
This code is quite simple, but it leaves a lot to be desired. As written, you need to copy-and-paste it every time you need to write or read a compressed file. That's highly inefficient. But you can do better than that. Small algorithms like this one are perfect candidates for creating once, and then reusing them.
Your next task is to refactor this code and create a way to adapt any target stream into a compressed stream. This is typically a trivial and straightforward exercise, but it proves a bit trickier here. Examine the order of the operations. First, you create the target stream. Second, you create a compressed stream that points to the target stream. Finally, you write the data to storage. But not so fast. Notice that GZipStream and FileStream both implement the IDisposable interface.
Dispose of Objects Properly
For your code to work correctly and efficiently, you need to make sure that you dispose of these objects when you're done writing your storage. This fact has two important implications for how you must create the reusable components you need. The first part of the code allocates what you need, while the middle section contains the custom code. The final section contains the cleanup code. The middle section is where you need to provide hooks that enable you to call the custom code, without disturbing the allocation or cleanup code surrounding it.
You can solve this problem in any of several ways, but my approach in this article uses C# 2.0's anonymous methods (commonly called anonymous delegates). In pseudocode form, it looks like this:
// Create a method to write the document.
// Create a second method that creates the compressed
// stream, calls your write-document method, and
// then releases the compressed stream.
// Create a third method that creates the file
// stream, calls the write method, and releases
// the file stream.
This description makes the code sound trickier than it is. The only new technique is something called deferred execution. Deferred execution means that you create a snippet of code that can be invoked later in the program's execution. In this case, that snippet performs the core save process. The easiest way to proceed is to begin at the outer layer and work inward. Some of the interim steps are not particularly easy to follow, but bear with me, and you'll see that the final solution is quite elegant and easy to use.
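To make deferred execution concrete before diving in, here is a toy sketch (the names Step, RunLater, and Deferred are hypothetical): a delegate packages a snippet of code now, and another method decides when to execute it later.

```csharp
using System;

public delegate void Step();

public static class Deferred
{
    // RunLater receives code as data and chooses when to execute it.
    public static void RunLater(Step step)
    {
        Console.WriteLine("before");
        step();                     // the deferred snippet runs here
        Console.WriteLine("after");
    }

    public static void Main()
    {
        // The anonymous delegate is defined now but executed inside RunLater.
        RunLater(delegate { Console.WriteLine("deferred work"); });
    }
}
```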
The next step is to refactor the code so it writes the uncompressed stream (Listing 2). You'll use this technique extensively as you continue to modify this algorithm, so let's examine it closely.
The first thing you must do is declare a delegate type for the method that writes the contents to the open stream:
public delegate void WriteToStream(Stream s);
The WriteFileStream method opens the stream:
using (FileStream file = new FileStream(pathName,
    FileMode.Create, FileAccess.Write))  // Create = create new, or truncate existing
Next, it calls a delegate that matches the WriteToStream signature, passing the open stream as its argument.
Finally, the closing brace disposes of the file stream. The calling code, which I'll discuss in a moment, shows how to create the anonymous delegate that implements writing the file.
The method call takes two parameters: the name of the output file and the anonymous delegate. First, the calling code creates the data to save:
SampleData s = new SampleData(true);
The second parameter contains the body of the anonymous delegate:
using (GZipStream compressedStream = new GZipStream(file, CompressionMode.Compress))
{
    new XmlSerializer(typeof(SampleData)).Serialize(compressedStream, s);
}
The delegate keyword declares an anonymous delegate; the items in parentheses declare the parameter list. The remainder of the code is just like any other method you write. What you've done so far is to create a pattern that allows you to customize the code that's between the initialization and cleanup code. This is a powerful way to put different building blocks together.
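Assembled from the fragments above into one compilable sketch (the Caller class, the Flag field, and some parameter names are my own; the WriteFileStream method and WriteToStream delegate follow the article):

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Serialization;

public class SampleData
{
    public bool Flag;                          // placeholder payload
    public SampleData() { }                    // required by XmlSerializer
    public SampleData(bool flag) { Flag = flag; }
}

public delegate void WriteToStream(Stream s);

public static class GenericSerializer
{
    // Owns the file stream's lifetime; the delegate supplies the content.
    public static void WriteFileStream(string pathName, WriteToStream writer)
    {
        using (FileStream file = new FileStream(pathName,
            FileMode.Create, FileAccess.Write))
        {
            writer(file);
        }   // the closing brace disposes of the file stream
    }
}

public static class Caller
{
    public static void Save(string pathName)
    {
        SampleData s = new SampleData(true);
        GenericSerializer.WriteFileStream(pathName, delegate(Stream file)
        {
            // The anonymous delegate captures 's' and writes it, compressed.
            using (GZipStream compressedStream =
                new GZipStream(file, CompressionMode.Compress))
            {
                XmlSerializer ser = new XmlSerializer(typeof(SampleData));
                ser.Serialize(compressedStream, s);
            }
        });
    }
}
```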
Next, you use this same approach to refactor the snippet of code that opens and uses the compressed stream. Simply create a different method that initializes and closes the compressed stream, and then call a user-defined delegate to write the content (Listing 3).
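Listing 3 isn't reproduced here; a sketch of its shape, with a hypothetical CompressedStreamHelper class and my own parameter names, might read:

```csharp
using System.IO;
using System.IO.Compression;

public delegate void WriteToStream(Stream s);

public static class CompressedStreamHelper
{
    // Owns only the compression stream's lifetime; the target stream and
    // the content-writing code are both supplied by the caller.
    public static void WriteCompressedStream(Stream target, WriteToStream writer)
    {
        using (GZipStream compressedStream =
            new GZipStream(target, CompressionMode.Compress))
        {
            writer(compressedStream);
        }
    }
}
```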
Build Libraries of Delegates
And now we get to the complicated part. The good news is that it's not as bad as it might seem, once you look at the problem a bit. The second parameter to the WriteCompressedStream method call contains the code to write the XML file. You package that code as an anonymous delegate. Moving out one block, the second parameter to the WriteFileStream method creates the compressed stream, writes the contents, and returns. You also package the second parameter for the WriteFileStream method as a delegate.
At this point, you've packaged the file stream and the compressed stream in such a way that you can reuse the code. The next step is to adapt this code to reusable XML serialization code by applying the lessons from last month's column [C#, Generics: Move Beyond Collections, VSM April 2007]:
SampleData s = new SampleData(true);
This code is elegant, but rather dense and not the easiest to understand. Your last step should be to make some changes to the package that enable developers to understand what is going on more easily:
public static void SaveCompressedXml<T>(string pathName, T data)
{
    WriteFileStream(pathName, delegate(Stream file)
        { WriteCompressedStream(file, delegate(Stream zip)
            { new XmlSerializer(typeof(T)).Serialize(zip, data); }); });
}
Next, repeat this process with your read code:
public static T ReadCompressedXml<T>(string pathName)
{
    return ReadFileStream<T>(pathName, delegate(Stream file)
        { return ReadCompressedStream<T>(file, delegate(Stream zip)
            { return (T)new XmlSerializer(typeof(T)).Deserialize(zip); }); });
}
Note that there are two major differences between the write code and the read code. First, you need to specify the type parameters in the read code because they are return values (see last month's column). Second, the anonymous delegates return values, but the delegate keyword doesn't indicate the return type; it is inferred from the body of the delegate.
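Pulling the whole pattern together, one possible self-contained rendering of the generic save and read paths (the ReadFromStream<T> delegate type and some local names are my own; the method names follow the article):

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Serialization;

public delegate void WriteToStream(Stream s);
public delegate T ReadFromStream<T>(Stream s);

public static class GenericSerializer
{
    // Outer layer: file stream lifetime.
    public static void WriteFileStream(string pathName, WriteToStream writer)
    {
        using (FileStream file = new FileStream(pathName, FileMode.Create, FileAccess.Write))
            writer(file);
    }

    public static T ReadFileStream<T>(string pathName, ReadFromStream<T> reader)
    {
        using (FileStream file = new FileStream(pathName, FileMode.Open, FileAccess.Read))
            return reader(file);
    }

    // Middle layer: compression stream lifetime.
    public static void WriteCompressedStream(Stream target, WriteToStream writer)
    {
        using (GZipStream zip = new GZipStream(target, CompressionMode.Compress))
            writer(zip);
    }

    public static T ReadCompressedStream<T>(Stream source, ReadFromStream<T> reader)
    {
        using (GZipStream zip = new GZipStream(source, CompressionMode.Decompress))
            return reader(zip);
    }

    // Convenience wrappers: compose the layers with XML serialization inside.
    public static void SaveCompressedXml<T>(string pathName, T data)
    {
        WriteFileStream(pathName, delegate(Stream file)
        {
            WriteCompressedStream(file, delegate(Stream zip)
            {
                new XmlSerializer(typeof(T)).Serialize(zip, data);
            });
        });
    }

    public static T ReadCompressedXml<T>(string pathName)
    {
        return ReadFileStream<T>(pathName, delegate(Stream file)
        {
            return ReadCompressedStream<T>(file, delegate(Stream zip)
            {
                return (T)new XmlSerializer(typeof(T)).Deserialize(zip);
            });
        });
    }
}
```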
You might be wondering why I went to the trouble of separating the code this way. The answer boils down to ensuring maximum reuse. What this structure lacks in simplicity it more than makes up for in functionality and reusability.
The first and easiest upgrade is to use a memory stream instead of a file stream. Change the outer method to use a memory stream and return the storage:
public static Byte[] WriteMemoryStream(WriteToStream writer)
{
    using (MemoryStream storage = new MemoryStream())
    { writer(storage); return storage.ToArray(); }
}
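For completeness, here is a compilable rendering of the memory-stream variant with a small usage example (the MemoryStreamHelper class name is mine; Byte[] is the natural return type, since MemoryStream.ToArray hands back the stored bytes):

```csharp
using System.IO;

public delegate void WriteToStream(Stream s);

public static class MemoryStreamHelper
{
    // Same shape as WriteFileStream, but the storage is an in-memory buffer
    // returned to the caller instead of a file left on disk.
    public static byte[] WriteMemoryStream(WriteToStream writer)
    {
        using (MemoryStream storage = new MemoryStream())
        {
            writer(storage);
            return storage.ToArray();
        }
    }
}
```

Usage is identical to the file-stream version: pass an anonymous delegate that writes (compressed or uncompressed) content, and collect the resulting bytes, ready to send across the network.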
This example should open your eyes to the fact that you can adapt these small algorithms to many other uses. You can mix and match compressed streams, uncompressed streams, file streams, network streams, and memory streams by combining these small methods in different ways. You can even change the innermost algorithm to use the binary or SOAP serializer, or you can write the contents in whatever format you wish.
Separating an outer algorithm (file or stream management) from an inner algorithm (reading or writing the contents of the stream) is useful in many different kinds of applications. Better still, anonymous delegates give you an elegant and efficient way to decouple the inner and outer portions of an algorithm.
Analyzing the Results
The compressed format is a binary format, but you shouldn't consider it a secure or encrypted format. A compressed format might keep your users from editing the storage by hand, but it won't keep even the most amateur hacker from getting at the underlying data. A full discussion of cryptography is well beyond the scope of this article, but you can extend the code in this sample to include encryption, if that's necessary for your application.
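The article's listings don't include encryption, but as one illustration of that extension, the same wrapper shape accommodates a CryptoStream. The EncryptedStreamHelper name, the algorithm choice, and the key/IV handling here are placeholders, not production-grade cryptography:

```csharp
using System.IO;
using System.Security.Cryptography;

public delegate void WriteToStream(Stream s);

public static class EncryptedStreamHelper
{
    // Wraps the target in a CryptoStream, exactly as the compression wrapper
    // wraps it in a GZipStream. Real key and IV management is the caller's job.
    public static void WriteEncryptedStream(Stream target, byte[] key, byte[] iv,
        WriteToStream writer)
    {
        using (SymmetricAlgorithm alg = new RijndaelManaged())  // AES-class algorithm
        using (CryptoStream crypto = new CryptoStream(
            target, alg.CreateEncryptor(key, iv), CryptoStreamMode.Write))
        {
            writer(crypto);
        }
    }
}
```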
After writing all this code, you should examine the tradeoffs in space and performance. Remember that your results will almost certainly differ, and you should make your own measurements, following my example.
You can begin by examining the space savings. For this article, I created a test program that builds a collection containing every exported type in MSCORLIB and serializes it to XML.
The entire file is 81 KB. The compressed version is 11 KB. That is an impressive space savings. Be sure to perform your own measurements before you convert all your IO routines to compressed IO. The roughly 7:1 ratio is based on one set of data; your results might vary greatly.
The next test is to measure the write and read time for both uncompressed and compressed data. You might think that the compressed IO would take more CPU cycles, and therefore more time, but my testing doesn't bear that out. The uncompressed version takes 0.172 seconds to write and read the file; the compressed version takes 0.014 seconds. I believe that's because the file operations are IO bound, not CPU bound. You might see different results for memory streams or different data sets. In all performance-related work, test your assumptions before making any code changes.
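To reproduce this kind of timing comparison yourself, a sketch along these lines works (the file names and the integer payload are placeholders; your numbers will differ):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Xml.Serialization;

public static class TimingDemo
{
    public static void Main()
    {
        // Placeholder payload; the article serialized every exported type in MSCORLIB.
        List<int> data = new List<int>();
        for (int i = 0; i < 10000; i++) data.Add(i);
        XmlSerializer ser = new XmlSerializer(data.GetType());

        Stopwatch plain = Stopwatch.StartNew();
        using (FileStream file = new FileStream("plain.xml", FileMode.Create))
            ser.Serialize(file, data);
        plain.Stop();

        Stopwatch zipped = Stopwatch.StartNew();
        using (FileStream file = new FileStream("data.xml.gz", FileMode.Create))
        using (GZipStream zip = new GZipStream(file, CompressionMode.Compress))
            ser.Serialize(zip, data);
        zipped.Stop();

        Console.WriteLine("uncompressed: {0} ms, compressed: {1} ms",
            plain.ElapsedMilliseconds, zipped.ElapsedMilliseconds);
    }
}
```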
That's about it. You're now able to add compression to your serialization streams. More importantly, you can now structure your code in small, reusable methods that you can adapt to many different needs, as new requirements are discovered. These anonymous delegates don't mean you write more code. And, with a little practice, they become as easy to create as the code you write today. Finally, take a careful look at each of the methods in isolation. They are incredibly simple, and none is more than 10 lines of code.