Part I – The Background
Overview
I recently had the task of writing an External BLOB/Binary Store (EBS) implementation for Windows SharePoint Services v3.0 SP1. In this series of blog entries I want to pass on what I’ve learned during that process. I’m going to go over the information available from Microsoft, the architectural decisions that you have to make, things to consider when implementing the COM interface, and finally things to consider when implementing the orphaned binary file cleanup process. I’ll also cover some debugging tips for this process and things to consider for deployment.
Due to the sheer amount of information I’m going to break this down into four blog entries. The first is the background information (this blog entry) that explains what to expect, introduces the Microsoft documentation, and discusses the architectural decisions that need to be made as well as which ones I made for this series. The next two entries will discuss the different architectural areas in (sometimes painful) detail. The final entry will be on deployment, debugging, and any other random thoughts that haven’t found a home yet.
I’m not going to cover any reasons why you should or shouldn’t do this. I’m assuming at this point that the decision to do this has already been made by you or for you by whoever pays you. If you need some facts to support doing this then look for blogs, articles, and the like that discuss the performance degradation of storing large image files in SQL Server. If you need some facts to support not doing this then point out that this requires COM and will be addressed differently in the next version of SharePoint.
NOTE: I won’t be able to give you the full source code because my employer won’t let me. What I will be able to do is give you advice and some of the critical lines of code. It’s really not that hard to implement once you figure out the nuances, and I’ll tell you all of the nuances that I figured out.
Now sit back, relax, and get ready to experience my most recent cure for insomnia and excitement.
Microsoft’s Information
In this section I’ll outline all of the information that I have been able to find from Microsoft as well as how things work from what I’ve been able to determine. I don’t work for Microsoft (not because I haven’t wanted to, just because I haven’t been able to get hired…of course now because of a divorce I can’t move to Redmond so what’s the point) nor do I have access to inside people or resources, so I may have missed something. Take what you read here as my best effort, and one I’m satisfied with, but not as gospel. If you know something I missed feel free to leave a comment and as long as you don’t make me look like an idiot I’ll probably approve it :-).
History and Links
In May 2007, Microsoft released a hotfix for Windows SharePoint Services 3.0 (KB937901: http://support.microsoft.com/kb/937901) that exposed an external storage API (KB938499: http://support.microsoft.com/kb/938499). Subsequently this hotfix was rolled into Service Pack 1 for WSS 3.0. I tried to find just the hotfix to download, but apparently when Microsoft created the Service Pack they got rid of the hotfix download. Be prepared to install and test this if it isn’t already installed in your organization.
The implementation documentation for the external storage API can be found at http://msdn2.microsoft.com/en-us/library/bb802976.aspx. This document provides a good starting point, conceptual ideas, and advice. Its technical content, especially the IDL for the interface and the instructions for installing the component, is somewhat misleading. I’ll correct those areas as they come up in this blog series. I know, it is a pretty dry read, but, as you can tell from this blog, it’s hard to make this stuff exciting. I mean it’s a COM interface implementation after all…I guess we could beg Don Box to write about it.
I suggest that you take a few minutes and read the information on those links right now. The rest of this blog series will be written assuming that you have already read and understood that information. Don’t worry; I’ll wait for you to get back.
SharePoint Storage Architecture
Now that you’ve finished reading that information, I’m going to tell you what I was able to figure out about how things really work behind the scenes.
Out of the box WSS stores all binary content in the application’s Content database in the AllDocStreams.Content column (which is an image type). When the EBS is implemented and attached to the SharePoint Farm then only the value returned as the Binary ID is stored in this column.
The SharePoint Farm will marshal all new or updated information to the EBS. Existing data can be migrated to the EBS using one of two methods (as discussed in the Operational Limits and Trade-Off Analysis document at http://msdn2.microsoft.com/en-us/library/bb862135.aspx):
1. Perform a site-level backup and restore. During the restore SharePoint will send the BLOBs to the provider.
2. Leave the current data in the SQL Server Content database and allow all new or updated content to be stored on the disk. Eventually through the attrition of updates or deletions all database content should be purged.
The implementation of this functionality requires at a minimum two pieces: a COM component that implements the ISPExternalBinaryProvider interface and an application to clean up orphaned binary files.
The COM Component
As outlined in the Microsoft implementation documentation, a COM-compatible component must be created that provides an implementation of the ISPExternalBinaryProvider interface and its StoreBinary and RetrieveBinary methods. The communication with SharePoint through these methods involves the following:
• Partition ID: This represents the site collection ID and is a GUID. The actual parameters are a pointer and a size that must be converted to a GUID.
• Binary ID: This is how the provider tells SharePoint to reference the BLOB file and can be any value that you decide (String, GUID, Int64, etc). The provider creates this value on the StoreBinary and receives it from SharePoint on the RetrieveBinary. The parameters for this are also a pointer and a size value.
• BLOB Bytes: This parameter represents the bytes for the BLOB. The actual parameter is an ILockBytes interface and the bytes must be copied to/from the object into a local byte[].
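To make the pointer-and-size idea a little more concrete, here is a small language-neutral sketch in Python (the series itself uses C#, so treat this purely as an illustration). It assumes the bytes behind the Partition ID pointer have already been copied into a buffer and that they use the same mixed-endian layout as a .NET Guid byte array. Both the function names and that byte-order assumption are mine, not something the Microsoft documentation spells out.

```python
import uuid


def partition_id_from_bytes(raw: bytes) -> uuid.UUID:
    """Convert the bytes copied from the pointer/size pair into a GUID.

    Assumption: the buffer uses the .NET Guid byte layout, where the
    first three fields are little-endian (uuid's "bytes_le" form).
    """
    if len(raw) != 16:
        raise ValueError("expected 16 bytes for a GUID, got %d" % len(raw))
    return uuid.UUID(bytes_le=raw)


def guid_to_bytes(value: uuid.UUID) -> bytes:
    """Inverse of the above, e.g. for handing a GUID-based Binary ID back."""
    return value.bytes_le
```

A Binary ID works the same way in reverse: the provider picks a value, turns it into bytes on StoreBinary, and gets those same bytes back on RetrieveBinary.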
The Orphaned File Cleanup Application
Also outlined in the Microsoft implementation documentation is a lazy garbage collection application that runs periodically, gets the list of external storage IDs for a given site, and deletes any files in the file system whose IDs SharePoint no longer references. This is a pretty straightforward process of opening a site, getting a list of IDs, looping through the files in the file system, and deleting files not in the site’s ID list.
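The core of that loop is just a set difference. Here is a minimal sketch in Python (again, only an illustration, not my C# implementation). It assumes each file in the store directory is named after its Binary ID and that the list of referenced IDs has already been fetched from SharePoint by some other code; the function names are hypothetical.

```python
import os


def find_orphans(store_dir, referenced_ids):
    """Return the names of files in store_dir that are not referenced.

    Assumption: each file name is exactly the Binary ID that was handed
    to SharePoint when the BLOB was stored.
    """
    referenced = set(referenced_ids)
    return sorted(
        name
        for name in os.listdir(store_dir)
        if os.path.isfile(os.path.join(store_dir, name))
        and name not in referenced
    )


def delete_orphans(store_dir, referenced_ids):
    """Delete every unreferenced file and return the names removed."""
    orphans = find_orphans(store_dir, referenced_ids)
    for name in orphans:
        os.remove(os.path.join(store_dir, name))
    return orphans
```

Splitting the "find" step from the "delete" step makes it easy to do a dry run and log what would be removed before you let it loose on real data.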
Architecture Decisions
It should be pretty clear by now that you at least need a COM component and an application that can access the SharePoint object model and delete files.
At this point you need to stop and decide some things. What technology/language are you going to implement the COM object in? Are you going to follow any particular patterns? Will the logic for where/how to access the files be duplicated between the COM component and the cleanup application or will they share a component for that? How are you going to do configuration? How are you going to handle errors and exceptions?
At this point some of us may be parting ways because of different answers to these questions. The rest of this blog series is written from the viewpoint that you have decided to use C#/.NET 2.0 (therefore Visual Studio 2005) to implement this code, that the Provider Pattern will not be followed (I’ll explain this more in a minute), and that the COM component and the cleanup process will share a component to manage the binary store. Also, the standard App.Config file will be used for configuration settings and I’m going to leave the error and exception handling up to you since that is different for every organization. Based on your decisions your mileage for the usefulness of this article will vary, but feel free to stay with us.
So, for the remainder of this blog series my solution has three parts. The first is an EBS provider class assembly that implements the ISPExternalBinaryProvider interface and handles all of the COM interaction. The second is an EBS file manager class assembly that implements the storage and retrieval logic as well as some extended store management methods for the orphaned object cleanup. The final part is the EBS orphaned file remover console application that connects to SharePoint and removes all physical BLOBs that SharePoint no longer has a reference to.
I chose to isolate the COM interactions in the EBS Provider component and isolate the actual file management in a shared component. I also chose to pass the information between the two as a byte[]. You are free to implement this however you want.
The EBS provider will implement the public StoreBinary and RetrieveBinary methods and will need to provide a way to read a byte[] out of an ILockBytes, a way to write a byte[] into an ILockBytes, and a way to dereference memory given a byte pointer and a size.
The EBS file manager needs to provide a way to determine where the files should be written, a way to write a byte[] to disk, a way to read a byte[] from disk, a way to get a list of files given a provider Id, and a way to delete the orphaned files. Also, this piece will need to be configurable so that you can specify where the files should be stored.
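To show the shape of that file manager, here is a rough Python sketch (my real one is a C# class assembly, so this is only an outline of the responsibilities listed above). The class name, the one-folder-per-partition layout, and the use of a freshly generated GUID as the Binary ID are all assumptions made for the example; the root directory stands in for the value you would read from configuration.

```python
import os
import uuid


class FileStore:
    """Hypothetical disk-based store: <root>/<partition id>/<binary id>."""

    def __init__(self, root):
        # In a real deployment this would come from a configuration file.
        self.root = root

    def _partition_dir(self, partition_id):
        return os.path.join(self.root, str(partition_id))

    def store(self, partition_id, data):
        """Write the bytes to disk and return the new Binary ID."""
        binary_id = uuid.uuid4().hex
        folder = self._partition_dir(partition_id)
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, binary_id), "wb") as f:
            f.write(data)
        return binary_id

    def retrieve(self, partition_id, binary_id):
        """Read the bytes for a previously stored Binary ID."""
        path = os.path.join(self._partition_dir(partition_id), binary_id)
        with open(path, "rb") as f:
            return f.read()

    def list_ids(self, partition_id):
        """List every Binary ID currently on disk for one partition."""
        folder = self._partition_dir(partition_id)
        if not os.path.isdir(folder):
            return []
        return sorted(os.listdir(folder))

    def delete(self, partition_id, binary_id):
        """Remove one stored file, e.g. during orphan cleanup."""
        os.remove(os.path.join(self._partition_dir(partition_id), binary_id))
```

Because both the COM provider and the cleanup application go through the same class, the rules for where a file lives exist in exactly one place.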
The EBS orphaned file cleaner application needs to have a way to read a configurable list of web sites to clean. For each site it needs to get the list of external binary file Ids and delete the files that exist that are no longer referenced.
I told you that I’d explain the Provider Pattern decision (for those not familiar with the Provider Pattern, please read up on it here as it is a useful pattern for configurable software: theory @ http://msdn2.microsoft.com/en-us/library/ms972319.aspx & implementation @ http://msdn2.microsoft.com/en-us/library/ms972370.aspx). The main reason that I didn’t implement this pattern is that I didn’t and still can’t see a need for a provider other than disk-based storage. After I completed my implementation I ran across an implementation of an EBS by kaneboy (Codeplex site: http://www.codeplex.com/ebs) and noticed that he is using this pattern. If you have a requirement for this or can think of some other way you want to store this then you may want to add the provider pattern to your solution.
Summary
So far in this series we have discussed the exposure of an external storage API from Windows SharePoint Services, Microsoft’s implementation documents, what I’ve figured out about the implementation under the covers, the architectural decisions that you are faced with, and the architectural decisions that I’ve made for this blog series. The next blog will cover the information related to implementing the COM interface. The following blog entry will cover the file management component and orphaned file cleanup process.