External BLOB/Binary Store for Windows SharePoint Services 3.0 in C#/.NET 2.0 - Part II

Part II – The COM Component

Overview

In the previous post in this series we discussed the external storage API exposed by Windows SharePoint Services, Microsoft’s implementation documents, what I have been able to figure out about how the implementation works under the covers, the architectural decisions that you must make, and the architectural decisions that I’ve made for this blog series.  If you have not read it, please do so before you continue.

We are going to focus on the COM Component in this blog entry.  This is easily the most important and most difficult piece of the whole solution, which makes this the longest blog in the series…sorry.  This is also the most technical and detailed entry, so I’m going to try my best to hold back on the sarcasm, but as a result this one will be very dry (not that the previous one was much better).

For those who aren’t going to read the previous blog, all I’m going to tell you is that we are doing this as a C# .NET 2.0 solution.

Now, back to my attempted cure for insomnia and excitement…

The EBS Provider

To implement the COM component for the EBS provider you will need to create a C# class project, prepare the class project for GAC installation, create the interface files from the IDL, prepare the provider class for COM Interop, implement the interface methods, create the ILockBytes support methods, and create the memory dereferencing method.

Create the C# Class Project
I’m assuming that you know how to create a C# Class project.  If not, then you can read the Microsoft docs here: http://msdn2.microsoft.com/en-us/library/ms173077(VS.80).aspx.

GAC Settings
For this you will need to add your key file to the project and then in the project properties on the “Signing” tab check the “Sign the assembly” checkbox and select the key file in the “Choose a strong name key file:” dropdown.  For further information on this see Global Assembly Cache concepts at http://msdn2.microsoft.com/en-us/library/yf1d93sz(VS.80).aspx

To help with deployment and development you should consider setting these values on the “Build Events” tab:
Pre-build event command line:
"$(DevEnvDir)..\..\SDK\v2.0\bin\gacutil" /u "$(TargetName)"

Post-build event command line:
"$(DevEnvDir)..\..\SDK\v2.0\bin\gacutil" /i "$(TargetPath)"

Run the post-build event:
On successful build

This will remove the assembly from the GAC before the build and add the newly built assembly to the GAC after a successful build.

Interface Implementation
There are two interfaces that must be implemented for this component.  They are the ISPExternalBinaryProvider and the ILockBytes interfaces.

Here is the IDL for the ISPExternalBinaryProvider as provided in the Microsoft implementation documentation (http://msdn2.microsoft.com/en-us/library/bb802811.aspx):

/*************************************************
    File: extstore.idl
    Copyright (c): 2006 Microsoft Corp.
*************************************************/
import "objidl.idl";

[
    uuid(39082a0c-af6e-4ac2-b6f0-1a1ff6abbae1)
]

library SharePointBinaryStore
{
    [
        local,
        object,
        uuid(48036587-c8bc-4aa0-8694-5a7000b4ba4f),
        helpstring("ISPExternalBinaryProvider interface")
    ]
    interface ISPExternalBinaryProvider : IUnknown
    {
        HRESULT StoreBinary(
            [in] unsigned long cbPartitionId,
            [in, size_is(cbPartitionId)] const byte* pbPartitionId,
            [in] ILockBytes* pilb,
            [out] unsigned long* pcbBinaryId,
            [out, size_is(, *pcbBinaryId)] byte** ppbBinaryId,
            [out,optional] VARIANT_BOOL* pfAccepted);

        HRESULT RetrieveBinary(
            [in] unsigned long cbPartitionId,
            [in, size_is(cbPartitionId)] const byte* pbPartitionId,
            [in] unsigned long cbBinaryId,
            [in, size_is(cbBinaryId)] const byte* pbBinaryId,
            [out] ILockBytes** ppilb);
    }
}

 

For my implementation I took this IDL and ran it through the MIDL compiler (midl.exe) to get a type library and then through the Type Library Importer (tlbimp.exe) to get an assembly.  Using the IDL file that way created a bunch of gross looking code that was a pain to work with. Through some trial and error I came up with the following interface representations for both ISPExternalBinaryProvider and ILockBytes that work in a .NET implementation.  I think these are much cleaner and easier to work with.  By the way, each of these was in its own .cs file without any namespace information.

[ComImport, ComConversionLoss, Guid("48036587-C8BC-4AA0-8694-5A7000B4BA4F"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface ISPExternalBinaryProvider
{
    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void StoreBinary([In] uint cbPartitionId, 
                     [In] ref byte pbPartitionId, 
                     [In, MarshalAs(UnmanagedType.Interface)] ILockBytes pilb, 
                     out uint pcbBinaryId, 
                     out IntPtr ppbBinaryId, 
                     [Optional] out bool pfAccepted);

    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void RetrieveBinary([In] uint cbPartitionId, 
                        [In] ref byte pbPartitionId, 
                        [In] uint cbBinaryId, 
                        [In] ref byte pbBinaryId, 
                        [MarshalAs(UnmanagedType.Interface)] out ILockBytes ppilb);
}

[ComImport, Guid("0000000A-0000-0000-C000-000000000046"), InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface ILockBytes
{
    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void ReadAt([In] UInt64 ulOffset, 
                [In] IntPtr pv, 
                [In] uint cb, 
                [Out] out uint pcbRead);

    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void WriteAt([In] UInt64 ulOffset, 
                 [In] IntPtr pv, 
                 [In] uint cb, 
                 [Out] out uint pcbWritten);

    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void Flush();

    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void SetSize([In] UInt64 cb);

    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void LockRegion([In] UInt64 libOffset, 
                    [In] UInt64 cb, 
                    [In] uint dwLockType);

    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void UnlockRegion([In] UInt64 libOffset, 
                      [In] UInt64 cb, 
                      [In] uint dwLockType);

    [MethodImpl(MethodImplOptions.InternalCall, 
        MethodCodeType=MethodCodeType.Runtime)]
    void Stat([Out, MarshalAs(UnmanagedType.Struct)] out STATSTG pstatstg, 
              [In] int grfStatFlag);
}

 

One thing that I want to point out now is the difference in the method declarations for StoreBinary and RetrieveBinary in the interfaces above versus the IDL and Microsoft’s documentation.  The documentation says that the methods must return an HRESULT of S_OK or E_FAIL, but the methods in the interface are declared as void.  The reason for this is that when I tried to implement them to return the HRESULT, the process failed miserably and caused SharePoint to hang.  When I changed them to void and stopped returning values, everything worked well.
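
A likely explanation, based on how .NET COM interop generally works rather than on anything SharePoint-specific: the runtime already treats every method on a COM interface as returning an HRESULT.  A void method reports S_OK when it returns normally and a failure HRESULT when it throws.  Declaring a return type of int without [PreserveSig] does not expose the HRESULT; the runtime instead marshals that int as an extra [out, retval] parameter, so the native caller sees a different signature than the one in the IDL.  The sketch below shows both forms on a hypothetical interface (IExample and its GUID are placeholders, not part of the EBS API).

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical interface for illustration only; it is not part of the EBS API
// and the GUID is a placeholder.
[ComImport, Guid("00000000-0000-0000-0000-000000000001"),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IExample
{
    // Native view: HRESULT DoWork([in] unsigned long value);
    // The runtime supplies the HRESULT: S_OK on normal return,
    // a failure code derived from the exception if the method throws.
    void DoWork([In] uint value);

    // Native view: HRESULT DoWorkRaw([in] unsigned long value);
    // [PreserveSig] tells the runtime that the int return value IS the HRESULT.
    [PreserveSig]
    int DoWorkRaw([In] uint value);
}
```

Under this reading, the void declarations in the interfaces above already satisfy the "must return an HRESULT" requirement, and throwing from the method is the way to signal failure.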

COM Interop Preparation
In the project properties you will need to check the “Register for COM Interop” checkbox on the “Build” tab.

The provider class must implement the ISPExternalBinaryProvider interface and have a public default constructor, even if it is empty.  Also, you will need to set these class attributes:

•    [ProgId("[Object Name].[Class Name]")]
This is the ProgID for the COM class.  Details on the ProgIdAttribute class can be found at http://msdn2.microsoft.com/en-us/library/system.runtime.interopservices.progidattribute(VS.80).aspx.

•    [Guid("00000000-0000-0000-0000-000000000000")]
This is a new GUID generated using the GuidGen.exe tool shipped with Visual Studio. Details on the GuidAttribute class can be found at http://msdn2.microsoft.com/en-us/library/system.runtime.interopservices.guidattribute(VS.80).aspx.

•    [ClassInterface(ClassInterfaceType.None)]
Defines the class interface type; for this project I used the value above.  Details on the ClassInterfaceAttribute class can be found at http://msdn2.microsoft.com/en-us/library/system.runtime.interopservices.classinterfaceattribute(VS.80).aspx.
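
Put together, the class declaration might look roughly like the sketch below.  The namespace, class name, and ProgID are hypothetical, the GUID is the same placeholder used above and must be replaced with one you generate, and the interface is left as a comment so the fragment stands on its own.  The [ComVisible(true)] attribute is an assumption on my part: the default Visual Studio 2005 AssemblyInfo.cs marks the whole assembly ComVisible(false), so the class may need to opt back in.

```csharp
using System;
using System.Runtime.InteropServices;

namespace Contoso.Ebs   // hypothetical namespace
{
    [ComVisible(true)]                              // assumption: assembly default is ComVisible(false)
    [ProgId("Contoso.Ebs.EbsProvider")]             // hypothetical ProgID
    [Guid("00000000-0000-0000-0000-000000000000")]  // placeholder; generate your own with GuidGen.exe
    [ClassInterface(ClassInterfaceType.None)]       // expose only the explicitly implemented interface
    public class EbsProvider // : ISPExternalBinaryProvider in the real component
    {
        // COM activation needs a public parameterless constructor, even an empty one.
        public EbsProvider()
        {
        }
    }
}
```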

 

Implement the Interface Methods
You must explicitly implement the interface members.  The easiest way to do this is let Visual Studio do this for you.  If you hover your mouse over the ISPExternalBinaryProvider after the : in the class definition you will notice a little blue rectangle/line under the “I”.  If you click on that or press “Shift+Alt+F10” you will get a couple of options including “Explicitly implement interface ‘ISPExternalBinaryProvider’”.  When it has finished it should look like this: 

void ISPExternalBinaryProvider.RetrieveBinary(… 

 

Notice that there is no access modifier (public, private, internal, etc.) on the method definitions.  Explicit interface implementations don’t allow one, so that’s how they should be; don’t change it.

In the RetrieveBinary method you will basically need to dereference the partition Id and the binary Id, retrieve the byte[] from the EBS file manager, create an ILockBytes object, and set the ppilb output parameter to that object.  I recommend doing all of this in a try/catch block; because of this you will need to explicitly set ppilb to null at the top of the method to get the component to compile.

In the StoreBinary method you will need to dereference the partition Id, read the byte[] out of the ILockBytes, and write the byte[] to a file associated with the partition Id, which should create a new binary Id for you.  You then dereference the binary Id into pcbBinaryId for the size and ppbBinaryId for the first byte of the binary Id, and finally set pfAccepted to true if you were able to write the file or false if SharePoint should take care of writing the file.  Again, this should be done in a try/catch block, and the three output parameters should be set to default values at the top of the method.  Something to remember is that before you can dereference the binary Id going back to SharePoint you should first convert it to a byte[].
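
The control flow described for StoreBinary can be sketched without any COM types, which makes the pattern of "defaults first, then try/catch" easier to see.  Everything named here is hypothetical: the WriteBlob delegate stands in for the EBS file manager, Store stands in for the body of StoreBinary after the ILockBytes has already been read into a byte[], and encoding the Id as UTF-8 is an assumption (the Id can be any value you choose).

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical callback standing in for the EBS file manager:
// it stores the content and returns the new binary Id.
public delegate string WriteBlob(Guid partitionId, byte[] content);

public static class StoreBinarySketch
{
    public static void Store(byte[] partitionIdBytes, byte[] content, WriteBlob write,
                             out uint pcbBinaryId, out IntPtr ppbBinaryId, out bool pfAccepted)
    {
        // Assign defaults first so the out parameters are set on every path.
        pcbBinaryId = 0;
        ppbBinaryId = IntPtr.Zero;
        pfAccepted = false;          // false: SharePoint should store the BLOB itself

        IntPtr mem = IntPtr.Zero;
        try
        {
            Guid partitionId = new Guid(partitionIdBytes);
            string binaryId = write(partitionId, content);

            // Assumption: the Id travels back to SharePoint as UTF-8 bytes.
            byte[] idBytes = Encoding.UTF8.GetBytes(binaryId);
            mem = Marshal.AllocHGlobal(idBytes.Length);
            Marshal.Copy(idBytes, 0, mem, idBytes.Length);

            pcbBinaryId = (uint)idBytes.Length;
            ppbBinaryId = mem;
            pfAccepted = true;
        }
        catch (Exception)
        {
            // Release the Id buffer if it was allocated, then restore the defaults.
            if (mem != IntPtr.Zero) { Marshal.FreeHGlobal(mem); }
            pcbBinaryId = 0;
            ppbBinaryId = IntPtr.Zero;
            pfAccepted = false;
            // Log here according to your organization's policy.
        }
    }
}
```

Swallowing the exception and reporting pfAccepted as false is one possible policy; rethrowing so that the caller sees a failure HRESULT is another, and which one fits depends on your error handling decisions.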

In the following two sections I’ll discuss how to read the byte[] from and write the byte[] to an ILockBytes object and how to dereference the memory for the Id pointers.

ILockBytes Support
The ILockBytes interface supports 3 methods that we use in this process: Stat, ReadAt, and WriteAt.  We don’t need to use any of the other exposed methods.

To read a byte[] from an ILockBytes interface you need to create a memory buffer using Marshal.AllocHGlobal([buffer size]) (I used 8192 for the buffer size), call the Stat method to get the number of bytes in the ILockBytes, create a byte[] of that size, and loop, reading the bytes from the ILockBytes using ReadAt and copying them into the resulting byte[] (shown below).  Don’t forget to free the memory using Marshal.FreeHGlobal([buffer variable]).  So...I wanted to give you the whole method for reading the bytes from an ILockBytes, but that request was denied by my employer (I know, I don't know why either).  Anyway, I found a way to do this in the Memory generation of Excel files.

do
{
    lockBytes.ReadAt(offset, buf, (UInt32)8192, out bytesRead);
    if (bytesRead > 0)
    {
        Marshal.Copy(buf, bytes, (Int32)offset, (Int32)bytesRead);
        offset += bytesRead;
    }
} while (bytesRead > 0);

 

To write a byte[] to an ILockBytes interface you will need to reference an external method in the OLE32.dll called CreateILockBytesOnHGlobal.  Here is the code for that declaration:

[DllImport("ole32.dll")]
static extern int CreateILockBytesOnHGlobal(IntPtr hGlobal,
                                            bool fDeleteOnRelease,
                                            out ILockBytes ppLockbytes);
 

Now that you have the external definition you can continue with writing a byte[] to an ILockBytes. To write the byte[] you need to declare the resulting ILockBytes object, get the size of the byte[], call CreateILockBytesOnHGlobal, allocate a buffer, and loop, writing the bytes into the ILockBytes using the WriteAt method (shown below).  Again, I was hoping to be able to give you the entire method, but that request was denied.  For this one, basically do the opposite of what you've done for reading from the ILockBytes.

while (byteSize > 0)
{
    // Copy at most one buffer's worth (8192 bytes) into the unmanaged buffer.
    bytesRead = (byteSize > 8192 ? 8192 : (Int32)byteSize);
    Marshal.Copy(bytes, (Int32)offset, buf, bytesRead);
    lockBytes.WriteAt(offset, buf, (UInt32)bytesRead, out bytesWritten);
    if (bytesWritten == 0)
    {
        throw new ApplicationException("Unable to write to contents");
    }
    // Advance by what was actually written so a short write is retried, not skipped.
    offset += bytesWritten;
    byteSize -= bytesWritten;
}

 

That pretty much covers reading a byte[] from and writing a byte[] to an ILockBytes interface.
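
For readers who want to see the two loops in context, here is one way the complete helpers could be assembled from the steps described in this section.  This is a sketch, not the author's withheld code: LockBytesHelper, ReadAll, and FromBytes are hypothetical names, the ILockBytes declaration is a compact restatement of the interface above (omitting the MethodImpl attributes, on the assumption that a hand-written ComImport interface does not need them), and passing IntPtr.Zero to CreateILockBytesOnHGlobal asks OLE to allocate the backing memory itself.

```csharp
using System;
using System.Runtime.InteropServices;
using STATSTG = System.Runtime.InteropServices.ComTypes.STATSTG;

// Compact restatement of the ILockBytes interface shown earlier.
[ComImport, Guid("0000000A-0000-0000-C000-000000000046"),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface ILockBytes
{
    void ReadAt(ulong ulOffset, IntPtr pv, uint cb, out uint pcbRead);
    void WriteAt(ulong ulOffset, IntPtr pv, uint cb, out uint pcbWritten);
    void Flush();
    void SetSize(ulong cb);
    void LockRegion(ulong libOffset, ulong cb, uint dwLockType);
    void UnlockRegion(ulong libOffset, ulong cb, uint dwLockType);
    void Stat(out STATSTG pstatstg, int grfStatFlag);
}

public static class LockBytesHelper
{
    private const int BufferSize = 8192;
    private const int STATFLAG_NONAME = 1;   // do not return the name string in STATSTG

    [DllImport("ole32.dll")]
    private static extern int CreateILockBytesOnHGlobal(
        IntPtr hGlobal, bool fDeleteOnRelease, out ILockBytes ppLockbytes);

    // Copies the full contents of an ILockBytes into a managed array.
    public static byte[] ReadAll(ILockBytes lockBytes)
    {
        STATSTG stat;
        lockBytes.Stat(out stat, STATFLAG_NONAME);
        byte[] bytes = new byte[stat.cbSize];
        IntPtr buf = Marshal.AllocHGlobal(BufferSize);
        try
        {
            ulong offset = 0;
            while (offset < (ulong)bytes.Length)
            {
                uint want = (uint)Math.Min((ulong)BufferSize, (ulong)bytes.Length - offset);
                uint bytesRead;
                lockBytes.ReadAt(offset, buf, want, out bytesRead);
                if (bytesRead == 0)
                {
                    throw new ApplicationException("Unexpected end of ILockBytes contents");
                }
                Marshal.Copy(buf, bytes, (int)offset, (int)bytesRead);
                offset += bytesRead;
            }
        }
        finally
        {
            Marshal.FreeHGlobal(buf);   // always release the transfer buffer
        }
        return bytes;
    }

    // Creates a memory-backed ILockBytes and fills it with the given bytes.
    public static ILockBytes FromBytes(byte[] bytes)
    {
        ILockBytes lockBytes;
        // IntPtr.Zero: let OLE allocate the HGLOBAL; true: free it on final release.
        Marshal.ThrowExceptionForHR(CreateILockBytesOnHGlobal(IntPtr.Zero, true, out lockBytes));
        IntPtr buf = Marshal.AllocHGlobal(BufferSize);
        try
        {
            ulong offset = 0;
            while (offset < (ulong)bytes.Length)
            {
                int chunk = (int)Math.Min((ulong)BufferSize, (ulong)bytes.Length - offset);
                Marshal.Copy(bytes, (int)offset, buf, chunk);
                uint bytesWritten;
                lockBytes.WriteAt(offset, buf, (uint)chunk, out bytesWritten);
                if (bytesWritten == 0)
                {
                    throw new ApplicationException("Unable to write contents");
                }
                offset += bytesWritten;   // a short write is retried from here
            }
        }
        finally
        {
            Marshal.FreeHGlobal(buf);
        }
        return lockBytes;
    }
}
```

The try/finally blocks are a design choice so the transfer buffer is released even when a read or write throws.  The helpers depend on ole32.dll, so they only run on Windows.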

Memory Dereferencing
The last thing that you need to do in this component is to have some way to dereference the pointers coming in from SharePoint and going out to SharePoint.  The only way to achieve this that I could find is to use unsafe code and in order to do that you will need to edit the project properties and select the “Allow unsafe code” in the “General” section on the “Build” tab.

To dereference the pointer coming in from SharePoint you will need to create a byte[] buffer to the size indicated, create a byte* equal to the incoming byte from SharePoint using the fixed keyword, and then perform a memory copy (shown below).  I chose to make this a method since it is needed multiple times and since it is a method using the fixed keyword it has to be flagged as unsafe. 

fixed (byte* refBytes = &bytes)
{
     Marshal.Copy(new IntPtr(refBytes), buffer, 0, (Int32)size); 
} 
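
Wrapped up as a method, the snippet above might look like the following sketch.  PointerHelper and CopyBytes are hypothetical names; the parameter mirrors the "ref byte" plus size pair from the interface declaration, and the code has to be compiled with unsafe code allowed.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PointerHelper
{
    // 'first' is the first byte of the native buffer, exactly as the interface
    // hands it over (by reference); 'size' is the byte count that came with it.
    public static unsafe byte[] CopyBytes(ref byte first, uint size)
    {
        byte[] buffer = new byte[size];
        if (size == 0)
        {
            return buffer;   // nothing to copy
        }

        // Pin the reference so its address is stable while we copy from it.
        fixed (byte* refBytes = &first)
        {
            Marshal.Copy(new IntPtr(refBytes), buffer, 0, (int)size);
        }
        return buffer;
    }
}
```

A call from inside the provider would then read along the lines of byte[] partitionBytes = PointerHelper.CopyBytes(ref pbPartitionId, cbPartitionId);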

 

To dereference the newly created binary Id going back to SharePoint you need to convert the Id to a byte[],  set the pcbBinaryId to the number of bytes in the Id, allocate the memory on the heap, and then copy the bytes into the allocated memory using the Marshal.Copy method.  Since this is needed only once I left this code in the StoreBinary method.  Here is the snippet:

pcbBinaryId = (UInt32)binaryIdBytes.Length;
ppbBinaryId = Marshal.AllocHGlobal((Int32)pcbBinaryId); 
Marshal.Copy(binaryIdBytes, 0, ppbBinaryId, (Int32)pcbBinaryId); 

 

You will notice that we are allocating memory on the heap without releasing it.  If we released it then SharePoint wouldn’t be able to get our Id back out.  This is pretty much undocumented in its entirety, so I’m hoping that SharePoint frees this memory when it is finished with it; otherwise we will have a memory leak here.  Likewise, it isn’t clear who’s supposed to free the memory for the incoming partition Id, so we may have a memory leak there. (Believe me, I’ll rant about this and many other things in the Final Thoughts section of Part IV).
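
Some general COM background may help here, although nothing in the documentation quoted in this series confirms that SharePoint follows it.  The usual COM convention is that an [out] buffer is allocated by the callee with CoTaskMemAlloc and freed by the caller with CoTaskMemFree, while [in] buffers remain the caller's responsibility.  If SharePoint follows that convention, the incoming partition Id is not the provider's to free, and the outgoing Id would be allocated with Marshal.AllocCoTaskMem rather than Marshal.AllocHGlobal.  A minimal sketch under that assumption (BinaryIdBuffer is a hypothetical helper name):

```csharp
using System;
using System.Runtime.InteropServices;

public static class BinaryIdBuffer
{
    // Copies the Id bytes into memory from the COM task allocator.
    // Assumption: the caller (SharePoint) releases it with CoTaskMemFree.
    public static IntPtr ToCoTaskMem(byte[] binaryIdBytes, out uint size)
    {
        size = (uint)binaryIdBytes.Length;
        IntPtr memory = Marshal.AllocCoTaskMem(binaryIdBytes.Length);
        Marshal.Copy(binaryIdBytes, 0, memory, binaryIdBytes.Length);
        return memory;
    }
}
```

Inside StoreBinary this would replace the three lines above with ppbBinaryId = BinaryIdBuffer.ToCoTaskMem(binaryIdBytes, out pcbBinaryId);  Treat it as something to test against your own farm, not as a confirmed fix.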

Summary

In this entry we covered all of the technical details for creating the COM Component.  You now know how to set up the COM Interop project, implement the required interface, get information into and out of the ILockBytes interface, and dereference the memory for the values being passed back and forth with SharePoint.

In the next blog I’ll cover the details for implementing a file manager and an orphan file cleanup process.

External BLOB/Binary Store for Windows SharePoint Services 3.0 in C#/.NET 2.0 - Part I

Part I – The Background

Overview

I recently had the task of writing an External BLOB/Binary Store (EBS) implementation for Windows SharePoint Services v3.0 SP1.  In this series of blog entries I want to pass on what I’ve learned during that process.  I’m going to go over the information available from Microsoft, the architectural decisions that you have to make, things to consider when implementing the COM interface, and finally things to consider when implementing the orphaned binary file cleanup process.  I’ll also cover some debugging tips for this process and things to consider for deployment. 

Due to the sheer amount of information I’m going to break this down into four blog entries.  The first is the background information (this blog entry) that explains what to expect, introduces the Microsoft documentation, and discusses the architectural decisions that need to be made as well as which ones I made for this series.  The next two blogs will discuss the different architectural areas in (sometimes painful) detail.  The final blog will be on deployment, debugging, and any other random thoughts that haven’t found a home yet.

I’m not going to cover any reasons why you should or shouldn’t do this.  I’m assuming at this point that the decision to do this has already been made by you or for you by whoever pays you.  If you need some facts to support doing this then look for blogs, articles, etc. that discuss the performance degradation of storing large image files in SQL Server.  If you need some facts to support not doing this then point out that this requires COM and will be addressed differently in the next version of SharePoint.

NOTE: I won’t be able to give you the full source code because my employer won’t let me.  What I will be able to do is give you advice and some of the critical lines of code.  It’s really not that hard to implement once you figure out the nuances, and I’ll tell you all of the nuances that I figured out.

Now sit back, relax, and get ready to experience my most recent cure for insomnia and excitement. 

Microsoft’s Information

In this section I’ll outline all of the information that I have been able to find from Microsoft as well as how things work from what I’ve been able to determine.  I don’t work for Microsoft (not because I haven’t wanted to, just because I haven’t been able to get hired…of course now because of a divorce I can’t move to Redmond so what’s the point) nor do I have access to inside people or resources, so I may have missed something.  Take what you read here as my best effort, and one whose result I’m satisfied with, but not as gospel.  If you know something I missed feel free to leave a comment and as long as you don’t make me look like an idiot I’ll probably approve it :-).

History and Links
In May 2007, Microsoft released a hotfix for Windows SharePoint Services 3.0 (KB937901: http://support.microsoft.com/kb/937901) that exposed an external storage API (KB938499: http://support.microsoft.com/kb/938499).  Subsequently this hotfix was rolled into Service Pack 1 for WSS 3.0.  I tried to find just the hotfix to download, but apparently when Microsoft created the Service Pack they got rid of the hotfix download.  Be prepared to install and test this if it isn’t already installed in your organization.

The implementation documentation for the external storage API can be found at http://msdn2.microsoft.com/en-us/library/bb802976.aspx. This document provides a good starting point, conceptual ideas, and advice.  Its technical content, especially in the areas of the IDL for the interface and installing the component, is somewhat misleading.  I will correct those areas over the course of this blog series.  I know, it is a pretty dry read, but, as you can tell from this blog, it’s hard to make this stuff exciting.  I mean it’s a COM interface implementation after all…I guess we could beg Don Box to write about it.

I suggest that you take a few minutes and read the information on those links right now.  The rest of this blog series will be written assuming that you have already read and understood that information.  Don’t worry; I’ll wait for you to get back.

SharePoint Storage Architecture
Now that you’ve finished reading that information, I’m going to tell you what I was able to figure out about how things are really working behind the scenes.

Out of the box WSS stores all binary content in the application’s Content database in the AllDocStreams.Content column (which is an image type).  When the EBS is implemented and attached to the SharePoint Farm then only the value returned as the Binary ID is stored in this column.  

The SharePoint Farm will marshal all new or updated information to the EBS.  Existing data can be migrated to the EBS in one of two ways (as discussed in the Operational Limits and Trade-Off Analysis document at http://msdn2.microsoft.com/en-us/library/bb862135.aspx):
1.    Perform a site level backup and restore.  During the restore SharePoint will send the BLOBs to the provider.
2.    Leave the current data in the SQL Server Content database and allow all new or updated content to be stored on the disk.  Eventually through the attrition of updates or deletions all database content should be purged.

The implementation of this functionality requires at a minimum two pieces: a COM component that implements the ISPExternalBinaryProvider interface and an application to clean up orphaned binary files. 

The COM Component
As outlined in the Microsoft implementation documentation, a COM compatible component must be created that provides an implementation of the ISPExternalBinaryProvider interface and its StoreBinary and RetrieveBinary methods.  The communication with SharePoint through these methods involves the following:
•    Partition ID: This represents the site collection ID and is a GUID.  The actual parameters are a pointer and a size that must be converted to a GUID.
•    Binary ID: This is how the provider tells SharePoint to reference the BLOB file and can be any value that you decide (String, GUID, Int64, etc).  The provider creates this value on the StoreBinary and receives it from SharePoint on the RetrieveBinary. The parameters for this are also a pointer and a size value.
•    BLOB Bytes: This parameter represents the bytes for the BLOB.  The actual parameter is an ILockBytes interface and the bytes must be copied to/from the object into a local byte[].
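
As a small illustration of the Partition ID item: once the pointer and size have been copied into a byte[], turning it into a GUID is a one-liner.  The sketch below assumes the 16 bytes arrive in the same layout that Guid.ToByteArray produces (the native GUID struct layout); PartitionId and FromBytes are hypothetical names.

```csharp
using System;

public static class PartitionId
{
    // Converts the raw partition Id bytes into a Guid.
    // Assumption: the bytes use the native GUID layout, which is also
    // what Guid.ToByteArray() produces.
    public static Guid FromBytes(byte[] idBytes)
    {
        if (idBytes == null || idBytes.Length != 16)
        {
            throw new ArgumentException("A partition Id is expected to be 16 bytes.", "idBytes");
        }
        return new Guid(idBytes);
    }
}
```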

The Orphaned File Cleanup Application
Also outlined in the Microsoft implementation documentation is a lazy garbage collection application that runs periodically, gets the list of external storage Ids for a given site, and deletes any files in the file system that aren’t among the Ids referenced by SharePoint.  This is a pretty straightforward process of opening a site, getting a list of Ids, looping through the files in the file system, and deleting files not in the site’s Id list.
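
The file system half of that process can be sketched as below.  How the Ids are read from SharePoint is left out; the sketch simply takes them as a collection of strings.  It also assumes one file per BLOB, named after its binary Id, sitting directly in the store directory.  OrphanCleaner and DeleteOrphans are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OrphanCleaner
{
    // Deletes every file in storeDirectory whose name is not a referenced Id.
    // Returns the number of files deleted.
    public static int DeleteOrphans(string storeDirectory, IEnumerable<string> referencedIds)
    {
        // Dictionary used as a set so the sketch stays within .NET 2.0.
        Dictionary<string, bool> referenced =
            new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase);
        foreach (string id in referencedIds)
        {
            referenced[id] = true;
        }

        int deleted = 0;
        foreach (string path in Directory.GetFiles(storeDirectory))
        {
            string id = Path.GetFileName(path);   // assumption: file name == binary Id
            if (!referenced.ContainsKey(id))
            {
                File.Delete(path);
                deleted++;
            }
        }
        return deleted;
    }
}
```

One thing to consider before running something like this in production: a file that was stored moments ago may not yet appear in the Id list you fetched, so skipping files newer than some age threshold is probably a sensible safeguard.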

Architecture Decisions

It should be pretty clear by now that you at least need a COM component and an application that can access the SharePoint object model and delete files.

At this point you need to stop and decide some things. What technology/language are you going to implement the COM object in?  Are you going to follow any particular patterns? Will the logic for where/how to access the files be duplicated between the COM component and the cleanup application or will they share a component for that? How are you going to do configuration?  How are you going to handle errors and exceptions?

At this point some of us may be parting ways because of different answers to these questions.  The rest of this blog series is written from the viewpoint that you have decided to use C#/.NET 2.0 (therefore Visual Studio 2005) to implement this code, that the Provider Pattern will not be followed (I’ll explain this more in a minute), and that the COM component and the cleanup process will share a component to manage the binary store.  Also, the standard App.Config file will be used for configuration settings and I’m going to leave the error and exception handling up to you since that is different for every organization.  Based on your decisions your mileage for the usefulness of this article will vary, but feel free to stay with us.

So, for the remainder of this blog series my solution has three parts.  The first is an EBS provider class assembly that implements the ISPExternalBinaryProvider interface and handles all of the COM interaction.  The second is an EBS file manager class assembly that implements the storage and retrieval logic as well as some extended store management methods for the orphaned object cleanup.  The final part is the EBS orphaned file remover console application that connects to SharePoint and removes all physical BLOBs that SharePoint no longer has a reference to.

I chose to isolate the COM interactions in the EBS Provider component and isolate the actual file management in a shared component.  I also chose to pass the information between the two as a byte[].  You are free to implement this however you want.

The EBS provider will implement the public StoreBinary and RetrieveBinary methods and will need to provide a way to read a byte[] out of an ILockBytes, a way to write a byte[] into an ILockBytes, and a way to dereference memory given a byte pointer and a size.

The EBS file manager needs to provide a way to determine where the files should be written, a way to write a byte[] to disk, a way to read a byte[] from disk, a way to get a list of files given a provider Id, and a way to delete the orphaned files.  Also, this piece will need to be configurable so that you can specify where the files should be stored.
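
To make those responsibilities concrete, here is a rough sketch of what such a file manager could look like.  It is not the implementation covered later in the series: EbsFileStore and its members are hypothetical, the layout (one directory per partition, one file per BLOB, a new GUID as the binary Id) is an assumption, and the root path is passed in so that it can come from App.Config or anywhere else.

```csharp
using System;
using System.IO;

public class EbsFileStore
{
    private readonly string root;

    // 'root' is the configurable base directory for all stored BLOBs.
    public EbsFileStore(string root)
    {
        this.root = root;
    }

    // Assumption: one subdirectory per partition (site collection).
    private string PartitionDirectory(Guid partitionId)
    {
        return Path.Combine(root, partitionId.ToString("D"));
    }

    // Writes the content and returns the newly created binary Id.
    public string Write(Guid partitionId, byte[] content)
    {
        string directory = PartitionDirectory(partitionId);
        Directory.CreateDirectory(directory);            // no-op if it already exists
        string binaryId = Guid.NewGuid().ToString("N");
        File.WriteAllBytes(Path.Combine(directory, binaryId), content);
        return binaryId;
    }

    public byte[] Read(Guid partitionId, string binaryId)
    {
        return File.ReadAllBytes(Path.Combine(PartitionDirectory(partitionId), binaryId));
    }

    // Lists the binary Ids (file names) currently stored for a partition.
    public string[] ListIds(Guid partitionId)
    {
        string directory = PartitionDirectory(partitionId);
        if (!Directory.Exists(directory))
        {
            return new string[0];
        }
        string[] paths = Directory.GetFiles(directory);
        for (int i = 0; i < paths.Length; i++)
        {
            paths[i] = Path.GetFileName(paths[i]);
        }
        return paths;
    }

    public void Delete(Guid partitionId, string binaryId)
    {
        File.Delete(Path.Combine(PartitionDirectory(partitionId), binaryId));
    }
}
```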

The EBS orphaned file cleaner application needs to have a way to read a configurable list of web sites to clean.  For each site it needs to get the list of external binary file Ids and delete the files that exist that are no longer referenced.

I told you that I’d explain the Provider Pattern decision (for those not familiar with the Provider Pattern, please read up on it here as it is a useful pattern for configurable software: theory @ http://msdn2.microsoft.com/en-us/library/ms972319.aspx & implementation @ http://msdn2.microsoft.com/en-us/library/ms972370.aspx).  The main reason that I didn’t implement this pattern is that I didn’t and still can’t see a need for a provider other than disk based storage.  After I completed my implementation I ran across an implementation of an EBS by kaneboy (Codeplex site: http://www.codeplex.com/ebs) and noticed that he is using this pattern.  If you have a requirement for this or can think of some other way you want to store this then you may want to add the provider pattern to your solution.

Summary

So far in this series we have discussed the external storage API exposed by Windows SharePoint Services, Microsoft’s implementation documents, what I’ve figured out about how the implementation works under the covers, the architectural decisions that you are faced with, and the architectural decisions that I’ve made for this blog series.  The next blog will cover the information related to implementing the COM interface.  The following blog entry will cover the file management component and orphaned file cleanup process.