How to html-encode special characters in T-SQL

This is a supporting post for a rather lengthy article about How to transform any query to Html (scheduled for 2014-12-16).

The case for generating html output directly in T-Sql

It is a widely discussed idea in the community that all server side data handling with T-SQL should not consider issues of data presentation – i.e. how it should be rendered on clients. I agree mostly, but there is at least one exception to this rule – when using database mail to send query results to clients.

When talking to people other than DBAs you need to transport the message contained in your data. Marketing or sales people expect formatted output and the only way to process query results for database mail is to transform them to html.

The problem of special characters

Of course, when transforming row content as string values to html you need to transform any contained special characters as well – a simple example being the frequent Umlauts in German or accented letters in French. 

Unfortunately there is no inbuilt function to html-encode special characters in T-SQL. There are several ways to address the problem. One way would be to create a clr-integrated assembly employing .NET-framework functions – potentially an elegant solution since the transform could be handled with a simple call to HtmlEncode in System.Web – but the System.Web namespace cannot be integrated in clr-assemblies. 

So I decided to work on a semi-complete solution which is written natively in T-SQL and should do for most practical use cases (at least for mine).

1. Creating a table for Html Entities

My solution employs a table to provide replacements for common characters. The first step is to create this table, using a stored procedure. For this I adapted a somewhat complete list of common special characters from Wikipedia.

 The full-blown code can be downloaded at the bottom of this article.

What I’m essentially doing here is creating a lookup table containing different representations of special characters:

  • Column charValue shows the special character itself.
  • Column htmlEntity contains the replacement to be used in html output
  • Column UnicodeHex shows the Unicode Hex Value as string
  • Column UnicodeDec shows the Unicode Hex Value in decimal form
  • Column CodePointHex shows the Unicode Codepoint in Hex

The initial Inserts after table creation only takes in the columns htmlEntity, UnicodeHex and description as literal values. At the end I calculate UnicodeDecimal and derive the charValue from it – avoiding character misrepresentations due to pasting literals into the SSMS code editor.

Before I forget: this Stored Procedure needs the HexStrToVarBinary-Function provided by Micheal Letterle in his post T-SQL Hex string to VarBinary improved. I include it as courtesy in the download below.

2. A Function to html-encode special characters

The second step is to create a function to simplify the html-encoding itself.

 A first try with this function was not as satisfying as desired. I had to play around with the ordering of html entities in the replacement table to achieve a proper encoding in the end. And to encode upper and lower case variants correctly it is important to specify the collation as case sensitive with COLLATE Latin1_General_CS_AS.

 3. Usage and testing for correctness 

To ensure that the encoding is working as expected I wrote a short test. The testing strategy is to html-encode all special characters in a single string and compare them to a string generated directly from their html counterparts.

 First I execute the stored procedure to update table htmlEntities. Then I create a string containing all special characters in the table – converted on the fly with NCHAR(UnicodeDec) to avoid mis-representations. Step 3 is to build the expected conversion result – a string concatenated from all corresponding html entities . Then the specialChars are html-encoded, compared to the expected result and a success- or failure-message is printed.

I had to repeat this test several times to iron out some encoding problems related to the ordering of replacement entries – but now it works.

Hope this will help you with similar tasks at hand.

To spare you the typing: Here’s the Code to Download.

 


References: