
Sitecore API vs SQL

It’s inevitable: whenever you talk to techies new to Sitecore, at some point someone is going to ask to look at the database schema. I always cringe when they ask, because I know where they’re headed. The thing with Sitecore is you have an abstracted data model that you can think in terms of. You don’t (and probably shouldn’t) think about the data in terms of what is in the database. Don’t get me wrong, it’s good to understand the database, but don’t limit your thinking to it.

This abstracted data model in Sitecore is the content tree. Breaking it down into its constituent components you have the item, which contains a collection of fields. Realistically that’s as low as you need to go into the data model of Sitecore. It’s more efficient to think in these terms rather than in terms of the database tables.

In fact, the database schema in Sitecore is modelled on the content tree, not the other way round. I guess this is a throwback to the old days when Sitecore stored its content not in a database, but in an XML file. And you can see evidence of this in the database schema.

One thing that sets Sitecore apart from many other CMSs is the fact it has a really good API. You can do anything that Sitecore themselves do in the CMS using the API, including creating and updating items in the content tree. And you really should use the API whenever you need to interact with the data in Sitecore.

There are many benefits to using the API as opposed to accessing the database directly using SQL. To name just a few:

Data Providers

The Sitecore data model provides a pluggable interface to allow storage and retrieval of content from any data store including different RDBMSs. Out of the box Sitecore contains data providers for Microsoft SQL Server, Oracle and MySQL. If you write any code directly against the database you limit the databases you can attach your site to. Better to use the API and gain the benefit of being able to swap to a different RDBMS.
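To illustrate how the provider is decoupled from your code, here is a rough sketch of the kind of configuration involved (based on the stock web.config; the exact structure and names vary between Sitecore versions, so treat this as illustrative only):

```xml
<!-- Illustrative only: the $(database) variable is expanded per database
     (e.g. SqlServer, Oracle), so swapping RDBMS is a configuration change,
     not a code change. Exact structure varies by Sitecore version. -->
<dataProviders>
  <main type="Sitecore.Data.$(database).$(database)DataProvider, Sitecore.Kernel">
    <param connectionStringName="$(1)" />
  </main>
</dataProviders>
```

Code written against the API asks a `Database` object for items and never references a specific RDBMS, so it keeps working no matter which provider is plugged in here.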

Caching

The data providers contain internal caching to speed up your queries and prevent unnecessary calls to the database server. If you connect directly to the database you’ll have to implement your own caching.
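To make this concrete, here is a hedged sketch (API member names from the Sitecore.Kernel assembly; exact cache behaviour may differ between versions) showing that repeated reads through the API are served from the data provider's caches, and that those caches are exposed if you need to manage them:

```csharp
var db = Sitecore.Configuration.Factory.GetDatabase("master");

// First call hits the database; the item then lands in the item/data caches
var home = db.GetItem("/sitecore/content/home");

// Subsequent calls for the same item are served from cache - no network trip
var homeAgain = db.GetItem("/sitecore/content/home");

// The caches are exposed per database if you need to inspect or clear them
db.Caches.ItemCache.Clear();
db.Caches.DataCache.Clear();
```

None of this costs you anything as an API consumer; with direct SQL you would need to build an equivalent caching layer yourself.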

More Efficient Coding

Thinking about the content at a higher level and not getting bogged down in the fine detail of its implementation allows you to think more efficiently about what you’re doing. The code is also a lot simpler. Here is some quick example code showing how to create a new item in Sitecore and populate its fields.

// Get a reference to the master database
var db = Sitecore.Configuration.Factory.GetDatabase("master");

// Find the parent item and the template for the new item
var target = db.GetItem("/sitecore/content/home");
var template = db.Templates["My Template"];

// Create the item, then populate its fields inside an editing transaction
var item = target.Add("New Item", template);
item.Editing.BeginEdit();
item["title"] = "The Title";
item["text"] = "The Text";
item.Editing.EndEdit();

Now compare that to the SQL you’d need to write for SQL Server.

declare 
  @parentid uniqueidentifier, 
  @id uniqueidentifier

select @parentid = ID from Items 
  where Name = 'Home' and ParentID = ( 
    select ID from Items 
      where Name = 'content' 
      and ParentID = '11111111-1111-1111-1111-111111111111')

set @id = NEWID() 
insert into items values (@id, 'New Item', '76036F5E-CBCE-46D1-AF0A-4143F9B557AA', 
  '00000000-0000-0000-0000-000000000000', @parentid, GETDATE(), GETDATE()) 

insert into versionedfields values(NEWID(), @id, 'en', 1, 
  '75577384-3C97-45DA-A847-81B00500E250', 'The Title', GETDATE(), GETDATE()) 
insert into versionedfields values(NEWID(), @id, 'en', 1, 
  'A60ACD61-A6DB-4182-8329-C957982CEC74', 'The Text', GETDATE(), GETDATE()) 
insert into versionedfields values(NEWID(), @id, 'en', 1, 
  '5DD74568-4D4B-44C1-B513-0AF5F4CDA34F', 'sitecore\admin', GETDATE(), GETDATE()) 
insert into versionedfields values(NEWID(), @id, 'en', 1, 
  'BADD9CF9-53E0-4D0C-BCC0-2D784C282F6A', 'sitecore\admin', GETDATE(), GETDATE()) 
insert into versionedfields values(NEWID(), @id, 'en', 1, 
  '25BED78C-4957-4165-998A-CA1B52F67497', '20100907T080000', GETDATE(), GETDATE()) 
insert into versionedfields values(NEWID(), @id, 'en', 1, 
  'D9CF14B1-FA16-4BA6-9288-E8A174D4D522', '20100907T080000', GETDATE(), GETDATE()) 
insert into versionedfields values(NEWID(), @id, 'en', 1, 
  '52807595-0F8F-4B20-8D2A-CB71D28C6103', 'sitecore\admin', GETDATE(), GETDATE())

The C# code is much more readable. And in fact, the SQL code above is somewhat simplified because I cheated and used the Sitecore UI to grab those field GUIDs rather than using more queries to find them.

But as with anything in development, the right approach depends on context. I’ve just given all these reasons why you should use the API, so under what condition could you possibly still consider a direct SQL approach? Data migration.

Data migration is a somewhat unique activity on a project. It’s normally only done once at the start of the project and isn’t ever used again for the normal life of the solution. That means you’re not likely to have to come back and update or maintain the code, so all the best practices can be thrown out the window in favour of a quick and dirty solution that just gets the job done.

So why do people want to use a direct SQL approach? Speed. It makes sense that accessing the database directly would be quicker than going through another layer of indirection: the API. But what performance gains can we expect from a direct SQL approach? Well, I’ve done some testing to find out.

I’ve had to make some assumptions to frame the testing approach I used. I assumed the data for the migration would be in a file, and that I could iterate over each record in the file and either call the appropriate APIs to create the item or generate TSQL on the fly. The data being imported includes links to reference items, which must be looked up and, if they don’t exist, created. This is closer to what a real migration would look like.
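As a rough sketch of the API side of that test (the file format, field names and the CreateReferenceItem helper here are my own illustrative assumptions, not the actual test code):

```csharp
var db = Sitecore.Configuration.Factory.GetDatabase("master");
var target = db.GetItem("/sitecore/content/import");
var template = db.Templates["My Template"];

foreach (var line in System.IO.File.ReadLines("import.csv"))
{
    var fields = line.Split(',');

    // Look up the referenced item; create it first if it doesn't exist yet.
    // CreateReferenceItem is a hypothetical helper for this sketch.
    var reference = db.GetItem("/sitecore/content/lookups/" + fields[2])
        ?? CreateReferenceItem(db, fields[2]);

    var item = target.Add(fields[0], template);
    item.Editing.BeginEdit();
    item["title"] = fields[1];
    item["reference"] = reference.ID.ToString();
    item.Editing.EndEdit();
}
```

The direct SQL variant replaces the body of the loop with code that emits TSQL strings and executes them against the database.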

In addition to pitting direct SQL against the API I also split testing of the SQL approach into two. While I’m generating TSQL on the fly I have the opportunity to run the TSQL in batches rather than item by item. This will save network trips to the database server but in a real world scenario you may prefer not to batch so if a single record fails it doesn’t fail the whole batch.
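The batched variant can be sketched like this (plain ADO.NET; `records`, `connectionString` and the GenerateInsertStatements method are illustrative assumptions standing in for the parsed import file and the TSQL generator):

```csharp
var sql = new System.Text.StringBuilder();
var count = 0;
const int batchSize = 100; // illustrative; tune for your environment

using (var connection = new System.Data.SqlClient.SqlConnection(connectionString))
{
    connection.Open();
    foreach (var record in records)
    {
        // Accumulate the generated TSQL for this record
        sql.AppendLine(GenerateInsertStatements(record));

        // Flush a full batch - one network trip per batch instead of per item
        if (++count % batchSize == 0)
        {
            using (var command = new System.Data.SqlClient.SqlCommand(sql.ToString(), connection))
                command.ExecuteNonQuery();
            sql.Length = 0;
        }
    }

    // Flush any remaining statements
    if (sql.Length > 0)
        using (var command = new System.Data.SqlClient.SqlCommand(sql.ToString(), connection))
            command.ExecuteNonQuery();
}
```

The trade-off mentioned above is visible here: a failure anywhere in the accumulated batch fails all of its records, whereas the item-by-item variant isolates failures at the cost of a network trip per record.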

For each of the approaches under test (SQL single, SQL batch, API) I took the average time to create a number of items and fill in their fields, with item counts ranging from 25 up to 2000. And here are the results.

[Chart: average item creation time for the SQL single, SQL batch and API approaches, for item counts from 25 to 2000]

Wow! I wasn’t expecting that. My prediction before I started this testing was that direct SQL would always perform better; I was just unsure how big the gain would be. As you can see from the results above, that’s not the case. The API actually performed better than the direct SQL approach for larger numbers of items. The only thing I can attribute this to is the caching the data provider does. The reference items benefit especially: even though I’m asking the API to find the same reference item for every imported item, it’s already cached in the data provider, so there’s no round trip to the database server, which saves time.

So this raises the question: could I optimise the TSQL I’m generating to perform faster than the API? Perhaps. One area where I see the API performing better is the reference items. The data provider cache is shared across all the import operations, whereas with SQL, any variables I create only last for that query (single or batch). This means that over the entire import operation I’m spending more time looking up reference items than I would using the API.

But I think if it comes down to trying to optimise your SQL to make it faster, you’re not going to get a performance increase big enough to justify not using the API. You may spend a considerable amount of time optimising your SQL only to find that you’re at best marginally better than the API. Don’t waste your time! And how many hundreds of hours do you think Sitecore has sunk into the development and testing of the API and the TSQL it generates? Do you really think you’ll find something they missed in the several hours your project has allocated to this activity?

So which approach should you use? Yes, the direct SQL approach runs faster for small numbers of items, but how much additional time are you spending writing that TSQL rather than using the very nice API which is available to you? In my opinion, you should always use the API for your data tasks. But if you do happen to use SQL, don’t forget to clear your caches so you’ll see the new data.
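If you do go the SQL route, the stale-cache problem can be handled from code. A minimal sketch, assuming the CacheManager class in the Sitecore.Caching namespace:

```csharp
// After running direct SQL against a Sitecore database, the in-memory
// caches still hold the old data, so clear them before reading anything
Sitecore.Caching.CacheManager.ClearAllCaches();
```

Without this step, items read through the API can silently reflect the pre-import state of the database.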

Comments

I've often been faced with the same question; "What's the SQL Schema like". My response has always been the same; "Why would you want to know?". Fact is, retrieval of data is what really matters here - and I counter the question (usually) with: "What do you think is faster? getting the data you need by a long series of INNER LEFT JOIN, LIKE '%' and whatnot, or querying against a 36.000 byte Lucene index that is cached in memory?"
Sitecore is not a "live data system" in the traditional sense, and more often than not, you have absolutely no need to worry about database schemas and low level data performance. And good riddance, thanks for that :-)

Maybe it's worth adding; I've just re-implemented an existing T-SQL system on a Sitecore engine. The old engine was performing advanced JOIN and queries against a very complex business system. However, utilizing Sitecore standard technologies such as Lucene Indexing, the LinkDatabase and the Sitecore Data Cache, I was able to outperform the live SQL query model by more than factor 10:1. Sure, if data changes by the minute, Sitecore hits a heapload of problems... but in my experience, most data never does change on that frequency.

Alistair Deneys

Thanks for the comments and feedback Mark. It's good to have real world examples to back up the lab data, and vice versa :)

Great article. I always think it's wise to use the API, and the reason for this is simple: you would have to recreate a full or partial implementation of what the API does in T-SQL, which would be expensive and costly. I was just thinking about security, custom field types, workflow and of course versioning. Even with data migration this would be a costly as well as problematic path to venture down. But as always, great to know what the real difference is, keep up the good work :)

Dave

I can think of an application for writing SQL, which is to answer ad-hoc queries where GUIDs referring to items within the Sitecore tree are stored in external tables as foreign keys.
This is the issue I have: I want to knock up the occasional one-off SQL query to produce a report where most of the data has nothing to do with the content tree. I do not want to do this in code!

Alistair Deneys

Hi Dave, Absolutely, you nailed it. In fact, in my presentation at Sitecore Symposium this year I covered scripting in Sitecore, and one of the topics was direct SQL access. The scenario you describe is the same one in which I recommended people use direct SQL access. Generally with ad-hoc queries you're not too concerned with performance, so there's no issue in that regard. Because SQL Server stores the field values in tables, it's very easy to query all fields regardless of field name, transcending the Sitecore data model. Just don't go making ad-hoc updates through SQL :)
