Realisation About File Formats

Tue Jul 20 2010

I’m currently working on a new shared source module that requires me to import content from published files. I could try to talk around what I’m doing, but this post only really makes sense if you know exactly what it is. So what the hell, let me commit myself and the release of this new module to the greater Sitecore community. The new module is an integrated help module which presents an MSDN-style application inside the Sitecore desktop to provide help to users. The application allows browsing the content by category in a tree, or searching by keywords. For launch I’m trying to get all the official Sitecore documentation found on SDN into the module to provide help content on how to use Sitecore.

Integrated Help Module

The module stores its content in the content tree of the core database. This means that you could also use this module to provide in-client help to your authors, as you can update the existing content and add your own.

So importing content from a document should be pretty easy, right? All you gotta do is read it in, grab the content item and push the content in, yeah? The integrated help module application itself was quite easy to build and has taken far less time than I have spent trying to populate content into it. The difference with the integrated help module content is that I didn’t want a single giant item holding the content of a whole document. Instead I wanted to break a document up into many smaller items to make information easier to find.

Initially I simply copied and pasted from the PDF documents I could find on the SDN. But I couldn’t do a straight copy and paste from the PDF document, as it would crash IE for some reason. I had to copy from the PDF, paste to Notepad, copy the plain text, paste to the content editor, then style the content in there. Measuring how long this was taking against the amount of spare time I had to dedicate to this module, I worked out I would need 1.5 to 2 months just to manually import the content. Not to mention that the content would need to be updated on each Sitecore release when new documents are made available.

Time to look for another solution. So I started looking into various PDF libraries which I could use from C# and Sitecore to import the content of the files. My idea was to split the document on the headings. That would be logical. I could create a new item when I hit a top level heading, import content to that item until I hit the next subheading, then create a new child item and import that section’s content into it, and so on.
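
The splitting idea described above can be sketched roughly as follows. This is only an illustration in Python rather than the C# the module is written in, and it assumes the hard part is already solved: every block of content arrives tagged with a heading level (0 meaning ordinary content). `HelpItem` and `build_item_tree` are hypothetical names for this sketch, not Sitecore types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HelpItem:
    """Hypothetical stand-in for a content item; not a Sitecore type."""
    title: str
    level: int
    body: List[str] = field(default_factory=list)
    children: List["HelpItem"] = field(default_factory=list)


def build_item_tree(blocks):
    """Nest blocks into items by heading level.

    `blocks` is an iterable of (level, text) pairs, where level is 1, 2, 3...
    for a heading and 0 for ordinary content.
    """
    root = HelpItem(title="root", level=0)
    stack = [root]  # stack[-1] is the item currently receiving content
    for level, text in blocks:
        if level == 0:
            # Ordinary content belongs to the most recently opened item.
            stack[-1].body.append(text)
            continue
        # A new heading closes any open items at the same or a deeper level.
        while stack[-1].level >= level:
            stack.pop()
        item = HelpItem(title=text, level=level)
        stack[-1].children.append(item)
        stack.append(item)
    return root
```

Everything here depends on knowing which blocks are headings and at what level, which is exactly the information the rest of this post is about finding.
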
That should be pretty straightforward, I just need to work out how to identify the headings in the PDF document. This was my first mistake: the assumption that I could distinguish a heading from normal content in the PDF file. The reason this is so hard is that PDF is a distilled, PostScript-style format. It’s the same as printing a document to a printer. You end up with a piece of paper with various patterns of ink on it that humans can interpret. We interpret the larger text on the page as a heading. But did the printer know it was a heading? No. The computer just told it to now print in 20 point and bold, not to print a heading. And this is what you’re dealing with in the PDF document.

Now, I wasn’t about to try and decipher the text style changes in the document to determine which was a heading and which was content, so I gave up trying to use PDF files and instead put in a request to the Sitecore documentation team for the docx files the PDFs were generated from. They were very helpful and graciously sent them to me.

Onto the next approach. I thought that if PDF documents were distilled then the source documents would be easier to work with and should contain the semantic information I was after. So I started trying to understand the docx format. For anyone who doesn’t know, docx is actually a zip archive which contains the content and assets of your document. You can unzip a docx file with any zip utility to see how the file is made up. The core piece of the document is the word/document.xml file, which holds the content of your document in an XML format. We all know XML, so this should be easy to manipulate. I could just look through the content for heading tags or something.

Now, maybe the reason I keep thinking all documents contain semantic information and structure is because I’m a web developer and I’m used to HTML, where a heading is a heading and a bullet list is a bullet list, not just text with styling.
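
To make the "docx is a zip archive" point concrete, here is a minimal Python sketch (again, not the C# used for the module) that opens a docx with the standard `zipfile` module and pulls out the main XML part. It assumes the main part sits at `word/document.xml`, which is where Word places it; the helper names are made up for illustration.

```python
import zipfile

# Assumption: the main document part is at the location Word normally uses.
MAIN_PART = "word/document.xml"


def list_parts(docx_source):
    """Return the names of all files inside the docx archive.

    `docx_source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        return archive.namelist()


def read_main_part(docx_source):
    """Return the raw XML bytes of the main document part."""
    with zipfile.ZipFile(docx_source) as archive:
        return archive.read(MAIN_PART)
```

Calling `list_parts` on a real document is a quick way to see what else is packed into the archive alongside the main content.
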
And this is where I finally had an epiphany: PDF and docx don’t carry semantic information about the structure of their content the way HTML does. Even the docx file simply stores the content and puts a style tag on each section of content to distinguish it as a heading. Here’s an excerpt from a docx file showing a heading.

<w:p w:rsidR="001B120F" w:rsidRDefault="001B120F" w:rsidP="00C50147">
  <w:pPr>
    <w:pStyle w:val="Heading3"/>
    <w:rPr>
      <w:lang w:eastAsia="da-DK"/>
    </w:rPr>
  </w:pPr>
  <w:bookmarkStart w:id="48" w:name="_Ref232497934"/>
  <w:bookmarkStart w:id="49" w:name="_Toc248797676"/>
  <w:r>
    <w:rPr>
      <w:lang w:eastAsia="da-DK"/>
    </w:rPr>
    <w:lastRenderedPageBreak/>
    <w:t>Analytics Email Distribution</w:t>
  </w:r>
  <w:bookmarkEnd w:id="48"/>
  <w:bookmarkEnd w:id="49"/>
</w:p>
<w:p w:rsidR="001B120F" w:rsidRDefault="001B120F" w:rsidP="00C50147">
  <w:pPr>
    <w:rPr>
      <w:lang w:eastAsia="da-DK"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:eastAsia="da-DK"/>
    </w:rPr>
    <w:t xml:space="preserve">To configure analytics report distribution by email, in the Content Editor, 
    edit the Schedule field in the Data section of the </w:t>
  </w:r>
  <w:r w:rsidRPr="00EC597F">
    <w:rPr>
      <w:rStyle w:val="SitecoreCodeChar"/>
      <w:lang w:eastAsia="da-DK"/>
    </w:rPr>
    …

You can see the “Heading3” style above, but that’s just a name. The element that contains the heading content is not a heading element as you’d expect in HTML. As you can see, the heading style (the “Heading3” above) is defined in a sibling element to the element that contains the content. So even navigating to the content and finding what style to apply proved difficult, but not impossible. The trick was to process each w:p element in turn, reading both the style and the content from within that element.

Luckily the docx format uses friendly style names which can be used to impose the semantics I’m after. However, this only works when the document author does the right thing and uses the styles in the document. Of course the Sitecore documents follow this requirement.

And this is where I’m currently at: writing a docx import utility for the integrated help module where I can take different actions based on the style of a section of content. I hope my experiences above help guide you if you have to venture down the path of doing more than just reading the text content out of these document formats.
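
The per-paragraph trick described above might look something like this in Python, using only the standard library. It is a sketch, not the importer itself (which is written in C#): it walks every `w:p` element, reads the style name from `w:pPr/w:pStyle` if there is one, and joins the text of all `w:t` runs. The function names are invented for this example, and the "HeadingN" naming is an assumption taken from the excerpt above; documents with other style names would need their own mapping.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix is bound to in docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def iter_paragraphs(document_xml):
    """Yield (style_name, text) for each w:p element in the document XML.

    style_name is None when the paragraph has no w:pStyle.
    Simplification: nested paragraphs (for example in text boxes) are not
    treated specially.
    """
    root = ET.fromstring(document_xml)
    for p in root.iter(f"{{{W_NS}}}p"):
        style_el = p.find("w:pPr/w:pStyle", NS)
        style = style_el.get(f"{{{W_NS}}}val") if style_el is not None else None
        # A paragraph's text is spread across one or more runs (w:r/w:t).
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        yield style, text


def heading_level(style, prefix="Heading"):
    """Return N for a style named like 'HeadingN', or 0 for anything else."""
    if style and style.startswith(prefix) and style[len(prefix):].isdigit():
        return int(style[len(prefix):])
    return 0
```

Feeding each paragraph's style through `heading_level` gives the kind of "is this a heading, and how deep" answer that HTML would have provided for free.
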

Alistair Deneys

Sat Jul 24 2010

Thanks Alexander. No, no plans for that at this stage. But seeing as docx is an open standard, this would now be easier than ever to do. If you wanted to look into this I would recommend using the Microsoft docx libraries (as I'm using for the integrated help module content importer), but keep in mind these libraries are geared more towards the packaging of a docx than its content.
