Generating PDFs on the fly

One time or another you come to the point, at least when you work as a web developer, when you need to generate PDFs on the fly. In principle this wouldn’t be too hard: pass the document you want to convert through a PDF printer. Well, it doesn’t sound that hard but talk is one thing and implementation another! I came to the point recently where I, or we, really needed to generate PDFs on the fly. The question of just converting one document to a PDF was not the only problem ahead: We also needed to alter the content, which is some tremendously complex content, of the PDF on the fly.

The workflow is as follows: A person submits a web based application through a web form. The commision is received by a handling officer who continues the application. The commision is sent back to the applier who in turn continues the application. The commision is once again sent to the handling officer who finish the application.

The application form, originally a Word document with loads of checkboxes, input fields and what not it’s the document at the top), at hand here is really, really complex and more or less industry standard. The pure complexity of the form is the reason why we wanted to digitalize it and make it into a more workflow like thing. Since the application form is more or less a standard we wanted the ability to see what the web based application would look like in reality if it was handled the old way through the Word document and of course give us the ability to archive the application in a constant format. And this is the type of document that people like to put in a binder so it needs to be printable and look like it always has looked like.

The first problem

So how do you go about and fill the Word document with values from a database, or how do you dynamically create a PDF that looks exactly like that complex Word document? Creating simple PDFs from a PDF API is not that hard but this really isn’t a simple structured document we’re talkning about so that didn’t sound likely. So I started looking at Adobes LiveCycle products but that would be really expensive (I mean really, really, really expensive) so we had to throw that idea away.

Then it hit me: We’re running Windows servers at work so couldn’t I build a COM based solution and use the COM interface for Word and in that way use the original Word document as a template and dynmialcally fill it with content from the database? I had never touched COM before but I thoght nothing is impossible!

The first solution

So I fired up Visual C# Express and began writing a console application that could be called from the web server. The first thing to solve was how to get all the data from the web server to the console application. Of course the console application could call the database directly but that would make the application hard to maintain, it would make it less dynamic and in general no good (and there are more documents to create and alter than the one I’ve mentioned). XML is no news to me so I though that passing an XML tree as an argument with what fields to fill and with what values would be a good idea. And since during the commision there are several types of documents to generate another argument would be what Word template to use.

So the question of how to get what information to fill the Word document with to the console application was easily solved (I love XPath).

The next question was how to use the Word COM interface, and frankly that was quite easy (thanks to Google). If you use bookmarks in the Word document you can loop over them with ease and just as easily fill them with content. So the principle I used was the following: The XML tree passed to the console application looks like this:

9 lines of XML
  1. <!– The name of the root node doesn’t matter –>
  2. <n>
  3. <field name=“Some_Bookmark1” type=“text”>The value</field>
  4. <field name=“Some_Bookmark2” type=“text”>Another value</field>
  5. <field name=“Some_Bookmark3” type=“checkbox”>true</field>
  6. <!–
  7. … and just as many nodes neccessary …
  8. –>
  9. </n>

so the name attribute is correponding to the name of the bookmark/data field to alter and the node value is quite obvious the value to assign to the bookmark/data field. Text fields and checkboxes is delt with in different ways in the COM API so that’s why the name attribute.

When the bookmarks/data fields has been altered the document is saved in a temporary directory with a uniq name and the path to that document is returned by the console application.

Up to this point…

This is how things work so far:

  1. The console application is called with two aruments:
    1. What template to use
    2. An XML tree with what bookmarks/data fields to fill with what values
  2. The console application opens the template and fill it with data according to the XML tree passed as the second argument.
  3. The console application saves the document with a uniq name in a temporary directory and the path to that document is returned by the console application.

Quite clear I’d say ;)

The second problem

Okey, so now this application needed to be callable from a web page. And frankly that seemed like a minor problem, and so it was. The only minor concern was that the overall web based application workflow was (is) developed by Roxen as a consultant job so the implementaion of this little Word to PDF thing needed to be as simple as possible. So it was time to fire up JEdit and do some Pike hacking.

The second solution

Thanks to the fact that there’s a module in Roxen that can execute external applications and thanks to the fact that the source code of Roxen is available I borrowed some of the code written by Marcus Wellhardt at Roxen ;) . I’m no wicked Pike expert like the dudes at Roxen so I borrowed the code from name that creates a background process and altered it slightly to fit my needs. On top of that I wrote a RXML tag that would be the wrapper to call to create the background process. The tag, a container tag, takes one required argument, the Word template to use, and one optional argument, the path to the directory where the template is stored (this argument is optional because it can be set globally in the module settings), and the XML tree with the bookmarks to fill as content. So the tag would be used like this (still, not all functionallity is there yet):

7 lines of RoXen Macro Language
  1. <word2pdf template=‘the-template.doc’>
  2. <d>
  3. <field name=“Some_Bookmark1” type=“text”>The value</field>
  4. <field name=“Some_Bookmark2” type=“text”>Another value</field>
  5. <field name=“Some_Bookmark3” type=“checkbox”>true</field>
  6. </d>
  7. </word2pdf>

So I tried the tag and indeed a Word document with my given data was created in the temp directory on the server!

The third problem

Now it was time to create a PDF from the temporary Word document. I found an open source PDF printer software called PDFCreator and it looked promising so I went for it. PDFCreator can be installed in “server mode” which was what I wanted and thanks to this guide it was easy to set it up as a service as well.

First I wrote a standalone Pike script for debugging and testing that would kind of emulate a web page request (it’s some what tiresome developing new thing directly through Roxen since the module needs to be reloaded/re-compiled for every change you make to the code). Anyhow, things started to work pretty well: The script created a Word document by calling my console application, the script passed the path to that document to PDFCreator and MS Word was launched and a PDF was printed to the given temp directory. Yes! PDFCreator relies on the software the orginal document is associated with.

Okey, so the code worked fine so I altered my Roxen tag module so that my tag name would behave in the same manner as the standalone Pike script. But when I ran the tag through a web page request PDFCreator hung, or rather the MS Word process created by PDFCreator hung!

The third solution

It really puzzeled me: Everything went ok if I ran the process as my user through my standalone Pike script, but when the process was created by the web server (local system) MS Word got stuck. So I changed the server process to run as an ordinary user account, my user (local administrator) and now it worked perfectly. Then it hit me: maybe the MS Office applications relies on the directory name in name and the user name doesn’t have that (it’s no ordinary user). So it seems that to get MS office apps to run from another process, that process needs to have been created by a normal user account. I dunno, maybe there’s a way around this, like installing MS Office in some kind of server mode or something like that!

Anyhow, my little RXML tag name now did what I wanted and the result of the tag is the raw data of the created PDF, the temporary Word document and the PDF is wiped clean from the server, so we can just store the raw data in a database and deliver it easily to a user when the PDF is requested.

Word2pdf wrapped up

I’m quite satisfied with the solution I came up with: The console application I wrote (I call it WordWriter) seems to work like a charm and the fact that what data to fill the template with is passed as an XML tree makes it really easy to use for more than the specific purpose it was developed for. Plus, if the over all web application Roxen developed (is developing) is changed I don’t need to change the WordWriter application :)

Finally this is how this little name can be used in real life:

25 lines of RoXen Macro Language
  1. <?comment
  2. Here data is pulled from the database according to the
  3. application being handled
  4. ….
  5. ?>
  6. <define variable=“var.pdfdata” trimwhites=“”>
  7. <word2pdf temlplate=“my-template.doc”>
  8. <n>
  9. <field name=“My_Bookmark1” type=“text”>&sql.data1;</field>
  10. <field name=“My_Bookmark2” type=“text”>&sql.data2;</field>
  11. <field name=“My_Bookmark3” type=“text”>&sql.data3;</field>
  12. <!– … –>
  13. <field name=“My_Bookmark10” type=“checkbox”>
  14. <if variable=“&sql.data10; = 1”>true</if>
  15. <else>false</else>
  16. </field>
  17. </n>
  18. </word2pdf>
  19. </define>
  20. <sqlquery host=“dbhost”
  21. query=“UPDATE table SET pdf1 = :pdf WHERE id = :id”
  22. bindings=“pdf=var.pdfdata,”
  23. />

And that should do the trick!

The synergy

At my job we’re like 1200 people and sometimes people want to create PDFs either to put on our web site or e-mail to someone and of course not all computers have Adobe Acrobat installed so wouldn’t it be nice to have a form on the intranet where people could upload a file and get it back as a PDF? Since most of the PDF generating functionality was already at hand I just had to write another RXML tag that just creates a PDF from a document. Said and done so now we have PDF generating form on the intranet.


This has been a really fun thing to do from which I have learned alot of new stuff, and that is probably why I love my work so much: There’s always new things to dive into.