Extracting text from PDFs

Unwanted line breaks in text copied from PDF
Unwanted line breaks in text copied from PDF

Anybody working with information sooner or later have to copy and paste text from PDF-files. And anybody who has done that knows what a pain in the a** that is! You get actual line breaks from the visual line breaks in the PDF. The other day I began a job where I have to copy and paste text from a whole bunch of PDF files and it didn’t take long before I almost exploded with anger ;)

So I thought: Why not make a simple application that extracts the text from the PDF and – to the most possible degree – normalizes the unwanted line breaks.

And then there was Textifyer

So I fired up Visual C# Express and started hacking. I soon found the PDFbox component – using IKVM.NET – and it didn’t take long before I had some code that actually extracted the text from a PDF file. (a PDF extraction in C# howto)

I figured out how to detect unwanted line breaks: Each line with an unwanted line break ends with a space character. Lines with a wanted line break doesn’t (in 99% of the cases). So it is just a matter of of looping over the lines and if it ends with a space skip adding a line break and just append it to the previous text buffer.

Unwanted line breaks removed
Unwanted line breaks removed

So now I just have to clean up the interface and bug test the program – which will happen automatically since I’m copy and paste from a whole bunch of PDFs at the moment. When I feel it’s working alright I will release the program. It’s really nothing hardcore about it anyway ;)

Textifyer: Drag-n-drop enabled
Of course there’s drag-n-drop support!

New Roxen Application Launcher for Linux written in Vala

This is not the latest version of Roxen Application Launcher. You’ll find the latest version at the download page.

A couple of weeks ago I stumbled upon a fairly new programming language named Vala. I thought it looked promising and since Vala is developed by the GNOME project – with the purpose of making software development for, primarily, GNOME easier – and I’m an avid GNOME user I wanted to look deeper into the world of Vala.

I, and most programmers I believe, work in that way that I need a real and useful project when learning a new programming language. So I thought why not re-writing the Roxen Application Launcher I wrote in C#/Mono a couple of years ago in Vala – which by the way is syntactically very, very similar to C# and Java. I’d gotten tired of always having to fiddle with the C# code with every new version of Mono since something always broke when Mono was updated so a re-write wasn’t going to be totally pointless. The good thing about Vala is that the Vala compiler generates C code and that’s what you compile the program from. Fast code and hopefully more mature and stable libraries that won’t break backwards compatibility with every new release.

What about Vala

So, on I went about it and I think that Vala is a really promising language. It’s still a very young language so some library bindings isn’t behaving exactly as expected and the documentation isn’t directly redundant – although the Vala reference documentation site isn’t half bad. But since Vala pretty much is a wrapper for, or binding to, the underlying C libraries you can find answers to your questions that way. All in all I think Vala has a promising future: Way more simple than C and almost as fast and light on memory (remember the Vala compiler generates C code) and way faster than C#/Mono and free from any Microsoft associations ;) .

What about the Roxen Application Launcher

In this new version I utilize GConf for storing application settings. I also made use of – for the first time – the GNU Build Tools for compilation which also makes it easier to distribute and for others to compile from the sources. This also means that the distributed version compiles from the C sources and not the Vala sources so there’s no need for the Vala compiler to build the program.

Other than that there’s nothing fancy about it. The Vala sources is available at my Github repository.

Roxen Appliction Launcher 0.4.2 19:38, Sun 20 December 2009 :: 374 kB


The screenshots is showing the Swedish translation.

List of downloaded files
Screenshot 1 of the Roxen Application Launcher

Adding support for new file type
Screenshot 2 of the Roxen Application Launcher

The GNOME status icon
Screenshot 3 of the Roxen Application Launcher

Vala – the new programming language for Gnome

This Friday when I read the last issue (in Sweden) of Linux Format there was an article on a new programming language named Vala. The goal for Vala is to provide a modern programming language for, primarily, developing Gnome applications. There is of course Mono, but Vala doesn’t run in a virtual machine but is complied to machine code. But Vala resembles C# syntactically and has borrowed a lot of concepts from C#.

From what I understand Vala code is first translated into plain old C code, and then compiled with the ordinary GCC compiler. The benefit is that you don’t have to get head aches about memory management and so forth.

Vala seems pretty interesting and I downloaded it and compiled it without any difficulties. There are precompiled packages for most Linux distros – it’s available in the Ubuntu repository – but since Vala is new and still under development the distro packages is far behind in version.

There’s also a Vala IDE and plugins for GEdit, Anjuta and Eclipse.

Anyway, just to try Vala out I made a little program – using Val(a)ide – that changes the desktop wallpaper on a per interval basis. Send the program a path to a directory with images and the background will change among those images every *nth minutes.

This program needs libgee which at the moment needs to be added manually.

Compile: valac --pkg gconf-2.0 --pkg gee-1.0 --pkg gio-2.0 main.vala -o wallpaper-iterator

239 lines of Vala
  1. /* main.vala
  2. *
  3. * Copyright (C) 2009 Pontus Östlund <spam@poppa.se>
  4. *
  5. * No license what so ever. Do what ever you like…
  6. *
  7. * Author:
  8. * Pontus Östlund <spam@poppa.se>
  9. */
  10. // GLib isn’t really neccessary since it’s imported automatically
  11. using GLib, Gee, GConf;
  12. /**
  13. * User defined exception types
  14. */
  15. errordomain IOError {
  18. }
  19. /**
  20. * Main class
  21. */
  22. public class Main
  23. {
  24. /**
  25. * Used as reference in option parser
  26. */
  27. static int arg_delay;
  28. /**
  29. * Application command line options
  30. */
  31. const OptionEntry[] options = {
  32. { “delay”, ‘d’, 0, OptionArg.INT, ref arg_delay,
  33. “Number of minutes to wait before swapping background”, null },
  34. { null }
  35. };
  36. /**
  37. * Main method
  38. *
  39. * @param args
  40. * Command line arguments
  41. */
  42. public static int main (string[] args)
  43. {
  44. try {
  45. var opt = new OptionContext(“\”/path/to/wallpapers\””);
  46. opt.set_help_enabled(true);
  47. opt.add_main_entries(options, null);
  48. opt.parse(ref args);
  49. }
  50. catch (GLib.Error e) {
  51. stderr.printf(“Error: %s\n”, e.message);
  52. stderr.printf(“Run ‘%s –help’ to see a full list of available “+
  53. “options\n”, args[0]);
  54. return 1;
  55. }
  56. if (args.length < 2) {
  57. stderr.printf(“Missing argument!\n”);
  58. stderr.printf(“Run ‘%s –help’ for usage\n”, args[0]);
  59. return 1;
  60. }
  61. // Default time before changing background is 30 minutes
  62. int delay = arg_delay > 0 ? arg_delay*60 : 60*30;
  63. try {
  64. WallpaperIterator bg = new WallpaperIterator(args[1], delay);
  65. bg.run();
  66. bg.stop();
  67. }
  68. catch (IOError e) {
  69. stderr.printf(“Error: %s\n”, e.message);
  70. return 1;
  71. }
  72. finally {
  73. stderr.printf(“Finally block reached!\n”);
  74. }
  75. return 0;
  76. }
  77. }
  78. /**
  79. * Class that handles changeing of desktop wallpapers on a per interval way
  80. */
  81. public class WallpaperIterator : GLib.Object
  82. {
  83. /**
  84. * Number of seconds to wait before changeing wallpaper
  85. */
  86. private int delay = 0;
  87. /**
  88. * Current index in the images list
  89. */
  90. private int index = 0;
  91. /**
  92. * List of images to change between
  93. */
  94. private ArrayList<string> files = new ArrayList<string>();
  95. /**
  96. * List of allowed content types
  97. */
  98. private ArrayList<string> allow = new ArrayList<string>();
  99. /**
  100. * GConf client object
  101. */
  102. private GConf.Client gclient = null;
  103. /**
  104. * GConf registry where the background image is set
  105. */
  106. private string gconf_key = “/desktop/gnome/background/picture_filename”;
  107. /**
  108. * Background being used when the application is started
  109. */
  110. private string default_bg = null;
  111. /**
  112. * Constructor
  113. *
  114. * @param dir
  115. * The directory to collect wallpapers in
  116. * @param delay
  117. * Time to wait – in seconds – before swapping wallpaper
  118. */
  119. public WallpaperIterator(string dir, int delay)
  120. throws IOError
  121. {
  122. assert(delay > 0);
  123. allow.add(“image/jpeg”);
  124. allow.add(“image/png”);
  125. var f = File.new_for_path(dir);
  126. if (!f.query_exists(null))
  127. throw new IOError.FILE_NOT_FOUND(” \”%s\” doesn’t exits!”.printf(dir));
  128. if (f.query_file_type(0, null) != FileType.DIRECTORY) {
  129. throw new IOError.NOT_A_DIRECTORY(” \”%s\” is not a directory!”
  130. .printf(dir));
  131. }
  132. collect(f.get_path());
  133. if (files.size < 2) {
  134. warning(“Not enough images found for this application to be useful!\n”);
  135. return;
  136. }
  137. this.delay = delay;
  138. }
  139. /**
  140. * Run the application. Starts a MainLoop
  141. */
  142. public void run()
  143. {
  144. gclient = GConf.Client.get_default();
  145. default_bg = gclient.get_string(gconf_key);
  146. MainLoop loop = new MainLoop(null, false);
  147. var time = new TimeoutSource(delay*1000);
  148. time.set_callback(() => { swap(); });
  149. time.attach(loop.get_context());
  150. loop.run();
  151. }
  152. /**
  153. * Stop the application. Tries to reset the background
  154. */
  155. public void stop()
  156. {
  157. try {
  158. if (default_bg != null)
  159. gclient.set_string(gconf_key, default_bg);
  160. }
  161. catch (GLib.Error e) {
  162. warning(“Failed to restore background!\n”);
  163. }
  164. }
  165. /**
  166. * Does the actual background swapping
  167. */
  168. private void swap()
  169. {
  170. if (index >= files.size) {
  171. debug(“Restart…\n”);
  172. index = 0;
  173. }
  174. debug(“Callback…%-2d (%s)\n”, index, files[index]);
  175. try {
  176. gclient.set_string(gconf_key, files[index]);
  177. }
  178. catch (GLib.Error e) {
  179. warning(“Failed to set background: %s\n”, e.message);
  180. }
  181. index++;
  182. }
  183. /**
  184. * Collect images
  185. *
  186. * @param path
  187. */
  188. private void collect(string path)
  189. {
  190. try {
  191. var dir = File.new_for_path(path).enumerate_children(
  192. “standard::name,standard::type,standard::content-type”,
  193. 0, null
  194. );
  195. FileInfo fi;
  196. string fp;
  197. while ((fi = dir.next_file(null)) != null) {
  198. fp = path + “/” + fi.get_name();
  199. if (fi.get_file_type() == FileType.DIRECTORY)
  200. collect(fp);
  201. else {
  202. if (fi.get_content_type() in allow)
  203. files.add(fp);
  204. else
  205. warning(“Skipping \”%s\” due to unallowed content type\n”, fp);
  206. }
  207. }
  208. }
  209. catch (GLib.Error e) {
  210. warning(“Error: %s\n”, e.message);
  211. }
  212. }
  213. }

I think Vala looks promising and I will try it out trying to write a more complex application when I find the time.

Visual C# 2008 Express bug

The other day at work I was hacking up a console application that converts an Excel workbook to chosen format. This is to be used in a web application where some people can upload an Excel file that will populate a MySQL database. To make it easier to parse the Excel file the console application will save a copy of the file as a tab separated file that is easier to parse (I will share the code when done).

Visual C# 2008 default layout

Anyhow! At work I use Visual C# 2008 Express Editon when – very rarely – hacking C# and it annoyed me that the code part was too far to the left for me to feel comfortable. So I though I could displace the right column with class viewer and solution explorer and put that on my right screen, narrowing the main window and put farther to the right on my left screen.

Error message when Visual C# crashes

So I did and it felt alright – until I tried to compile the application. Visual C# just crashed big time! I didn’t have any clue why at that moment so I fired it up again and had to displace the right column again and put everything where I wanted it. I tried to compile again and boom, crash again!

So I thought: What the heck! Fired the thing up again, displaced the right column but this time I thought I’d just close the application so it would remember my settings the next time it was started. So I just hit ALT+F4 and boom, crash again. After repeating this again I came to the conclusion that Visual C# 2008 crashes if you displace the right column and then tries to either compile the current project or close the application it self. There’s probably some other actions that will cause it to crash as well.

vs-expressMy desired layout

The final conclusion is that I will have to use the application layout dear Microsoft has decided I should use. Thank you very much! Okey, I can live with it but it’s a bit annoying!

Roxen Application Launcher 0.3

I just created an updated version of Roxen Launcher. I got an error report from one who tried to build the application:

./ApplCon.cs(232,24): error CS0122:
‘System.Net.Sockets.TcpListener.Server’ is inaccessible due
to its protection level

An explanation of the error can be found at MSDN. The reason seems to be due to TcpListener.Server being a protected property but I tried to access it as public. The strange thing is that I didn’t get an error about it nor on my machine at home or at work. Now I created a derived class of TcpListener.Server so that I can access the protected property.

We’ll see if it works!

Roxen Application Launcher 17:31, Sat 17 October 2009 :: 97.3 kB

Gnome Font Manager

One thing I miss on Linux Gnome is font manager. Not just a font viewer but a proper manager like the old Adobe Type Manager. So I thought: Well, lets create one then! It might be that it already exist some font managers for Linux/Gnome but as always; this will be a good project for learning new stuff so I really don’t care if there are 1000 font managers out there ;)

Font parsing

The first thing to do, and that I have done, is porting my font parser from PLib to C#. That was no major head ache. There are at least to different font classes available in C# Mono (`System.Drawing.Font` and `Pango.Font`) but they don’t give all information about the font that I want.

Worth mentioning is that I heavily used the Mono DataConverter class to unpack the binary strings in the fonts. The unpack() function in PHP is just tremendous and there doesn’t seem to be a native alike in C#. But thanks to Miguel de Icasa’s DataConverter it went quite alright.

Font preview

The next thing to do was figuring out how to create the font previews. And I figured it out ;) First I though of using the console program gnome-thumbnail-font to create the previews but I had to throw that one into the bin since it doesn’t seem to handle multi line text. Since I’ve never used the graphics functions in C# before I came to the conclusion that I had to create the previews all by my self. It was quite easy finding good examples on the net of how to create text images with C#. A couple of hours later that problem was also solved (as you can see in the screen shot below). And man the graphics stuff in C# is fast. The preview images are generated instantly!

Next step

I have a lot left to do before this is a useful program but we’re heading there. One feature I’m planning on implementing is the ability to create your own font sets that you can activate/deactivate.

And I will probably come up with some more stuff to add, but that will be a later head ache!

Screen shot: Gnome Font Manager
Gnome Font Manager

Roxen Application Launcher for Linux

I thought I should broaden my C# knowledge a bit and you know how it is: To learn new stuff you need a real project to work on or else you will lose the fire sooner than later. So I came up with a good project that is actually useful to me: Porting Roxen’s “Application Launcher” to C#. There’s nothing wrong with the original one, written in Pike, except that it uses GTK 1 which is quite hideous (in an aesthetic meaning) compared to the newer GTK 2. And I also though it would be cool to create a panel applet (in the notification area of Gnome so you could put the Application Launcher in the background).

BTW: For those of you not knowing what the heck Roxen’s Application Launcher (AL here after) is here’s a brief explanation: Files in Roxen CMS is stored in a CVS file system which means that you don’t deal with files the way you normally do. To manage files you use a web browser interface (which is a darn good one I might add) but sometimes you actually want to edit files in your standard desktop application. And it is here the AL comes to play. You can download a file through the browser interface so that the file is opened in the AL. AL will then open the file in the desktop application you have associated with the file’s content-type. When you make your changes and saves them the AL will directly upload the changes to the server. So in short I could have said: The Application Launcher is a means to edit files on a remote Roxen server with a preferred desktop application.

The obstacles

I must say I’ve learned a lot from this project!

First off: If you download a file for editing and the AL is already started you don’t want to start a new instance of AL (this is something I have never ever thought about before – in general terms, not just concerning AL) but when you do think about it you find that it’s not a piece of cake to solve. I solved it the way it is solved in the original AL. The first instance of AL that is started also starts a “socket server” that listens for incoming traffic on a given port on the local IP. When a new instance is started it first checks if it can connect to said port and if it can it sends the arguments through the socket to the first instance which then handles the request. The second instance is simply terminated when it has send the data though the socket.

So there I had to do some socket programming. Great fun :)

Secondly: Stuff happens in the background of AL – data send through the socket remember – which means that nothing happens when you try to update the Graphical User Interface. (NOTE! This is the first more advanced desktop application I’ve done.) After “Google-ing” around a bit I came to know that this was a real newbie problem ;) The thing is that the GUI can only be updated through the same thread that started it so when using background threads – implicitly that’s what I’m doing although handled by the asynchronous callback infrastructure of C# – you need to make sure the GUI is updated through the main thread. This is the most simple way so solve it:

3 lines of C#
  1. Gtk.Application.Invoke(delegate{
  2. CallFunctionToUpdateGUI();
  3. });

That’s not too difficult when you know it ;)

Thirdly: The AL is sending data back and forth through the HTTP protocol which means we have to use some sort of HTTP client. C# has a couple of ways to do this but unfortunately they came up short, or I couldn’t use them anyway. I didn’t manage to figure out exactly why I always caught an exception saying something like: A protocol violation occurred!. I’m far from the only one who have fought with this and it has something to do with the headers sent from the remote server. You can invoke “unsafe header parsing” but that was to much of a hassle so I created my own little HTTP client.

One big annoying thing with C# is that is seems almost impossible to turn data from streams into strings without having to use any one of the System.Text.Encoding.* classes/objects which in my case meant that images and files in binary form got seriously fucked up. I manged to solve this my never turning the data into a string but keeping it as a System.Text.Encoding.* all the way from request to response to saving to disk. It was rather irritating but at the same time nice when solved (and I learned a whole bunch about System.Text.Encoding.*, System.Text.Encoding.*, System.Text.Encoding.* and System.Text.Encoding.*.)

Finally: Of course I learned a great deal more about C# but this blog post is starting to get pretty excessive so I will round it off by saying that MonoDevelop is starting to become pretty darn good! I just upgraded to the latest version of Ubuntu and that also meant that I got the latest MonoDevelop and I must say it’s more stable than ever (although it occasionally crashes) and a whole bunch of new features are in place. One I havn’t used before – although it might have existed before – is the “Deployment” stuff. It creates a package with configure and make files for optimal compilation. Really smooth!

Source and screens

I will finish off by adding the source files and a few screen shots:

Roxen Application Launcher 17:31, Sat 17 October 2009 :: 96.7 kB

Screen shot 1: Just a standard view
Roxen Application Launcher 1

Screen shot 2: The panel applet in action
Roxen Application Launcher 2

Screen shot 3: The Application Launcher in Swedish
Roxen Application Launcher 3

Screen shot 4: Adding support for a new content type
Roxen Application Launcher 4

PHP documentation browser

Even though I have my vacation right now I need to hack some ;)
What I have done is a PHP documentation browser, I simply call it PHPDoc Browser, for GNU/Linux to read the PHP documentation locally on the computer (it’s a little bit like CHM for Windows). This little app isn’t that necessary but I thought it was a great thing for learning new stuff: The application is written in Mono/C# and uses SQLite for storing the search index.

Difference from CHM

Although there’s a CHM equivalent, XCHM, for Unix like systems I decided that my application serves a purpose: What I like about the PHP manual is that you can hit www.php.net/the_function in you browser and you will get the documentation for “the_function”. Of course I implemented the same functionality in PHPDoc Browser. You can also do a wild card search like “array_*” and you will get a list of all “array_*” functions.

I also implemented a free text search but that’s not “a real” FTS at the moment. FTS in SQLite in a little bit harder than in MySQL so while awaiting the next SQLite version the FTS is a bit of a hack. But you can quote the search to narrow it down ;)

What’s learned?

Most of the unnecessary stuff I do I do to learn and the PHPDoc Browser is no exception. I learned how to use SQLite in Mono/C# and I also wrote a self contained installer in Bash, and Bash I havn’t really touch although I’ve been using GNU/Linux for 6-7 years! I found it quite fun writing the installer :)


If you would like to try it out:

  1. Download the install script
  2. Un-tar it (`tar zxvf phpdocbrowser-installer.tgz`)
  3. Make sure the script is executable (`chmod +x phpdocbrowser-installer`)
  4. Open a console and run the installer: ./phpdocbrowser-installer
  5. Answer the questions and you’r ready to go.

The script will create a directory, ./phpdocbrowser-installer, in your home directory in wich you will find ./phpdocbrowser-installer, ./phpdocbrowser-installer and a directory, ./phpdocbrowser-installer, with the PHP documentation. A “run script” will also be installed either in ./phpdocbrowser-installer or in your home directory depending on how you answered the questions and if you agreed to create a desktop shortcut one will hopefully appear on your desktop (havn’t tested this on KDE but it should work fine in Gnome).

To run the application hit the desktop shortcut, if one was created, or invoke ./phpdocbrowser-installer from a console.

If you wish to remove the PHPDoc Browser run the following from a console:

And that’s that!


PHPDoc Browser 17:31, Sat 17 October 2009 :: 8.9 MB

Screen shots




Generating PDFs on the fly

One time or another you come to the point, at least when you work as a web developer, when you need to generate PDFs on the fly. In principle this wouldn’t be too hard: pass the document you want to convert through a PDF printer. Well, it doesn’t sound that hard but talk is one thing and implementation another! I came to the point recently where I, or we, really needed to generate PDFs on the fly. The question of just converting one document to a PDF was not the only problem ahead: We also needed to alter the content, which is some tremendously complex content, of the PDF on the fly.

The workflow is as follows: A person submits a web based application through a web form. The commision is received by a handling officer who continues the application. The commision is sent back to the applier who in turn continues the application. The commision is once again sent to the handling officer who finish the application.

The application form, originally a Word document with loads of checkboxes, input fields and what not it’s the document at the top), at hand here is really, really complex and more or less industry standard. The pure complexity of the form is the reason why we wanted to digitalize it and make it into a more workflow like thing. Since the application form is more or less a standard we wanted the ability to see what the web based application would look like in reality if it was handled the old way through the Word document and of course give us the ability to archive the application in a constant format. And this is the type of document that people like to put in a binder so it needs to be printable and look like it always has looked like.

The first problem

So how do you go about and fill the Word document with values from a database, or how do you dynamically create a PDF that looks exactly like that complex Word document? Creating simple PDFs from a PDF API is not that hard but this really isn’t a simple structured document we’re talkning about so that didn’t sound likely. So I started looking at Adobes LiveCycle products but that would be really expensive (I mean really, really, really expensive) so we had to throw that idea away.

Then it hit me: We’re running Windows servers at work so couldn’t I build a COM based solution and use the COM interface for Word and in that way use the original Word document as a template and dynmialcally fill it with content from the database? I had never touched COM before but I thoght nothing is impossible!

The first solution

So I fired up Visual C# Express and began writing a console application that could be called from the web server. The first thing to solve was how to get all the data from the web server to the console application. Of course the console application could call the database directly but that would make the application hard to maintain, it would make it less dynamic and in general no good (and there are more documents to create and alter than the one I’ve mentioned). XML is no news to me so I though that passing an XML tree as an argument with what fields to fill and with what values would be a good idea. And since during the commision there are several types of documents to generate another argument would be what Word template to use.

So the question of how to get what information to fill the Word document with to the console application was easily solved (I love XPath).

The next question was how to use the Word COM interface, and frankly that was quite easy (thanks to Google). If you use bookmarks in the Word document you can loop over them with ease and just as easily fill them with content. So the principle I used was the following: The XML tree passed to the console application looks like this:

9 lines of XML
  1. <!– The name of the root node doesn’t matter –>
  2. <n>
  3. <field name=“Some_Bookmark1” type=“text”>The value</field>
  4. <field name=“Some_Bookmark2” type=“text”>Another value</field>
  5. <field name=“Some_Bookmark3” type=“checkbox”>true</field>
  6. <!–
  7. … and just as many nodes neccessary …
  8. –>
  9. </n>

so the name attribute is correponding to the name of the bookmark/data field to alter and the node value is quite obvious the value to assign to the bookmark/data field. Text fields and checkboxes is delt with in different ways in the COM API so that’s why the name attribute.

When the bookmarks/data fields has been altered the document is saved in a temporary directory with a uniq name and the path to that document is returned by the console application.

Up to this point…

This is how things work so far:

  1. The console application is called with two aruments:
    1. What template to use
    2. An XML tree with what bookmarks/data fields to fill with what values
  2. The console application opens the template and fill it with data according to the XML tree passed as the second argument.
  3. The console application saves the document with a uniq name in a temporary directory and the path to that document is returned by the console application.

Quite clear I’d say ;)

The second problem

Okey, so now this application needed to be callable from a web page. And frankly that seemed like a minor problem, and so it was. The only minor concern was that the overall web based application workflow was (is) developed by Roxen as a consultant job so the implementaion of this little Word to PDF thing needed to be as simple as possible. So it was time to fire up JEdit and do some Pike hacking.

The second solution

Thanks to the fact that there’s a module in Roxen that can execute external applications and thanks to the fact that the source code of Roxen is available I borrowed some of the code written by Marcus Wellhardt at Roxen ;) . I’m no wicked Pike expert like the dudes at Roxen so I borrowed the code from name that creates a background process and altered it slightly to fit my needs. On top of that I wrote a RXML tag that would be the wrapper to call to create the background process. The tag, a container tag, takes one required argument, the Word template to use, and one optional argument, the path to the directory where the template is stored (this argument is optional because it can be set globally in the module settings), and the XML tree with the bookmarks to fill as content. So the tag would be used like this (still, not all functionallity is there yet):

7 lines of RoXen Macro Language
  1. <word2pdf template=‘the-template.doc’>
  2. <d>
  3. <field name=“Some_Bookmark1” type=“text”>The value</field>
  4. <field name=“Some_Bookmark2” type=“text”>Another value</field>
  5. <field name=“Some_Bookmark3” type=“checkbox”>true</field>
  6. </d>
  7. </word2pdf>

So I tried the tag and indeed a Word document with my given data was created in the temp directory on the server!

The third problem

Now it was time to create a PDF from the temporary Word document. I found an open source PDF printer software called PDFCreator and it looked promising so I went for it. PDFCreator can be installed in “server mode” which was what I wanted and thanks to this guide it was easy to set it up as a service as well.

First I wrote a standalone Pike script for debugging and testing that would kind of emulate a web page request (it’s some what tiresome developing new thing directly through Roxen since the module needs to be reloaded/re-compiled for every change you make to the code). Anyhow, things started to work pretty well: The script created a Word document by calling my console application, the script passed the path to that document to PDFCreator and MS Word was launched and a PDF was printed to the given temp directory. Yes! PDFCreator relies on the software the orginal document is associated with.

Okey, so the code worked fine so I altered my Roxen tag module so that my tag name would behave in the same manner as the standalone Pike script. But when I ran the tag through a web page request PDFCreator hung, or rather the MS Word process created by PDFCreator hung!

The third solution

It really puzzeled me: Everything went ok if I ran the process as my user through my standalone Pike script, but when the process was created by the web server (local system) MS Word got stuck. So I changed the server process to run as an ordinary user account, my user (local administrator) and now it worked perfectly. Then it hit me: maybe the MS Office applications relies on the directory name in name and the user name doesn’t have that (it’s no ordinary user). So it seems that to get MS office apps to run from another process, that process needs to have been created by a normal user account. I dunno, maybe there’s a way around this, like installing MS Office in some kind of server mode or something like that!

Anyhow, my little RXML tag name now did what I wanted and the result of the tag is the raw data of the created PDF, the temporary Word document and the PDF is wiped clean from the server, so we can just store the raw data in a database and deliver it easily to a user when the PDF is requested.

Word2pdf wrapped up

I’m quite satisfied with the solution I came up with: The console application I wrote (I call it WordWriter) seems to work like a charm and the fact that what data to fill the template with is passed as an XML tree makes it really easy to use for more than the specific purpose it was developed for. Plus, if the over all web application Roxen developed (is developing) is changed I don’t need to change the WordWriter application :)

Finally this is how this little name can be used in real life:

25 lines of RoXen Macro Language
  1. <?comment
  2. Here data is pulled from the database according to the
  3. application being handled
  4. ….
  5. ?>
  6. <define variable=“var.pdfdata” trimwhites=“”>
  7. <word2pdf temlplate=“my-template.doc”>
  8. <n>
  9. <field name=“My_Bookmark1” type=“text”>&sql.data1;</field>
  10. <field name=“My_Bookmark2” type=“text”>&sql.data2;</field>
  11. <field name=“My_Bookmark3” type=“text”>&sql.data3;</field>
  12. <!– … –>
  13. <field name=“My_Bookmark10” type=“checkbox”>
  14. <if variable=“&sql.data10; = 1”>true</if>
  15. <else>false</else>
  16. </field>
  17. </n>
  18. </word2pdf>
  19. </define>
  20. <sqlquery host=“dbhost”
  21. query=“UPDATE table SET pdf1 = :pdf WHERE id = :id”
  22. bindings=“pdf=var.pdfdata, id=sql.id”
  23. />

And that should do the trick!

The synergy

At my job we’re like 1200 people and sometimes people want to create PDFs either to put on our web site or e-mail to someone and of course not all computers have Adobe Acrobat installed so wouldn’t it be nice to have a form on the intranet where people could upload a file and get it back as a PDF? Since most of the PDF generating functionality was already at hand I just had to write another RXML tag that just creates a PDF from a document. Said and done so now we have PDF generating form on the intranet.


This has been a really fun thing to do from which I have learned alot of new stuff, and that is probably why I love my work so much: There’s always new things to dive into.