Friday, January 19, 2007

Converting an Excel file to CSV - Factor makes XML fun!

Today I was sent a rather important Microsoft Excel document. Unlike older binary formats, this was an XML file, and the version of OpenOffice I had installed on my Debian-stable Linux workstation was too old to cope with it. Rather than upgrading OpenOffice on the Debian box, or installing OpenOffice on my Mac, I decided to write a Factor program to convert the data to CSV, comma separated values. OpenOffice can open that for sure. Just like my previous foray into parsing XML with Factor, writing this code only took a few minutes, including testing, in no small part due to Daniel's excellent documentation.
REQUIRES: libs/xml ;
USING: io sequences xml xml-utils ;
IN: msxml>csv

: print-csv ( table -- ) [ "," join print ] each ;

: (msxml>csv) ( xml -- table )
"Worksheet" find-tag
"Table" find-tag
"Row" find-tags [
"Cell" find-tags [
"Data" find-tag children>string
] map
] map ;

: msxml>csv ( infile outfile -- )
<file-writer> [
<file-reader> read-xml (msxml>csv) print-csv
] with-stream ;

Of course this only handles a very tiny subset of the Excel file format, but it was good enough to convert the file to CSV without losing any information.

The XML file used XML namespaces, but the parser handled that without problems. Daniel was very clever by allowing the programmer to either specify simple strings, or fully-qualified name tuples when searching for tags; allowing quick-and-dirty code to proceed as if the namespaces weren't there, but supporting the full XML specification when one needs it.

The (msxml>csv) word looks pretty in its current state, however for more complex XML processing I can easily imagine searching for tags can get out of hand very quickly. An XPath-like DSL would be a welcome addition to libs/xml, but I need to spend more time working with XML in Factor before I can come up with a Factorish solution. None of the XPath-like solutions I've seen in other languages were really satisfactory.

It is also interesting that the program above contains no stack manipulation at all.


Samuel Tardieu said...

Why did you use "ooxml" instead of "msxml" in the words names?

Slava Pestov said...

My mistake.

Daniel Ehrenberg said...

When you have a problem that can't be satisfactorily solved with find-* and get-* (which will soon be renamed tag-named, tags-named, tag-named* and tags-named*) then I'll find a solution, which may or may not involve creating a DSL. My guess is, no DSL will be needed, but such a problem almost definitely exists.

sohail said...

Wow that is really awesome I must say.

One of the more interesting Python libraries I've used uses reflection to parse XML. So something like this:

(Where XMLObject is the base class)
class MyTag(XMLObject):
subtag = ItemNode()

So this class uses reflection using the names of the members that are ItemNodes to actually parse the XML. How difficult do you think something similar would be in Factor? I haven't seen much of the traditional OO represented in your Factor posts so I apologize if I missed them but it would be interesting to see Factor's dynamic nature in action.