I’m doing an interesting project at work where I am “scraping” data off of Word documents, and I may write a few blog posts on it. It’s probably not the best way to handle this sort of thing, but it works well without the cost of a third-party tool that transforms the data to XML, and the code is fairly straightforward. In my task, the data from a series of templated .doc and .docx files will be moved to a .csv, imported into SQL Server, cleaned, moved to a dimension table, and finally fed to a cube in Analysis Services. Pretty cool, eh? I think so.
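For anyone curious what the first hop of that pipeline (document to .csv) can look like, here is a minimal sketch. It is not the Office automation approach this post is about: it is a swapped-in technique in Python that treats a .docx as the zip archive of XML that it is, so it only handles .docx, and the older binary .doc files would still need Word itself. The function names and the "Label: value" paragraph layout are assumptions made up for illustration; a real template will likely need its own parsing.

```python
import csv
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx.

    `source` can be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text.strip())
    return paragraphs


def extract_fields(paragraphs):
    """Turn 'Label: value' paragraphs into a dict (hypothetical template layout)."""
    fields = {}
    for line in paragraphs:
        if ":" in line:
            label, value = line.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields


def write_csv(rows, fieldnames, out):
    """Write one CSV row per document to an open text stream."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

In use, you would call docx_paragraphs() and extract_fields() on each file in a folder, collect the resulting dicts in a list, and hand that list to write_csv() along with the column names your SQL Server import expects. Labels not listed in fieldnames are dropped, and missing ones come out as empty cells.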
One thing about doing this: if you plan on scheduling and running it from the server itself, you’ll need either 1) Microsoft Office or 2) the Primary Interop Assemblies (PIA) Redistributable installed on that server, so that you can take advantage of the Office (in my case Word) methods to fetch the data off of the documents. Most servers don’t have Microsoft Office installed, since there’s normally no reason for it, and to avoid burning a copy of Office that costs money, you can add these assemblies instead. The download is free.
It reminds me of back in the day when we used SQL Mail, going back to 1998 with SQL 6.5 and then SQL 7. You had to install the entire Outlook client on the server, and then a smart DBA would go and delete the GUI so nobody could open the client. Just a random thought…didn’t SQL Mail really suck back then, and doesn’t it still? For those of you who started as SQL practitioners with SQL Server 2005 and only know Database Mail, you really missed a lot of fun trying to figure out why that abomination never worked correctly. Oh well.
Anyway, get the PIARedist.exe, and start doing some slammin’ Office Automation.
Happy Primary Interopping!
Lee Everest
--------------------------------

http://www.microsoft.com/download/en/details.aspx?id=3508