‘SQLizing’ Excel Spreadsheets with ThinkAutomation.June 26, 2020
I Have the Power! (To Query)
In a business-administrative context, we often refer to spreadsheets as databases. It could be a database of customers or weekly sales targets, but regardless of the content, it’s a place where we frequently add and refer to data. But what happens when the Excel spreadsheet we started to work on becomes a multi-gigabyte, mammoth database full of incredibly verbose data, which is hard to filter or search through? Well, it’s usually a signal that a dedicated, back-end database is needed.
So what’s a good example of this scenario? Well, suppose you’ve had a department within your company that has to run lots of different reports at any given moment. It’s very common to have executives request data at a moment’s notice, and suddenly Bob in customer services is combing a spreadsheet for very specific, cross-referenced data, filtered a million times over, between multiple sheets.
Nightmare. We don’t have our own customer tracking software, we just run everything from this one giant spreadsheet. I want to be able to search and get datasets quickly, but I don’t want to go through the pain-points of moving away from a spreadsheet.
There is a technical solution, and while it does require some basic understanding of SQL queries, depending on how you want to run those queries, it can make automating the process of grabbing data from spreadsheets a bit easier.
Why Do This?
You might already be exploring a more technically-focused solution to fetching data but don’t want to implement a new (or even bespoke) software system.
The main reason to do this is that it is an easy entry point for an IT team to start helping you with the way you automate reports, without them having to understand your spreadsheets. SQL as a query language is pretty universal, so somebody in an IT team can write a query for most types of data meeting many different conditions, target your spreadsheet and then fetch the data without combing through it. The most they would need to know is what columns are in the spreadsheet.
Secondly, the benefit of keeping your spreadsheet, besides not having to move the data, is that other colleagues outside of the IT team can continue to add and update data, which is then pulled into the reports automatically by the queries that have been written.
It might even mean that Bob in customer services doesn’t even need to do anything to produce a report anymore. Requests for reports can become an IT request, with this comes an SLA, a standardised way of producing and outputting the report etc.
One other benefit centres around the authenticity of data. Now, nobody is suggesting that Bob in customers services might ‘cook the books,’ but having the production of reports handled by a team that does not have awareness of the particular domain the data resides in creates a separation of concerns, making it less likely that somebody could tamper with data now that we are using an automated process.
Creating a DSN for the Spreadsheet
The rest of this guide will be more technical, but it can be implemented by pretty much anybody who can use basic office software and may understand basic SQL query syntax. Get ready for more acronyms!
Databases interpret queries and commands sent to them from other programs in the form of drivers, typically, ADO or ODBC, and spreadsheets are no different.
A DSN (Data Source Name) is an object we can create in Windows which is aware of which driver to use for a specific database, or in this case, a spreadsheet we intend to use as if it were a relational database.
Note: For this example we’re using a driver that is working on .xls file formats. Drivers for xlsx and other spreadsheet types are available but not neccessary for my example.
Creating the DSN is simple.
- In the Windows search bar, search for ‘ODBC Data Sources.’ (You will see either 64-bit or 32-bit, or both depending on your machine.) Select the 32-Bit version. In windows, this should contain a driver for Excel by default.
2. On opening, you will see the below window. Select the ‘System DSN’ tab and then click ‘Add.’
3. Select the ‘Microsoft Excel Driver’ from the list. (This driver should already be installed along with Microsoft Office)
3. You should then see a prompt asking you to give your DSN a name and to select the spreadsheet you wish to target. In the example below, I have named my DSN ‘ExcelDSN’ and selected my test worksheet – ‘helloworld.xls’
‘HelloWorld.xls’ is a very small spreadsheet! We want to automatically get all the rows from it and parse out the values in column1 and column2 to use somewhere else (ie – send an email or compile a report etc.)
So in this very basic example, I need to be able to do a query which results in two variables, the data in column1 – ‘Hello’ and the data in column2 ‘World.’
Now that a DSN is in place, we’re in a position to run SQL queries on a spreadsheet from code, but what if querying the spreadsheet is only one link in the chain? Here is where I would use the RPA tool, ThinkAutomation.
Querying the DSN from ThinkAutomation
So assuming you’ve installed ThinkAutomation (take a look here for more info on configuring the software), how do we perform a lookup?
Opening up the action, we can see that we have a place to add a database connection string and a query for our lookup.
In here, we have specified the DSN we are using to connect to our spreadsheet with ‘DSN=’ followed by the name we gave to the DSN. Then we are adding a very simple SQL statement which will get all the rows from Sheet1 (Notice when we are querying from excel, we wrap the sheet name in ‘[ $]’; a quirk of the Excel driver)
Finally, we are telling ThinkAutomation which columns we want to parse out, and where we want the values to go. So in this case, we will take the result from Column1 (hopefully ‘Hello’) and place it in a ThinkAutomation variable called %Col1% and then the same for Column2 into a ThinkAutomation variable called %Col2%. We can then use these values for any automation tasks we wish to perform on the data we are extracting from the spreadsheet.
And lo and behold, if I test the query manually in ThinkAutomation……
As mentioned before, this may only be part of a larger automation project. Depending on what data we are automatically pulling from a spreadsheet, there could be many different things we want to do with the data. Adding it to an email to send out to various teams, using the data in a report to be published automatically, sending the data to a web service, or even posting it on Facebook (actions which can all be achieved with ThinkAutomation.)
So the next time you ponder the move to a large-scale database solution, consider sticking with the trusty old spreadsheet. It may be simpler!