Metadata, semantic web technologies and, yes, Gemini again
I saw a very interesting article the other day in Intelligent Enterprise by Seth Grimes, about a newly-published report on the semantic web by David Provost. Grimes is rightly sceptical about how close we are to these ideas reaching fruition and notes that many of the companies mentioned in the report are concentrating on ‘semantic data integration’. Frankly, to me the idea of being able to integrate data sourced from different parts of the web is still far-fetched, given the trials you have to go through to integrate data from different parts of the same company. But it did get me thinking: when users get Gemini, where will they get their data from? Yes, they’ll be downloading data on ‘industry trends’ etc from the web in the way we saw in the Gemini demo, but it won’t be that often. In well-run companies most of the time the data will come from four sources:
- The data warehouse, either from the relational source or Analysis Services.
- OLTP systems
- Data that lives in someone’s small, well-maintained, official, IT-department tolerated Excel spreadsheet. The kind of spreadsheet that couldn’t and shouldn’t be promoted to database form because it’s too small, or short-lived, or needs to be maintained by non-technical people, or needs the complete flexibility that Excel gives you. In fact, exactly the kind of thing that Excel is meant to be used for.
- Data that originally came from the data warehouse but was downloaded into someone’s local Access database, or exported to Excel, or sent to the user in something like a SSRS report; so data that has come to the user second hand.
How, then, can Gemini know (beyond its cleverness with column names and analysis of the data within those columns) what data can be integrated with what? Ideally it would have access to some form of metadata embedded in the source data that would help it make the correct decision all in all cases. Where this metadata comes from is a problem much larger than Gemini of course, and goes down to the fundamental question of how an enterprise can keep track of all of its data assets wherever they’re stored and understand what the data in each data store actually means; Microsoft is regularly criticised for its lack of a metadata tool and I guess MS is working on something in this area. If, in the first and second scenarios above, Gemini could connect to SQL Server or Analysis Services and see a common layer of metadata that would help it out, allowing it to join data from SQL Server with data from Analysis Services for instance. In a way, Analysis Services is already a metadata layer on top of the data warehouse, containing information on how tables need to be joined and how measures should be aggregated; perhaps this side of it will become more important as the MOLAP engine is superceded?
I also think some of the technologies of the semantic web, when applied to the enterprise, could be very useful here; I’ve only really just come across this stuff myself, but as an introduction I found the one-page explanation of RDF on Twine was very good, and the Microformats site was also full of interesting information. One of the companies mentioned in the semantic web report is Cambridge Semantics, whose product Anzo for Excel is aimed at imposing structure on all of those Excel spreadsheets I referred to in the third bullet point above. The demo on the web concentrates on using the tool for sharing data and collaboration, but as far as I can see the key to getting that to work is selecting data in the spreadsheet and linking it back to a common metadata layer. Of course the obvious criticism of a product like this is that you’re again relying on your users to make the effort to mark up their data and do so properly, but if there’s an incentive in the form of easier collaboration and – to put it bluntly – less of that boring cutting and pasting then maybe there’s a chance they could be persuaded to do so. I could imagine MS offering something very similar to this as part of Excel, with all the central management being done through Sharepoint, and extended throughout the Office suite to Access, Word etc. Gemini would then be able to do a better job of understanding what and how data in an Excel spreadsheet, say, could be integrated with data in the data warehouse.
So the bottom-up approach could be combined with the more traditional top-down metadata management approach, and crucially I’d want to see metadata automatically embedded in various the output formats of the server products. For example when you built a SSRS report or created an Analysis Services pivot table in Excel, in both cases I’d want the data by default to retain some record of what it was and where it had come from – the metadata would embedded in the document in the same way as if the user had marked up the document themselves. And this in turn would make it a lot more easily reusable and comprehensible by Gemini and other applications such as Enterprise Search. Metadata would accompany data as it travelled from document to document, data store to data store, format to format. This would then cover the scenario in the third bullet point, allowing you to integrate data from an SSRS report with data from the data warehouse it originally came from, or data from a user-generated Excel spreadsheet that had been marked up manually.
So much for structured data; ideally we’d want to be able to include unstructured data too. Some of the applications of natural language processing mentioned in the semantic web report looked very interesting, especially OpenCalais (which already has integration with MOSS 2007 I see). If you were looking at sales figures for a particular customer in Gemini, wouldn’t you also want to be able to look for documents and web pages that discussed sales for that customer too? And I’ve often thought that while the super-simple type-a-search-term approach for Search has worked brilliantly over the last few years, there’s a niche for a power-user interface for search too, kind of like a Proclarity Desktop for search where you could drag and drop combinations of terms from the central metadata repository and see what you found; could that be Gemini too? A tool for searching, integrating, aggregating and manipulating all enterprise data, not just numeric data?
Over the last few days I’ve seen the Gemini team (notably Amir) engaging with bloggers like me about our concerns through comments on our postings. I appreciate that, but my position hasn’t changed: I think the technology is cool but there’s too great a risk that it will be misused as it stands at the moment. Relying on oversight from the IT department isn’t enough; if MS had a convincing metadata story extending to all data types and data sources, metadata that Gemini could use, it would go a long way to addressing my concerns.