Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Archive for the ‘Google’ Category

Another Look At Google BigQuery


About a year ago I wrote a post looking at Google BigQuery which finished on a bum note when I ran into a limitation with the size of tables that could be used in a join. Recently I found out that that particular limitation had been lifted with the introduction of the new JOIN EACH operator, and since my account was still active and the data was still uploaded, I thought I’d see if my query could be made to run.

Just as a bit of background (if you can’t be bothered to read my previous post): the query runs on a sample data set originally created by Vertica (see here) that consists of a 1.2GB csv file with two integer columns and 86,220,856 rows. The rows represent the edges in a social network, and the aim of the query is to find the number of triangles – the number of cases where person A is connected to person B, B is connected to C and A is also connected to C. The query joins this large table to itself not once, but twice.

Anyway, to cut a long story short the query did work this time. Here’s the new version, rewritten to use JOIN EACH:

select count(*)
from
  (select
    e2.Source as e2Source, e2.Destination as e2Destination,
    e3.Source as e3Source, e3.Destination as e3Destination
  from
    (select * from [Edges.EdgesFull]) as e2
  join each
    (select * from [Edges.EdgesFull]) as e3
  on e2.destination = e3.source
  where e2.source < e3.source) as e4
join each
  (select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source

Here’s the evidence:

[screenshot: query result]

It took 98.7 seconds to run, which I think is pretty good. Some remarks:

  • I’m not 100% sure this is the correct result, but I’m fairly confident.
  • I’m a long way from being 100% sure this is the most efficient way to write the query.
  • When the original Vertica blog post came out it sparked a p*ssing contest between vendors who wanted to show they could do better than Vertica. You can see an example of other people’s results here; clearly other platforms could do a lot better, and remember these results are two years old.
  • Obviously you can’t judge the performance of a database on one query.

That’s all. My mind is now at rest, and I’ll go back to being a Microsoft fanboy tomorrow.

Written by Chris Webb

July 5, 2013 at 12:00 pm

Posted in Google


A Look At Google BigQuery


Over the years I’ve written quite a few posts about Google’s BI capabilities. Google never seems to get mentioned much as a BI tools vendor, but to me it’s clear that it’s doing a lot in this area and is consciously building up its capabilities; you only need to look at things like Fusion Tables (check out these recently-added features), Google Refine and of course Google Docs to see that it’s pursuing a self-service, information-worker-led vision of BI that’s very similar to the one Microsoft is pursuing with PowerPivot and Data Explorer.

Earlier this month Google announced the launch of BigQuery and I decided to take a look. Why would a Microsoft BI loyalist like me want to do this, you ask? Well, there are a number of reasons:

  • It looked like fun.
  • It’s actually described by Google themselves as an OLAP system here. I don’t agree with this classification – I think it’s better to describe it as an analytical database – but it was enough to make an old OLAP-fan like me curious. It’s also a column-store database. While it’s hardly a complete BI solution in itself it could easily be used as the back-end for one and I believe the French BI vendor BIME already support it as a data source.
  • I’ve argued in the past that the only reason anyone would want to do cloud-based BI is to scale way beyond whatever on-premises hardware they could afford, and Google does make some impressive claims for its scalability: it should be able to provide quick response times for queries on terabytes of data. After all, if I can handle hundreds of millions of rows of data in PowerPivot on my laptop then I’m only going to be tempted to use a cloud-based solution if I need to work with much larger data volumes.
  • It’s based on Dremel, and I’ve heard that one of the developers that works on Dremel is someone we used to know well in the world of Microsoft BI.

To test it out I thought I’d see how it handled the largest csv file I happened to have handy: a 1.2 GB dataset consisting of two integer columns and 86,220,856 rows. I wasn’t able to load the file directly into BigQuery because it was too large (smaller files can be loaded in directly), so I first had to upload it to Google Cloud Storage and then load it into BigQuery from there. The upload into Google Cloud Storage took about an hour, and once that was complete it took about 15 minutes to load the data into BigQuery. Here’s the dialog for loading data into BigQuery – it’s pretty straightforward, as you can see:

[screenshot: the BigQuery load dialog]

The first query I ran was a simple count(*) to find the number of rows in the table, and that took a respectable 2.9 seconds:

[screenshot: query result]
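For reference, the query was simply a count over the full table:

select count(*) from [Edges.EdgesFull]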

Finding the number of distinct values in one of the columns was a little slower, at 3.9 seconds:

[screenshot: query result]
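That query would have looked something like the following, assuming the distinct count was on the Source column:

select count(distinct Source) from [Edges.EdgesFull]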

As was getting the sum of the values in a column, at 4.0 seconds:

[screenshot: query result]
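And the sum, again assuming the Source column:

select sum(Source) from [Edges.EdgesFull]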

While not astoundingly fast, the fact that all these queries executed in under 5 seconds means that performance is good enough for ad hoc data analysis.

Next, I thought I’d have a crack at the problem that this dataset was actually intended for: the scenario described in this post on the Vertica blog from last year:

http://www.vertica.com/2011/09/21/counting-triangles/

Ignoring the my-technology-is-better-than-yours tone of this post, I thought this was an interesting problem: easy to understand but difficult to calculate quickly, and a good test for any product with pretensions to being a ‘big data’ platform. The two columns in the csv file are meant to represent the ids of members of a social network, with each row representing a relationship between two people. The problem to solve is to find how many ‘triangles’ there are in the data, ie situations where person A has a relationship with person B, person B has a relationship with person C and person C also has a relationship with person A.

It was fairly easy to adapt the SQL in the blog post to BigQuery syntax; the only slight problem was that BigQuery only supports joins between two tables at a time, so you have to use subselects. Note that the two inequality filters, carried over from the original query, ensure that each triangle is only counted once:

select count(*)
from
  (select
    e2.Source as e2Source, e2.Destination as e2Destination,
    e3.Source as e3Source, e3.Destination as e3Destination
  from
    (select * from [Edges.EdgesFull]) as e2
  join
    (select * from [Edges.EdgesFull]) as e3
  on e2.destination = e3.source
  where e2.source < e3.source) as e4
join
  (select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source

On a small test dataset everything worked OK, but on the full dataset I ran the query and… got an error:

[screenshot: error message]

Reading the small print in the docs I realised I’d run into the following limitation:

BigQuery provides real-time query performance for JOIN statements where one of the two sides is small. Here small means less than 8MB of compressed data; compression rates vary, but are usually in the range of 2-10X as compared to the corresponding CSV. We expect this 8MB limit to continue to increase over time. For simplicity, we always require the table on the right side of any JOIN clause to be small. Outer joins with a small table on the "outer" side are unsupported, which means that we only support left outer joins.

Clearly my entire table was going to be much more than this limit. I had a go at filtering the data so that the first column was less than 2000 (the max value in each column is 4,847,570) and that worked, returning in 23.1 seconds:

select count(*)
from
  (select
    e2.Source as e2Source, e2.Destination as e2Destination,
    e3.Source as e3Source, e3.Destination as e3Destination
  from
    (select * from [Edges.EdgesFull] where source < 2000) as e2
  join
    (select * from [Edges.EdgesFull] where source < 2000) as e3
  on e2.destination = e3.source
  where e2.source < e3.source) as e4
join
  (select * from [Edges.EdgesFull] where source < 2000) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source
[screenshot: query result]

A bit disappointing, but maybe this is a problem more suited to Pregel than Dremel? Certainly in more traditional OLAP scenarios, where you need to join a fact table to a dimension table, many dimension tables will be smaller than 8MB when compressed, so this limitation wouldn’t be such an issue.
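For illustration, a typical fact-to-dimension query that stays within the limit might look something like this (the table and column names here are invented for the example), with the small dimension table on the right-hand side of the join as the documentation requires:

select d.Region as Region, sum(f.SalesAmount) as TotalSales
from [Sales.FactSales] as f
join [Sales.DimCustomer] as d
on f.CustomerId = d.CustomerId
group by Region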

Overall I was impressed with the performance and ease of use of BigQuery, and I’ll be interested to see how it develops in the future and integrates with the rest of the Google stack (it’s already possible to hook it up to Google Docs with a bit of coding). I will, of course, be equally interested to see what Microsoft’s cloud BI and Office strategy comes up with to counter this.

Written by Chris Webb

May 20, 2012 at 11:02 pm

Posted in Google


Google Docs gets pivot tables


Not much to say, really, apart from the fact that you can now use pivot tables in Google Docs:
http://googledocs.blogspot.com/2011/05/summarize-your-data-with-pivot-tables.html

Yet another small step closer to Google having something resembling a BI solution… integrate Fusion Tables, Public Data Explorer, BigQuery and this and we’ll have something interesting, although I’m not convinced Google has a coherent BI strategy or thinks it needs one.

Written by Chris Webb

May 18, 2011 at 6:51 am

Posted in Google

Google Public Data Explorer


Over the last few years I’ve been tracking Google’s slow progress towards offering a cloud-based BI solution. Here’s a new development: I see from the Official Google Blog that you can now upload your own data to the Google Public Data Explorer (which I blogged about last year):

http://googleblog.blogspot.com/2011/02/visualize-your-own-data-in-google.html

There’s more background here:

http://www.niemanlab.org/2011/02/dataviz-democratized-google-opens-public-data-explorer/

How long will it be, I wonder, before it can access data from BigQuery and is integrated into Google Docs?

Written by Chris Webb

February 16, 2011 at 11:12 pm

Posted in Google


Google Docs can access data from BigQuery


I don’t know how I missed this news from last week (luckily I saw it relayed here), but this is important: Google have announced that you’ll be able to access data stored in BigQuery directly from Google Docs:
http://googledata.org/google-docs/bigquery-meet-google-spreadsheets/

I’ve been following BigQuery and Google’s BI efforts for a while, and it looks like it’s only a matter of time before they have something resembling a pivot table built into a Google Docs spreadsheet – and that would be a very significant development in the world of web-based BI. Think about it: vast amounts of data stored in BigQuery, accessible through a SQL-like interface and able to be queried very quickly; a spreadsheet on the front; all that’s needed is some kind of OLAP-like layer in between to make it easy for end users to build their own queries. And Google have done similar things to what’s necessary here with Fusion Tables and their collaboration with Panorama.

Microsoft really need an answer to this. We know SSAS in the cloud is coming, and we have the Excel web app, but again we need pivot table support in the web app to complete the picture (see here and here for more discussion on this topic). I hope it comes soon!

Written by Chris Webb

January 12, 2011 at 4:57 pm

Posted in Google


Google Books Ngram Viewer


This has been blogged about quite a bit over the last few days, but if you’re looking to kill some time at work with something vaguely BI-related then take a look at Books Ngram Viewer, the latest thing on Google Labs:

http://ngrams.googlelabs.com/

It’s pretty simple: it allows you to visualise the number of occurrences of words in books held in Google Books by year of publication, and therefore track the popularity of ideas and concepts over time (well, that’s the theory). You can even download the source data – if I’ve got some spare time it might be interesting to see this data in PowerPivot…!

There are already some interesting examples of things you can see with this data out there, for example on the Information Is Beautiful site, and on the Google Blog post announcing it. I can’t resist adding one more, though: Microsoft vs Oracle vs IBM vs Google – notice how IBM peaks in about 1988 then declines; Oracle and Microsoft have a similar peak in 2004 and then decline; while Google is still on the up…

Written by Chris Webb

December 20, 2010 at 11:37 pm

Posted in Google

Google Public Data Explorer


Google made yet another tentative step into the world of web-based BI with the launch of Public Data Explorer, a tool for analysing at least some of those public data sets that have been appearing thick and fast over the last year or so. Although it’s fairly basic compared to other tools out there like (to pick two examples from many) Tableau Public or Timetric, it’s actually quite fun and much better than other Google efforts in this area like Fusion Tables. There’s a very limited number of data sets available at the moment and there aren’t many options for slicing and dicing, but the functionality that is there is quite slick – I especially like the way you can zoom in and out along the time dimension. Here’s a screenshot showing some European unemployment data:

[screenshot: European unemployment data in Public Data Explorer]

Here’s a link to that analysis if you’d like to play with it yourself.

Of course, this will remain a toy until you can get more data into it: all the public data that’s available anywhere on the web, data that’s in Google spreadsheets, and maybe any data in more ‘difficult’ formats such as HTML tables in web pages (though that’s when you really need a complex tool like Kapow to extract it), or even any data that’s returned when you run a search (which Google has been doing to a limited extent for a year or so now; Public Data Explorer builds on this).

I’m surprised Wolfram Alpha hasn’t done something like this already; I wonder if Microsoft has something up its sleeve here too? After all, it has a partnership with Wolfram Alpha to get data from there, and, in Bing, a search engine in search of differentiating features. Maybe Live Labs Pivot is part of the story? And given that it’s likely there’ll be some kind of cloud-based Analysis Services in the future – which I’m willing to bet would actually be more like a cloud-based PowerPivot, and which will presumably work with Excel on the web – I can see Microsoft could have quite a strong story if it wanted. But this is all conjecture, of course. I wonder if there’s anyone on the Bing team who understands BI? No, I’m really making wild guesses now, so I think I’ll finish here…!

Written by Chris Webb

March 9, 2010 at 10:47 pm

Posted in Google

Google Fusion Tables


Well, well, well… another week, another BI-related announcement from Google. Jamie Thomson just brought my attention to Google Fusion Tables, which was released this week with almost no fanfare (maybe Google wanted to avoid the kind of backlash they got with Google Squared?). Jamie’s first comment was pretty much in line with what I thought: this looks a lot like a basic version of Gemini, or indeed any other DIY BI tool. Basically you upload data; you can filter it, aggregate it, edit it and even join datasets together; then you can format the results as tables, maps, charts and so on and share them with other people. You can find out more about how it works here:
http://tables.googlelabs.com/public/faq.html
http://googleresearch.blogspot.com/2009/06/google-fusion-tables.html

So, even though I’ve got loads to do today I had to check it out, didn’t I? Google provide a number of different free datasets for you to play with, but I thought I’d have a go with some data about the hot topic of the moment here in the UK: MPs’ expenses. This data is available in Google spreadsheet form – ideal for loading into Fusion Tables – from the Guardian data store site:
http://www.guardian.co.uk/news/datablog/2009/may/08/mps-expenses-houseofcommons

After a bit of trial and error (and Fusion Tables is definitely prone to errors – although of course it is a beta) I managed to create a view that shows the average value of MPs’ expense claims, excluding travel expenses, as a bar chart. I’m supposed to be able to share it here and I’ve got the HTML, but at the time of writing I can’t get the gadget to embed in this blog post. When I do, I’ll update this post to include it. In the meantime here’s a screenshot:

[screenshot: bar chart of average expense claims in Fusion Tables]

Nevertheless, it’s fun even if it’s not quite a useful business tool yet. But hmmm… is it just me or does Google have some kind of BI strategy?

UPDATE: this article has a little more detail on the technology behind it:
http://www.itworld.com/saas/69183/watch-out-oracle-google-tests-cloud-based-database
although I think it’s a bit premature to say that this is going to kill Oracle, Microsoft and IBM…

Written by Chris Webb

June 12, 2009 at 2:21 pm

Posted in Google
