Chris Webb's BI Blog

Analysis Services, MDX, PowerPivot, DAX and anything BI-related

Another Look At Google BigQuery

with 11 comments

About a year ago I wrote a post looking at Google BigQuery which finished on a bum note when I ran into a limitation with the size of tables that could be used in a join. Recently I found out that that particular limitation had been lifted with the introduction of the new JOIN EACH operator, and since my account was still active and the data was still uploaded, I thought I’d see if my query could be made to run.

Just as a bit of background (if you can’t be bothered to read my previous post): the query runs on a sample data set originally created by Vertica (see here) that consists of a 1.2GB csv file with two integer columns and 86,220,856 rows. The rows represent the edges in a social network, and the aim of the query is to find the number of triangles – the number of cases where person A is connected to person B, B is connected to C and A is also connected to C. The query joins this large table to itself not once, but twice.

Anyway, to cut a long story short the query did work this time. Here’s the new version, rewritten to use JOIN EACH:

select count(*)
from
(select
e2.Source as e2Source, e2.Destination as e2Destination,
e3.Source as e3Source, e3.Destination as e3Destination
from
(select * from [Edges.EdgesFull]) as e2
join each
(select * from [Edges.EdgesFull]) as e3
on e2.destination = e3.source
where e2.source < e3.source) as e4
join each
(select * from [Edges.EdgesFull]) as e1
on e1.destination = e4.e2source
and e4.e3destination = e1.source
where e1.source < e4.e2source

Here’s the evidence:

image

It took 98.7 seconds to run, which I think is pretty good. Some remarks:

  • I’m not 100% sure this is the correct result, but I’m fairly confident.
  • I’m a long way from being 100% sure this is the most efficient way to write the query.
  • When the original Vertica blog post came out it sparked a p*ssing contest between vendors who wanted to to show they could do better than Vertica. You can see an example of other people’s results here; clearly other platforms could do a lot better, and remember these results are two years old.
  • Obviously you can’t judge the performance of a database on one query.

That’s all. My mind is now at rest, and I’ll go back to being a Microsoft fanboy tomorrow.

Written by Chris Webb

July 5, 2013 at 12:00 pm

Posted in Google

Tagged with

11 Responses

Subscribe to comments with RSS.

  1. Feels like good old times to read in Chris blog about features I implemented. Chris – please venture more outside of Microsoft walled BI garden, there are many interesting things there!

    mosha

    July 5, 2013 at 9:29 pm

    • What would really make it like old times would be if you rewrote the query so it executed in 0.00000001 seconds!

      Chris Webb

      July 5, 2013 at 9:42 pm

  2. Cool! I posted a link to this post in http://www.reddit.com/r/bigquery/comments/1iggfa/counting_triangles_in_a_social_graph_the_vertica/.

    Can you make the dataset public? Then everyone can play with this query :).

    (sharing is free in BQ)

    • Hi Felipe,

      You can get the source data from here: https://github.com/vertica/Graph-Analytics—-Triangle-Counting

      Regards,

      Chris

      Chris Webb

      July 17, 2013 at 10:31 am

      • Oh yes, raw data is cool. But a cooler feature of BigQuery (not too well known – yet), is that you can share your already ingested datasets. So if you are having the Edges.EdgesFull table still stored, it takes just 3 clicks to share it with the world: http://i.imgur.com/sSQIF50.png

        Then anyone can run the queries you wrote, without the need to download and re-upload the full dataset :).

        Thanks for doing this!

      • OK, I’ve done that now. Is there some kind of link to the dataset that I should share?

        Chris Webb

        July 17, 2013 at 11:02 pm

      • Cool! Now I just need to know your project id, to know in which namespace I can find this table.

        So instead of writing:

        select * from [Edges.EdgesFull]) as e2

        I need to write:

        select * from [your-project-id:Edges.EdgesFull]) as e2

        Thanks!

      • OK, here’s the ID:
        978003128907

        Chris Webb

        July 19, 2013 at 12:05 am

      • Thanks! The equivalent query that now anyone can run is:

        select count(*)
        from
        (select
        e2.Source as e2Source, e2.Destination as e2Destination,
        e3.Source as e3Source, e3.Destination as e3Destination
        from
        (select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e2
        join each
        (select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e3
        on e2.destination = e3.source
        where e2.source < e3.source) as e4
        join each
        (select * from [christopherpaulwebbuk:Edges.EdgesFull]) as e1
        on e1.destination = e4.e2source
        and e4.e3destination = e1.source
        where e1.source < e4.e2source

        A cleaner one would be:

        select count(*) from
        (select
        e2.Source as e2Source, e3.Destination as e3Destination
        from [christopherpaulwebbuk:Edges.EdgesFull] as e2
        join each [christopherpaulwebbuk:Edges.EdgesFull] as e3
        on e2.destination = e3.source
        where e2.source < e3.source) as e4
        join each [christopherpaulwebbuk:Edges.EdgesFull] as e1
        on e1.destination = e4.e2source
        and e4.e3destination = e1.source
        where e1.source < e4.e2source

        (but it doesn't get better performance, talking to the engineering team about it)

        Thanks for doing this? What t-shirt size would you like? :)

    • Oops:
      Thanks for doing this? –> Thanks for doing this!

      • No problem! If you can rewrite the query to make it faster, I would be very interested.

        Chris Webb

        July 19, 2013 at 11:52 pm


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 3,081 other followers

%d bloggers like this: