r/place 2022 vs Tableau Prep & Tableau Desktop: Part 1
Reddit dropped a 21GB pixel-by-pixel dataset from r/place, so I asked how far Tableau Prep and Desktop could really take it.
- Converting bulky text fields (user IDs, colours, coordinates) to ranked numeric IDs in Tableau Prep cut the file from 21.7GB to 2.78GB, close to an eightfold reduction, because whole numbers store, index and process far more efficiently than long text strings.
- Tableau Prep samples a million records for live preview, so steps feel instant even on a 21GB source, but you can't write output incrementally as the flow runs, which is why Alteryx batch macros suit very large iterative jobs better.
- Use the pause button in Tableau Desktop when building views on big data so all your changes compile into a single query rather than re-querying on every drag.
- Setting coordinate fields to AVERAGE (rather than SUM) keeps x and y positions intact for a heat map, and dropping a red colour mark to about 1% opacity reveals density where some pixels had over 100,000 edits.
- DATETRUNC to the hour buckets the timestamps so you can filter and use the Pages shelf to animate the canvas evolving over the four-day experiment.
- What the r/place dataset is (0:00)
- Dataset structure and file size (2:04)
- Loading the 21GB file into Prep (2:56)
- Sampling and optimising field types (4:00)
- Aggregate and rank to build numeric IDs (7:48)
- Joining IDs back and cleaning (10:08)
- The change-tracking modelling problem (13:55)
- Exporting to a Hyper extract (20:03)
- Results and into Tableau Desktop (21:51)
- Building the canvas heat map (24:30)
- Date truncation and filtering (28:54)
- Animating with the Pages shelf (34:36)
0:00Hey, it's Tim here. A few days ago Reddit
0:02released a data set from its r/place
0:04activity that took
0:05place at the beginning of the month.
0:07Essentially this is a community project
0:09that allows users to
0:12edit individual pixels on a canvas and the
0:15whole idea of it is that you can edit any
0:17particular
0:18part of the canvas as many times as you
0:20like and in order to get this to work you
0:22have to sort of
0:24activate your community of followers and
0:26pretty much every community on the internet
0:28ends up
0:28mobilizing a whole group of people to go in
0:31and edit individual pixels to make pictures
0:34and so
0:34what you can see in front of me here is
0:36that it's the sort of tab on the Reddit
0:38community page but
0:39actually you go to the website and they
0:41have released a data set so if you go ahead
0:44and click
0:44on this link you'll see that you actually
0:47get a complete image if I open this in a
0:49new tab you
0:50can see this is the actual image that was
0:52generated the final one before they
0:54essentially reset the
0:55canvas and it's kind of interesting because
0:58it's essentially sort of like a real life
1:00war that's
1:01taking place in the sort of online
1:03community and as people in the community
1:06mobilize their followers
1:07and you can kind of see this happening in
1:10real time so if you're sitting on Reddit
1:12watching this
1:13this is what you would have seen. Obviously
1:15it's been sped up to kind of give it more
1:17life but
1:17people are individually editing pixels and
1:20as a result of that you've kind of got this
1:22sort of
1:23real life living canvas this year they've
1:25released the data set so what you actually
1:27get in the data
1:28set is essentially the changes that have
0:31been made to each and every pixel over
1:33time and as you can
1:34see it expands over to multiple parts of
1:36the canvas and it keeps on going and sort
1:38of you get
1:39to this final place and essentially what
1:41happens at a certain point the only change
1:43you can make
1:43is the color white and so the whole canvas
1:45gets taken over by the color white and
1:47essentially
1:48that's a race to find the very last pixel
1:51that hasn't been edited that actually takes
1:53a bit of
1:54time there's some pixels you can't see here
1:56because I'm on a 4k screen this is a much
1:58bigger than 4k
1:59canvas and so it takes a while for it to
2:01actually get to the place where you've
2:03edited every pixel
2:04but nonetheless they release the data set
2:07and I downloaded it now data is quite
2:09simple it's
2:10essentially the timestamp user id the pixel
2:13color and the coordinate and it's a csv
2:15file you can
2:16download it and look at it there's a couple
2:18of quirks with it but we'll get into it as
2:19we start
2:20to look at the data so what I thought is
2:22hey what if we could work with this in
2:24tableau how far can
2:25the tableau toolset get us with this data
2:28set essentially I'd like to be able to
2:30visualize this
2:31in some way or form even if we can recreate
2:34some of that final artwork now it depends
2:37on a couple
2:38of things because this is a huge file just
2:40for context if I go open up my finder here
2:43go to the
2:43desktop where this file is you can see the
2:45the zipped file that you download from
2:47reddit is 12
2:48gig but when you unzip it because you need
2:50to do that in order to work with it it's
2:52actually 21.71
2:54gigabytes so this is what we're actually
2:56going to be using to start with and so what
2:58I'm hoping to
2:59do is get this into tableau desktop but I'm
3:01going to start with tableau prep because I
3:03know there's
3:04already efficiencies that can be made with
3:06this file set just looking at the file
3:07format I know
3:08it's a csv it doesn't make sense to try and
3:11use tableau desktop to optimize this
3:13because that's
3:14just going to be a really really slow thing
3:16to do and eventually I'll have to create
3:18the extract in
3:19tableau anyway in order to get it to work
3:21in a performant manner so let's use tableau
3:23prep to do
3:24that work then we can just open tableau and
3:26start visualizing things straight away so
3:28let's just go
3:29ahead and drag this 21 gig file into prep
3:32and see what it does so and you can
3:34actually see that I
3:35actually brought it into the canvas already
3:37let me get rid of this first one so you can
3:38just see this
3:39main one here and I've brought it in it's
3:42opened it up and we get to see the data
3:44already now
3:44throughout this video I'm just going to
3:46sort of work as fast as I can to try and
3:48make this video
3:49an honest reflection of what it's like to
3:51work with these things but I'll also try
3:53and talk about what
3:54I'm doing as I do it so if I talk quickly
3:56apologies ask questions in the comments are
3:58more than happy
3:58to sort of explain it now tableau prep is a
4:01fairly new tool and one of the smart things
4:04it does is
4:04it actually samples the data so when I go
4:06to click on this node here essentially I'm
4:08creating a node
4:10to create a data prep step and you'll see
4:12that it actually works really really
4:13quickly and I'm
4:14actually able to already start previewing
4:16the file so some of you might be thinking
4:18hey what's going
4:19on here is it reading the entire file well
4:21for those of you who are keen on wanting to
4:23know how
4:23this is impacting my computer up here I can't
4:25actually highlight it because it's off
4:27screen so
4:28let me draw an arrow up here you can see my
4:31CPU and memory I've got an M1 max with 24
4:34cores of
4:35GPU not that that does anything in this
4:37instance but the memory is 32 gigabytes so
4:40I'm using up
4:41half my memory at the moment just working
4:44on this file now if I look at this there's
4:46a couple of
4:46things I've immediately noticed this is all
4:50text and text is okay to work with if you're
4:52in excel
4:53or something like that but when you're
4:55working with really large data sets it's
4:56not the most
4:57optimal format to have your data in because
5:00it takes quite a lot of time to index and
5:02store
5:02and work with in general numbers tend to be
5:05a lot faster in fact most computers have
5:08specific
5:08processing units which are designed to work
5:11with numbers and if you have something like
5:13a GPU those
5:14are designed to work with numbers even more
5:16just because of the way they work so if you're
5:18trying
5:18to do lots of mathematical operations
5:20sometimes your GPU can actually step in and
5:22help out
5:23and in this particular case I don't think
5:25we're doing anything that advanced no
5:27machine learning
5:28nothing like that we just sort of want to
5:30clean this data set so looking at this
5:33there's a couple
5:33of things the user ID field is this huge
5:35text field so that is immediately going to
5:38be something
5:39I'm going to target with ID numbers ID
5:41numbers are going to be whole numbers they'll
5:43be a smaller
5:44subset of numbers because even if let's say
5:46there's 200 million records in here the
5:49number
5:49200 million is going to be easier to store
5:52than this whole sort of large chunk of text
5:55essentially and again numbers are going to
5:57be much much easier to process and the date
6:00and time this
6:01is a pretty easy one I can I can solve this
6:03right now I can just set this to date and
6:05time
6:05Tableau will look at it and parse out the date
6:07and it does this in two steps you'll see
6:08that it first
6:10gives me a preview of what it's going to
6:11look like 1970 and then down here you'll
6:14see it's actually
6:14changed the dates just over here at the
6:16bottom you can see that it's actually
6:18changed them and
6:19then eventually the summary will change
6:21once it's been able to process the sample
6:23data that it has
6:24it's sampling a million records so you can
6:27see here the four fields a million records
6:30now this doesn't mean that's all the data I
6:32won't really know how many rows are in this
6:34data set
6:34until I run it so we're going to just run
6:38this and see how we get on so now if I
6:42start to look
6:42at this let's let's try and just think
6:44through this so we've got date some time
6:47and it's at a
6:49second level so we've got one hour 43 25
6:54and we've got the coordinate which is
6:57essentially an x and y
6:59coordinate separated by a comma and you've
7:02got the pixel color so what I want to do is
7:06try and
7:06get rid of any text based method of storing
7:09information in this got rid of one which
7:12was
7:12the timestamp that's been converted to a
7:14date and time we need to do this with user id
7:17pixel
7:17color and coordinate so in my mind what I'm
7:20going to do is I'm going to map each of
7:21those things to
7:22an id number and essentially we're going to
7:25give those a unique id number for each
7:28unique record
7:29and once I've done that I'm going to bring
7:31that back into the data set so that there's
7:33a reference
7:34then I'm going to get rid of the text base
7:35fields and that would just leave us with
7:37the numerical
7:37base fields which should be a lot easier to
7:40store essentially will help me reduce the
7:42size of the
7:42file which means tableau should be more
7:44performant and should be able to aggregate
7:46things much much
7:47quicker so let's start with the aggregate
7:50step so essentially what I'm doing here is
7:53I'm using the
7:54aggregate step to just find me a unique
7:57value so we'll just bring the coordinate in
7:59here so we'll
8:00see for now there's 493 unique items and
8:04then now that we've grouped that up we can
8:07then just go
8:08ahead and rank these ranking is again my
8:10sort of de facto way of just giving them an
8:13id number for
8:14now without having to do much to them I
8:16could use row number or something like that
8:19as a function
8:20but rank is fine because these are unique
8:24we shouldn't have any sort of any sort of
8:27dense ranking where it counts the same id
8:29twice or anything like that it should just
8:31be a straight
8:32count so let's call this let's call this
8:37coord id I'll use capitals for this and
8:42just hit enter
8:44hit done we need to keep the original
8:46coordinate in the data set so we'll keep
8:48that I'll go and
8:49hit the plus sign for this and then we'll
8:53call this coord we'll call this pixel
8:57because this
8:59is where I'm going to go and aggregate the
9:01pixel color I think there's only a 30
9:03something yeah
9:04there's only 32 colors being used that's a
9:06pretty straightforward thing to add in here
9:09we'll do
9:10exactly the same we'll go and rank these
9:14and we'll go and call this pixel id number
9:19and hit enter
9:21cool so I'm capitalizing this so I can
9:23easily identify them later in the step and
9:26then the last
9:27one is we are going to aggregate we've done
9:31coordinate done pixel user id user id is
9:34just
9:34a hash user id we can't link this back to
9:36real users so let's go ahead and drag this
9:39in to group
9:40it up we don't need to aggregate by
9:42anything and then we'll go ahead add a
9:44clean step to this
9:46and we'll go ahead and rank these to get
9:48our id number we'll rank that nice and easy
9:52and whilst
9:52it's thinking about it I'll call this user
9:58id number cool so we've got the three bits
10:04of
10:04information we'll hit done and now we need
10:07to bring them together and the way I'm
10:10going to do
10:11this is actually just to rearrange my flow
10:13a little bit here add another step here
10:15just to
10:16make it a little bit sort of neater so
10:18essentially what I've done is I've broken
10:20off three strands of
10:21my flow and I should really name this one
10:25bot12 what was this what were this user
10:30okay so I know
10:31which is which and now I can actually just
10:33bring each of these back together to the
10:35main data set
10:36there's no sort of multi-join as something
10:39like altrix has so I can drag this to the
10:42join and it
10:42will automatically map a user id to user id
10:45the join is correct it's doing an inner
10:47join but it
10:48should be an exact match so you could do a
10:50left join or inner join it would be the
10:51same result
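The aggregate, rank and join-back pattern described up to this point can be sketched outside Tableau Prep. This is a minimal Python illustration, not the Prep flow itself: the sample rows, the `build_id_lookup` helper and the sorted numbering are assumptions made for the example, and Prep's own rank may order values differently.

```python
def build_id_lookup(values):
    """Map each distinct value to a 1-based whole number.

    Sorting the distinct values and numbering them stands in for the
    Aggregate step followed by a rank in the flow (hypothetical helper).
    """
    distinct = sorted(set(values))
    return {value: rank for rank, value in enumerate(distinct, start=1)}


# Hypothetical sample rows: (timestamp, user_id, pixel_color, coordinate).
rows = [
    ("2022-04-01 12:44:10", "hashA==", "#FF4500", "42,42"),
    ("2022-04-01 12:44:11", "hashB==", "#FFFFFF", "999,999"),
    ("2022-04-01 12:44:12", "hashA==", "#FF4500", "42,42"),
]

# One lookup per text field, built from the distinct values only.
user_ids = build_id_lookup(row[1] for row in rows)
pixel_ids = build_id_lookup(row[2] for row in rows)
coord_ids = build_id_lookup(row[3] for row in rows)

# Bring the IDs back onto every row. The lookup is an exact match, so the
# row count cannot change. The coordinate text is kept for the later split.
encoded = [
    (ts, user_ids[user], pixel_ids[color], coord_ids[coord], coord)
    for ts, user, color, coord in rows
]

# Same check as in the video: no duplication after joining back.
assert len(encoded) == len(rows)
```

Because every value in the main data appears in its lookup exactly once, an inner join and a left join give the same rows, which is the point made about the join type above.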
10:52and keep going in here I'm not even going
10:55to look at these until the very final step
10:58essentially I'm
10:59just bringing everything back together
11:01again it doesn't really matter which order
11:03I do it in but
11:03this this sort of looks more visually
11:05appealing and if I go to the final clean
11:07step we should have
11:08all our data so everything in capitals has
11:11been added up to the timestamp and
11:13everything after
11:13should either be a duplicate record or some
11:16of the information that I don't need and
11:19what we should
11:22see is roughly the same number of records
11:25so if I go to the coordinate id so 493 and
11:29if I go to
11:30the coordinate 493 so I'm getting the same
11:33number of records in this sample so that
11:35means I've not
11:36caused any duplication 492 656 values I
11:40should get the same number here yes exactly
11:43so there's
11:44no duplication at any point which is good
11:47and now yeah now we can actually start to
11:50do some cleansing
11:51so in order to do this in order to get rid
11:52of these columns I'm actually going to
11:54switch to
11:54not this table view here but this list view
11:57and this list view tells me all the fields
11:59in my data
11:59set so I can now just actually start
12:02dumping pretty much everything but the
12:05coordinates because I still
12:07need to work on the coordinates so that's
12:09what we'll leave and then we'll switch back
12:12to this
12:12summary view and now that we've got this I
12:14need to split these coordinates because
12:17essentially it's an
12:17x and y so let's go ahead and do that let's
12:21go ahead and do a custom split actually not
12:25a custom
12:25split just an automatic split I don't know
12:27why I'm thinking about it because it's a
12:29comma so
12:30Tableau will handle it pretty well and we'll
12:33call this coord x and we'll call this coord
12:39y
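The automatic split on the comma amounts to the following, shown as a small Python sketch. The `split_coordinate` name is hypothetical, and the sketch assumes every coordinate value has exactly two comma-separated parts; any other shape raises an error rather than being silently accepted.

```python
def split_coordinate(coordinate):
    """Split an "x,y" string into two whole numbers.

    Assumes exactly two comma-separated parts; anything else raises
    ValueError from the unpacking or the int conversion.
    """
    x_text, y_text = coordinate.split(",")
    return int(x_text), int(y_text)


# Hypothetical example value.
coord_x, coord_y = split_coordinate("42,1999")
```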
12:40and now we have gotten rid if we remove
12:44this last coordinate by just going to those
12:47three dots
12:48it's actually behind my face here if I move
12:50this to the left there's remove let's just
12:52remove that
12:53put this back and yeah we've got rid of
12:55every text field so in a pretty short
12:59amount of time
12:59we've actually built a pretty nice data set
13:02now so we've got if I just look at this
13:04summary view
13:05we've got all numerical fields and a pretty
13:08good set of field I've done some typos here
13:11so
13:12what what should we call this so this
13:14coordinate id where we're doing it here I
13:19clearly named it
13:20incorrectly so let me just edit this and
13:22what you can do is you can sometimes go
13:24back and just
13:24edit the step where you made the change
13:27this is a better thing to do because if you
13:29just go and
13:30rename it later down the line when someone
13:32comes in corrects the typo it breaks the
13:34flow in the
13:34future you don't want to do that you'll
13:36fall into that mistake as well you'll open
13:38it on maybe three
13:39weeks time you'll completely forget you
13:41made a hash of it change it break it and
13:42then you'll
13:43spend ages figuring out why it doesn't work
13:46when it's only a typo so now that
13:47everything is correct
13:49we're pretty much good to go so now when I
13:52look at this data set let's just look let's
13:54just look at
13:55this in this view here this is an
13:57interesting data set because this is like a
14:00data set that
14:02tracks changes it's not a data set that
14:04shows me the picture at any moment in time
14:06so the way the
14:08canvas works is it starts white and then it
14:10slowly basically changes because people are
14:14coming in
14:14and making changes and what data set they've
14:17shared with us is the changes so imagine
14:19I give
14:20you a word doc but you start with a blank
14:22word doc and then I give you a list of all
14:25the changes that
14:26have been made and it's up to you to figure
14:28out what the final word doc looks like
14:30that's
14:30essentially what's going on here we've got
14:32the list of changes and we've got an
14:34assumed starting
14:34point and this data set tells us how to get
14:37to the sort of final picture or the picture
14:39any moment in
14:39time now if I was to try and create a data
14:42set to do this in Tableau it's slightly
14:44difficult because
14:46in essence I'm trying to think to do this
14:51in essence I'd need some sort of batch
14:55operation
14:56in order to set this up because if I wanted
14:59to visualize this in Tableau if I just
15:01think about
15:01this mentally I always close my eyes when I
15:04do this but I have to get the get get the
15:07data
15:08and create a time scaffold that starts from
15:10the beginning to the end of the sequence of
15:12activity
15:13and then having done that I need to go and
15:16get for each second hour a minute whatever
15:19level of
15:19detail I want to work at I need to go and
15:23get the the last change or well I need to
15:28go and get the
15:29last relevant change for that unit of time
15:31so if I'm looking at a minute level I need
15:33to go and get
15:34all the changes and look and see what the
15:36last change was for that particular minute
15:38and then
15:39I'll have a picture of what that should be
15:41at that moment in time and that's what my
15:43data should look
15:44like for all the coordinates now I'm
15:46looking at the coordinates and the
15:48coordinates are a bit
15:49strange here because I don't think we're
15:52getting the full set of coordinates and now
15:55if I go to
15:56reddit you can actually download an image
15:59this is the image and this is a true scale
16:03and if I
16:05save the image to my media folder you can
16:10save it at 8x but I'm not going to do that
16:14because
16:14then that will give me the wrong size in
16:18pixels so this is the this is the one I'm
16:22looking at
16:23if I get the information it is where is the
16:26pixel size 2000 by 2000 so that is if I get
16:32my phone
16:33let's do the math uh 2000 by 2000 it's four
16:40million coordinate points so that's already
16:45quite a lot of
16:47pixels to sort of that's four megapixels
16:50essentially to to manage um so that's a lot
16:54of coordinates to think about now if you're
16:57going to create that data in in any tool
17:00like the the
17:00easiest way to manage that as a human being
17:04would be to have uh all the coordinates
17:06going from left
17:07to right and then the time going from top
17:09to bottom and then you basically for each
17:11time slot
17:11you just have the color in every coordinate
17:14that's not an efficient way to store this
17:16data set because
17:16you then have four million uh width table
17:20that is not a sort of reasonable so what
17:22you can do instead
17:23is you can pivot those dates to a vertical
17:26format so for each time slot you have a
17:28vertical list of
17:29coordinates and the associated uh pixels
17:32essentially so if I go back to uh Tableau
17:35prep here I'm talking
17:36to thin air here I'll go back to Tableau prep
17:39essentially have it vertically so there's a
17:41lot
17:41of sorting and there's a lot of sort of
17:43process going on but it gets more complex
17:45because of
17:46course we have to make the assumption that
17:49it's quite possible for um let's say
17:52thousands of
17:53people to edit a pixel at the same time
17:56because I mean these are computers so it is
17:58quite easy
17:59for let's say 10 people to try and edit the
18:02same pixel at the same time and if it's
18:04tracking when
18:05those you know pixels are coming in then
18:08and if it's tracking it to the millisecond
18:10it's quite
18:10easy you could get thousands of edits
18:12within one second that's how milliseconds
18:15work so
18:16what I have to also then do is go and see
18:18all the changes within any particular time
18:21period that I'm
18:22aggregating to and get the last change so I
18:25need some level of sorting inside of each
18:29time unit
18:30inside of each coordinate essentially to
18:32try and get the final picture for that
18:33particular time
18:34unit it's quite complex um I'm thinking
18:37there's probably a couple ways I'll do this
18:40in prep
18:41but I'm just thinking about it and I think
18:44I think for what I want to do in this video
18:47at least for
18:47let's call this phase one the more I think
18:49of it the more I think this requires a
18:51little bit more
18:52sort of thought I need to approach this in
18:55a very sort of efficient way and I need to
18:58go and test
18:59this in Tableau to make sure that I'm at
19:01least on the right lines and what I've got
19:02here is actually
19:03going to work so I'm going to leave this
19:06here and I'm going to export this into a
19:08file and I'll think
19:11more about that data problem and think if
19:13actually prep is the right place to try and
19:16do it because
19:17I feel like I'm going to hit a wall pretty
19:19quickly with Tableau prep because it's not
19:23limitations but
19:24it's not as mature as something like Alteryx
19:26and at least in Alteryx I know a way I could
19:28do this
19:29using a batch macro essentially chop the
19:32problem down into time or coordinate
19:35constituents and then
19:36now that I've done that I can actually
19:39write out the output for each part of the
19:41flow as it happens
19:43so if I leave this running for five hours
19:45and it fails after four hours I haven't
19:47lost four hours
19:48of work it's actually been writing those to
19:50file prep can't do this I have to
19:52essentially run the
19:53whole flow and wait for it to run so I'd
19:55quite like to be able to output the outcome
19:57as it's
19:58going so if it crashes I can run it again
20:00it's inevitable something's going to break
20:02here so
20:03I've just justified to myself now that I'm
20:05not going to do this in prep so we'll just
20:08output
20:08what we have to a hyper file let's go over
20:12to the desktop here let's go put this
20:16r/place and for
20:18this next phase I think I'm going to have
20:21to I think I'm going to have to do this on
20:25Windows
20:26rather than on the Mac so I'll explain why
20:29in a second call this v1 hit accept and hit
20:34run so
20:35while this is running this is going to take
20:37time no doubt the reason I'm going to have
20:39to do some
20:40windows is the Mac version of tableau is
20:42still not optimized for Mac so it's still
20:46running Rosetta
20:47and because of that it's actually a little
20:49bit slower it can be temperamental and I
20:51don't want
20:52that to be the reason this doesn't work and
20:54secondly I think my computer is actually
20:56going
20:57to be faster it's got more resources more
20:59RAM and more CPU and the whole lot so it
21:02will actually be
21:02able to I think keep up with the scale of
21:04this data which the moment is 21 gig but we'll
21:07see
21:07what tableau prep writes this to when it's
21:10done very very shortly so this is going to
21:12take a bit
21:12of time my guess is probably up to half an
21:15hour given that this is like I don't know
21:18how big this
21:18is a million records would be about 100 meg
21:23so I definitely think that something as big
21:27as this
21:28data is going to be much much bigger
21:30bearing in mind it's also saved as text we
21:32don't know what
21:33it's going to get compressed to so we'll
21:35leave this running and yeah when it's done
21:37I'll be back
21:38and we'll walk you through the final part
21:39of the video where we'll try it out in Tableau
21:41so
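The change-tracking problem described earlier (keep only the last change to each coordinate within each unit of time) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not something built in the video: the tuple layout, the hourly bucket and the function names are all hypothetical.

```python
from datetime import datetime


def hour_bucket(timestamp):
    """Truncate a timestamp to the start of its hour."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


def last_change_per_bucket(changes):
    """Return {(hour, coord_id): pixel_id} holding the final edit per hour.

    `changes` is an iterable of (timestamp, coord_id, pixel_id) tuples.
    Processing in timestamp order lets later edits overwrite earlier ones.
    """
    latest = {}
    for timestamp, coord_id, pixel_id in sorted(changes, key=lambda c: c[0]):
        latest[(hour_bucket(timestamp), coord_id)] = pixel_id
    return latest


# Hypothetical edits: coordinate 1 is changed twice in the same hour.
changes = [
    (datetime(2022, 4, 1, 13, 5), 1, 3),
    (datetime(2022, 4, 1, 13, 50), 1, 7),
    (datetime(2022, 4, 1, 14, 1), 1, 2),
    (datetime(2022, 4, 1, 13, 20), 2, 5),
]
snapshot = last_change_per_bucket(changes)
```

This only produces rows for hours in which a coordinate actually changed. A full picture of the canvas at every hour would still need the time scaffold mentioned earlier, carrying each coordinate's last known colour forward into hours with no edits.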
21:51okay so it's the next day I actually took a
21:55break the file took about 33 minutes to run
21:59we ended up with 160 million records I
22:02actually go to my folder here you can see
22:04that I have the
22:05hyper file the extracted bar and we went
22:08from 21.71 gigabytes to 2.78 gigabytes that's
22:12almost
22:12a factor of 10 in terms of reducing the
22:15size of that data right down purely by just
22:17optimizing
22:18changing things numerical fields and
22:19optimizing now if I hadn't done that I
22:21probably would have
22:22gotten closer to about half that number so
22:24about 12 gigabytes just because when Tableau
22:27creates an
22:28extract it's actually also zipping it and
22:30compressing it itself so that's actually
22:32the
22:32fact maybe we'll try and do that some other
22:34time but we now have the file we're ready
22:36to go and
22:37work with it over in tableau now the thing
22:40I'm going to do is I'm not going to use
22:42tableau on my
22:43m1 mac now the tableau software itself is
22:46actually works fine on an m1 mac the
22:49problem is when it
22:50comes to data sources the compatibility isn't
22:52sort of 100% and there's sometimes quirks
22:55in terms of
22:57performance things just running slower and
22:59things not being 100% and I don't want to
23:01get in the way
23:02of sort of just showing what you can do
23:03with tableau on what most people have which
23:06is a
23:06windows laptop especially in a business
23:08context so what I'm going to do is I'm
23:09going to remote into
23:11my desktop pc which is just literally near
23:13here and I just do that through microsoft
23:15remote desktop
23:16and when we do that we just open up this
23:18particular tab here now what I'm going to
23:20do is I'm going to
23:21close this tab here and that was actually
23:23like a sample workbook for this video and
23:25now what we're
23:26going to do we're going to start from
23:27scratch I'm going to show you how to open
23:29this file and we're
23:30just going to get going the file itself is
23:31already on my desktop so we can actually
23:33start working
23:34with this and if we go ahead and open up
23:36this file you'll see that this actually
23:38also ran at midnight
23:40so we'll just open this and we'll get there
23:42a little bit quicker and now you can see we've
23:45got
23:45the core data set sort of pretty much ready
23:48to go everything is sort of set out as we
23:50expected it
23:51and the fact that it opened really quickly
23:53is really positive the main reason that is
23:55is because
23:56essentially if Tableau can open the file
23:58and we can get straight into it then we
24:00know we can
24:00already start to work with at least a very
24:02basic level of aggregation so for example I
24:05want to
24:05count the number of records in this data
24:07set I can actually just go ahead put it in
24:09and 160 million
24:10comes back almost instantaneously I want to
24:12break this down by pixel id which is
24:14essentially the
24:15different colors then again we get an
24:17almost instant response so for at least
24:19very ad hoc
24:20analysis this is working absolutely fine
24:22you wouldn't even know the difference
24:23between this
24:24size data set and maybe an excel file
24:26everything is working as you'd expect now
24:29for the next thing
24:30what I want to do is actually start
24:31building something and at least for this
24:33video what I'll do
24:34I just want to recreate some sort of canvas
24:36if that makes sense and I want to be smart
24:38about this
24:38because there's a couple ways of working in
24:40Tableau when you work with large data sets
24:42sometimes
24:43you have to get a little bit smart and
24:44there's this feature here on the top left
24:47where you can
24:47pause things now what it means essentially
24:50is every time you do something in Tableau
24:52it's not
24:52going to try and react and build what you're
24:54trying to do and this is actually useful
24:56because when
24:56you're working with big data sets you've
24:58got to think of it like a set of
24:59instructions would you
25:00rather give someone a big list of
25:02instructions so they can look at it and
25:03optimize how they're going
25:04to get all that work done would you rather
25:07keep giving it to them one bit at a time
25:09for them to
25:09only then get frustrated I could have done
25:11this in a different way if you'd given me
25:12all the tasks
25:13ahead of time and that's essentially what's
25:15happening here by hitting the pause we're
25:17basically going to build our visualization
25:18completely blind and then when we're done
25:20we'll
25:20hit play and Tableau will take all of those
25:22instructions and process them in one go
25:25that
25:25means we run a query once and then it
25:26renders everything once rather than having
25:29to do it every
25:29single time we drag something on so what
25:31you've got to realize is every time I do
25:33something it's
25:34not having to do that work until I'm
25:36completely done so it ends up being a lot
25:38faster so for this
25:38one we're going to start by bringing the
25:41coordinates in I'll put the x on the
25:43columns and
25:44y on the rows now Tableau has a tendency to
25:47want to aggregate everything which for some
25:49people is
25:50really weird but actually most analytical
25:52context that's literally how most analytics
25:54is done
25:54aggregation happens nearly all the time
25:56unless you're looking at row level data at
25:58which case
25:59um you might as well be looking at a file
26:01in excel you don't need to come to Tableau
26:02to do that
26:03and if I go in here and break this down
26:05there's a couple of ways of aggregation now
26:08what I want
26:08to do is actually select average and I'll
26:10explain this shortly but let me just go and
26:12select both of
26:13these in essence these are the coordinates
26:15and the coordinates are unique to one
26:17particular pixel
26:19and so what I want to be able to do is
26:21number one count how many data points are
26:23behind a
26:24particular pixel but also I want to try and
26:27build a heat map and what I don't need is
26:29for it to be
26:30aggregating those coordinates in any way
26:32because I split the x and the y I don't
26:34need it adding up
26:35all the x coordinates that's just they're
26:37not going to work so by choosing an average
26:39it's
26:39basically going to calculate the average
26:41which will take that coordinate position
26:42for the x or
26:43the y and multiply by however many records
26:45there are then divide it by the number of
26:48records and
26:48then we're basically back to square one the
26:50exact same number that's essentially
26:52just the
26:52simplicity of the math and if we go ahead
26:55and take the count here what we're going to
26:57do is we're
26:58going to put it on size this is going to
27:00shape the circle that we're going to use
27:02based on the
27:03number of records on that particular pixel
27:05so again nothing is happening visually here
27:07we're
27:08just building this blind because we've got
27:10the pause button now the size will vary but
27:12I could
27:12sort of play around with this I don't
27:14really know what I'm going to get so I'll
27:15leave it just in the
27:16middle until we actually see something and
27:18then we'll come back to it now the other
27:20thing I need
27:20to do is if I want to see things at a
27:22coordinate level at the moment it's aggregating
27:24everything
27:25and doing the average and so unless I tell
27:27it that I want to work at the level of
27:29detail of a
27:29coordinate it's not going to work the way I
27:31know that is there's nothing here on detail
27:34showing
27:34that we're working at that level so let's
27:36go ahead and grab the coordinate id which I
27:38created in
27:39Tableau Prep drop that on detail and now you
27:41can see that we're working at that level of
27:43detail
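The view built up to this point (coordinate ID on detail, AVG on x and y, a record count on size) can be sketched outside Tableau. This is a minimal illustration only: the sample rows and the names `edits` and `summarise` are hypothetical, not taken from the workbook. It shows why AVG hands back the original coordinate while SUM would not.

```python
from collections import defaultdict

# Hypothetical sample rows: (coordinate_id, x, y), one row per edit.
edits = [
    (1, 10, 20),
    (1, 10, 20),
    (1, 10, 20),
    (2, 11, 20),
]

def summarise(rows):
    """Group edits by coordinate id, then aggregate x, y and the record count."""
    groups = defaultdict(list)
    for coord_id, x, y in rows:
        groups[coord_id].append((x, y))

    marks = {}
    for coord_id, points in groups.items():
        n = len(points)
        sum_x = sum(x for x, _ in points)
        sum_y = sum(y for _, y in points)
        marks[coord_id] = {
            "sum_x": sum_x,        # what SUM would plot: drifts with the edit count
            "avg_x": sum_x / n,    # what AVG plots: the original x position
            "avg_y": sum_y / n,    # likewise for y
            "edits": n,            # the count that goes on Size
        }
    return marks

marks = summarise(edits)
for coord_id, mark in marks.items():
    print(coord_id, mark)
```

Because every row behind a coordinate ID carries the same x and y, the average collapses back to that one position regardless of how many edits there were.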
27:44now that we're there we can actually start
27:47to think about what we want to do we can
27:49essentially
27:50start to visualize this almost I mean if I
27:52hit play right now we would get something
27:54but it wouldn't
27:54be that useful so I want to make a couple
27:56of changes I'm going to hit color here and
27:59I'm going
27:59to select the color red now red is an
28:01interesting color because essentially it's
28:03very good for
28:04looking at density because it's actually a
28:07really nice sort of contrast color and if
28:09you sort of
28:10ramp the opacity down and you layer tons of
28:12red on top it's actually easier for your
28:14eye to see
28:14the differences between those sort of densities
28:17and if you choose other colors like
28:19blue it can
28:20actually be sometimes harder so I'm just
28:22going to drag this right down to one
28:23percent I'm doing that
28:24because I know that will actually sort of
28:26make this pop some of these coordinates
28:28have like 100,000
28:29edits behind them so in order to see that
28:32by reducing it by a factor of 100 we
28:34actually are
28:35able to see at least things that have had
28:37over you know 10,000 edits in a sort of
28:40reasonable way
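A rough way to see why a very low opacity helps with density: if overlapping marks are blended with ordinary source-over alpha compositing (an assumption here, the video does not say how Tableau composites marks), then n overlapping marks at opacity a build up to a combined opacity of 1 - (1 - a)^n. The function name below is hypothetical.

```python
def stacked_opacity(alpha: float, layers: int) -> float:
    """Combined opacity of `layers` overlapping marks, each drawn at `alpha`.

    Assumes standard source-over blending: each layer lets (1 - alpha)
    of whatever is beneath it show through.
    """
    return 1.0 - (1.0 - alpha) ** layers

# Compare how quickly the colour saturates at 1% versus 50% opacity.
for layers in (1, 10, 100, 500):
    low = stacked_opacity(0.01, layers)
    high = stacked_opacity(0.50, layers)
    print(f"{layers:>4} overlapping marks: 1% -> {low:.3f}, 50% -> {high:.3f}")
```

At 50% the colour is effectively solid after a handful of overlaps, so busy and very busy areas look the same. At 1% there is still visible headroom after hundreds of overlaps, which is what lets the dense regions stand out from the merely active ones.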
28:41and now the next thing to do now that I've
28:44changed the size and changed the color I
28:46probably want to
28:47make the first query I run a little bit
28:49more optimized so I want to maybe filter
28:51down to one
28:51timestamp now you could filter what I'm
28:54actually going to do is create a
28:55calculation first and the
28:56reason I'm creating a calculation is because I
28:58want this to be smart and I want to be able
29:00to use it
29:00to do some sort of basic time lapsing so I'll
29:03highlight the timestamp and we'll go and
29:06look
29:06at the date functions here and we'll use
29:08the date truncation if you don't know how
29:10to use this
29:11Tableau has this really handy example
29:13always here on the right hand
29:15side so what
29:16I can do is I can highlight timestamp
29:17double click DATETRUNC and it actually
29:19wraps that for me and
29:20then I can just go ahead and finish the
29:22function by saying hour and then putting a
29:24comma after that
29:25and now we're going to truncate essentially
29:27what this is going to do if the time is 1:59
29:29and 59
29:30seconds it's just going to chop off the 59
29:33minutes and 59 seconds and just leave it at
29:35one o'clock
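In Tableau the calculation is along the lines of DATETRUNC('hour', [Timestamp]), where the field name is an assumption. The same bucketing can be sketched in Python to show what happens to each value; the sample timestamps and the helper name `trunc_to_hour` are made up for illustration.

```python
from collections import Counter
from datetime import datetime

def trunc_to_hour(ts: datetime) -> datetime:
    """Drop minutes, seconds and microseconds, keeping the date and the hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Hypothetical edit timestamps.
timestamps = [
    datetime(2022, 4, 3, 1, 59, 59),
    datetime(2022, 4, 3, 1, 0, 0),
    datetime(2022, 4, 3, 2, 15, 30),
]

# Every edit lands in the bucket for the hour it was made in.
edits_per_hour = Counter(trunc_to_hour(ts) for ts in timestamps)

for hour, count in sorted(edits_per_hour.items()):
    print(hour.isoformat(), count)
```

Both 1:00:00 and 1:59:59 end up in the one o'clock bucket, so a filter or a Pages control only has to offer one value per hour rather than one per distinct timestamp.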
29:36and the reason we want to do this is
29:37because we'd like buckets of time and
29:39essentially I want to see
29:40all the edits made within an hour which
29:42pixels which dots were edited in a specific
29:45hour what
29:46I'm hoping that does is it reduces the size
29:48of the data down bearing in mind we've got
29:50160 million
29:51records and I think this experiment ran
29:53over two days we should be reducing the
29:55data size right
29:56down so let's go ahead and do this let's
30:00call this time trunc hour and hit okay and
30:03now that we've
30:04done that we're going to drag this into the
30:06filters now this is this is sort of
30:07interesting
30:08because we've paused everything but you can
30:10actually still set the filters and I'm
30:11going to
30:12select individual dates and time because I've
30:14already truncated it so I don't need to
30:16worry
30:17about you know choosing the hours through
30:19like some sort of slider I can actually
30:21just choose
30:22this go to next and go to select from this
30:25and it's able to load those time slots and
30:27now you
30:27can see everything is pretty much aggregated
30:29to the hour so it started at midnight
30:31on the first
30:32of April and it ended on the 5th of April
30:34at midnight so essentially on the 4th of
30:36April at
30:37midnight so a four-day experiment and it's
30:39a really sort of good sort of scope of
30:41information
30:42to see there so we can actually just go and
30:44pick a particular uh you know hour period
30:46so I'm going
30:47to go to the middle um and we're going to
30:49pick these two so see what was happening
30:51around
30:51midnight on the 3rd of April so I'm going
30:54to hit okay that will set this filter over
30:57here and then
30:58later on we'll actually take this filter
30:59and do something else with it this
31:01particular field will
31:02do something else with it so we'll be able
31:04to play with it so now that this is done I
31:06think we're
31:06pretty much ready to go now what I will do
31:09is let me bring up my computer specs here
31:11so you can just
31:12see sort of what I'm doing this on I've got
31:13a really powerful computer here it's not
31:15like an
31:15everyday computer I use it for lots of
31:17things including editing videos gaming a
31:19whole bunch
31:20of stuff so it's slightly over-specced for
31:22that particular use case um but you can see
31:24sort of the
31:25level of usage and actually when you're
31:27working with data this big um it's amazing
31:29how often
31:30businesses don't realize or don't resource
31:32their staff properly because when you're
31:34working with
31:34data that's this big it really helps to
31:37have things like memory like I've got uh 64
31:39gig of
31:40RAM I've already got 13 gig in use uh just
31:43having two applications open Tableau and
31:46now I'm about to
31:46run it through some queries you're going to
31:48see this uh sort of go up and down the only
31:50thing to really
31:51pay attention to is CPU and memory so let's
31:54go ahead and hit play and as it's doing
31:57that we'll
31:57just go back to this view and actually it
32:01was instantaneous I love it it didn't even
32:04it didn't
32:05even take that long I thought this would
32:06take a while I genuinely thought this would
32:08take a while
32:09um but um that's that's just crazy I can't
32:13believe that and at the bottom here we're
32:16working with
32:17532,000 records actually of course it didn't
32:20take that long um in Tableau that number
32:24needs to be
32:24about 3 million before you start to see a
32:26problem and maybe it's because my computer's
32:28powerful
32:29again but um you know I've seen laptops
32:31happily handle 2 million data points as
32:34well so again
32:36I don't know why I'm surprised um but maybe
32:39again maybe I'm uh maybe maybe I'm missing
32:42something but
32:42if you don't think this is amazing uh fair
32:45game I can I totally understand why um this
32:48axis is
32:49slightly misleading though because if you
32:51think about the way the uh r/place was
32:53run let's go
32:54back to uh the website let's see if we can
32:56open this up um the actual image actually
32:59we don't need
33:00to open up we can just actually look at the
33:02image let's open it here um the top left
33:05pixel the very
33:05top left is zero zero and so our axis
33:07although it's correct in traditional sense
33:10it's actually
33:11wrong so we need to edit this axis and uh
33:14reverse it essentially so that the zero is
33:16at the top and
33:17then I think this will start to look more
33:19correct and it's a square which makes sense
33:22because this
33:22whole thing was a square and um yeah now we
33:25can actually start to sort of uh see okay
33:28the other
33:28thing I've done uh I think I've done wrong
33:31here is have I got this the wrong way round
33:34I'm trying to look at this and gauge
33:36have I got this the right way around so
33:39what I can do
33:40is I can get I can head to the very final
33:43set of time let's just let's just do this
33:47let's just go
33:47to around about here and I know there's a
33:49lot of activity towards the end so let's
33:51just hit apply
33:52and leave this filter open and now it is
33:55taking a bit of time there we go now this
33:58is the right way
33:59around what I was essentially looking for
34:00is this character here because I know
34:02roughly around this
34:03time uh the flag of France and this
34:05character there's sort of a bit of a war
34:07happening there
34:08um and between the communities so this is
34:10actually all correct now so this is working
34:13perfectly fine
34:14so this is cool um this works this concept
34:16works the thing I need to now sort of
34:19figure out is how
34:20do we build that um scaffold to allow us to
34:22actually recreate it pixel for pixel and
34:25then
34:26um having done that what I also want to do
34:29is I want to be able to um sort of animate
34:33this and
34:34we can actually try this now what I'm going
34:36to do is I'm going to grab this time and I'm
34:38not sure
34:38how this is going to go I think it will
34:40sort of panic initially and then it should
34:42run fairly
34:43quickly so let me pause this first I'm
34:45going to put this up here now the Pages
34:47shelf is a feature
34:48in Tableau that essentially takes whatever
34:50dimension you put there and treats it like
34:53a page
34:53in a book so in this case we've put the
34:55hour on there so if you take the four days
34:57of activity
34:58every hour becomes a page in a book and
35:00then what we can do with these controls on
35:02the right is
35:03essentially play through them so this will
35:05play through the changes over the uh period
35:08essentially
35:08and we can actually just not necessarily
35:10animate it but we can actually kind of see
35:12the changes
35:12in our hour level sort of uh snapshot so um
35:15I can't load anything because everything
35:18the view is
35:18paused so what this is going to load the
35:21first time is going to load the first page
35:23and the first
35:24page has very few edits because when this
35:26went live the first set of edits in the
35:27first hour
35:28were just like five or six pixels before
35:30the community started to mobilize and
35:32really organize
35:33and then stuff started happening so let's
35:35go ahead and hit play and we'll see how
35:37long this takes
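The page-in-a-book idea can be sketched as a dictionary of hourly snapshots. This is a conceptual illustration only and makes no claim about how Tableau implements the Pages shelf internally; the sample rows and the name `build_pages` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical edits: (timestamp, coordinate_id).
edits = [
    (datetime(2022, 4, 1, 13, 5), 1),
    (datetime(2022, 4, 1, 13, 40), 1),
    (datetime(2022, 4, 1, 14, 10), 2),
    (datetime(2022, 4, 1, 14, 55), 1),
]

def build_pages(rows):
    """Return {hour: Counter(coordinate_id -> edits in that hour)}, ordered by hour."""
    pages = defaultdict(Counter)
    for ts, coord_id in rows:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        pages[hour][coord_id] += 1
    return dict(sorted(pages.items()))

pages = build_pages(edits)

# "Playing" the pages just means stepping through the hours in order.
for hour, marks in pages.items():
    print(hour.isoformat(), dict(marks))
```

Each page only holds the edits made in its own hour, which matches what the view shows at this stage: an hourly snapshot of activity rather than the accumulated state of the canvas.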
35:45now the paging function is going to be
35:47slightly different because
35:49I think what's going on is it loads up the
35:51different pages into some sort of
35:53frame whereas um in the filter pane it's
35:56actually you know filtering the data before
35:59it then runs
35:59a query so for this one it actually has to
36:01have the whole data set in in memory in
36:04order to
36:04understand you know which pages um go where
36:06so that's why I think this takes longer
36:08than just
36:09rendering the exact same thing but filtered
36:12um that's probably a really bad explanation
36:14I'm sure
36:14someone will be able to come on here and
36:17correct me so let's just let this finish um
36:20I think it is
36:21still uh it's not doing the query it's just
36:23computing the view but I think the paging
36:25is
36:25what's sort of really complex in this and
36:27what I'll do is I'll just skip ahead to
36:29when this finishes
36:30and we'll sort of roughly gauge how long it
36:33took okay so it's loaded it doesn't look
36:35like anything's
36:36happened but if I move my mouse over here
36:38you can just see those dots there so um
36:40because they're
36:40really transparent uh all the data points are
36:42there we're just not in the hour where a
36:45lot happens
36:45so let's go over here and let's choose a an
36:48hour that's more in the middle of the park
36:51and we should
36:52get better response time now because the
36:54Pages shelf has already loaded the filter
36:56into um sort
36:57of the view and there you go we do get a
36:59better response now what happened here is
37:01that the
37:02different quadrants of the visualization of
37:04the uh sorry the image were opened up at
37:07different times
37:07so only the top half is filled and we're
37:09actually going to just try something going
37:11to hit the play
37:12button see if this actually works so we're
37:13going to see if Tableau is actually able to
37:15play through
37:16each of these hours and we should sort of
37:18see it changing so and what I expect to see
37:21is you see
37:22yep you saw that change there that this
37:24moves and then it goes forward and then it
37:26keeps moving and
37:27if you do this over time essentially you're
37:30then able to animate this and uh sort of
37:32get it working
37:33so you can see lots of different things uh
37:35sort of turning up on the screen there's a
37:37few not
37:38safe for work sort of parts of this that's
37:40why I'm keeping this deliberately zoomed
37:42out um and yeah
37:43it's working it's completely sort of
37:45seamless now there's also less data points
37:48here so it's actually
37:49easier because the scope of the data is is
37:52sort of not that large but as the whole
37:54canvas opens up I
37:55expect this to take a little bit longer
37:57with each page and so what does happen is
37:59at some point they
38:00open up the bottom half of the canvas and
38:02there's a big explosion you can kind of see
38:04flags turning
38:04up and if I just sort of highlight this
38:07here that definitely looks like a flag
38:09these squares tend to
38:10be things like flags and so if you go look
38:13at the actual image if I actually go back
38:16over here you
38:16can see that that is indeed the I think it's
38:19the Turkish flag um and you see a couple
38:21of other
38:22squares sort of forming here and it's
38:24actually quite a cool uh sort of uh space
38:27um and yeah it
38:28just keeps going it just basically keeps
38:30keeps changing so yeah this is working um I'll
38:33hit stop
38:33for now just to stop the animation going on
38:35and what I'll do is I'll jump forward to
38:37when there's
38:38something across the whole canvas and we'll
38:40just go on this little drop down and we'll
38:42choose
38:42something maybe um I can't go right to the
38:45end because uh before the end what they did
38:47is they
38:47changed it so the only edits you could make
38:50were edits to uh change everything back to
38:52white and
38:52essentially it stayed running until
38:54absolutely everything had changed over but
38:56here you can see
38:57the whole canvas so this is working so I
38:59think I'm going to stop here for part one
39:02of the video anyway
39:03and the next thing to do is to say okay we've
39:05got this bit working in Tableau we know
39:07we can
39:07sort of visualize these dots and I'm not
39:10sure about the size of the dots once we
39:12have the
39:13different colors in we can actually try and
39:15see if we can start to recreate things at
39:17the moment
39:17all I'm doing is just basically generating
39:20a heat map and it's kind of working you can
39:22kind of see
39:23some of the data points uh working quite
39:25nicely um but in essence I don't think
39:27these circles are
39:29small enough to represent uh sort of the
39:31fidelity that we need so when we start to
39:34actually get the
39:35color in and we create the sort of data
39:37framework that we need to be able to
39:39visualize all of this
39:40then I think we'll be in a great
39:43place so yeah um Tableau completely handled
39:46this data set
39:47I think it's really good that Tableau Prep
39:48was able to get us to this place where we
39:50can just
39:50start to throw this data around and what it
39:53also allows us to do is maybe start to
39:55enhance this so
39:56we can associate some metadata about the
39:59colors associate metadata about specific
40:01quadrants
40:02at the moment you've got these coordinate
40:04ids but what you could do is you could
40:06potentially bring
40:06in another data source which represents
40:09identified coordinates that specific
40:12communities were
40:13targeting so you could essentially zone out
40:16different parts of this area and you know
40:19recreate a sort of a colored map that shows
40:21you those zones where communities were sort
40:23of
40:23actively battling and then you could sort
40:26of also use a parameter to control specific
40:29quadrants of
40:29this and cut it up in lots of different
40:31exciting ways but for now it looks rather
40:33unassuming um
40:34but I think that's sort of the fun of this
40:36you kind of have to think ahead now when it
40:38goes back
40:38to the data set I think earlier on I was
40:40talking about you know what do we need to
40:42do in Tableau
40:43Prep um now that I've thought about it
40:45overnight as well I think the by far the
40:47easiest thing we
40:48need to do is we need to get a scaffold
40:50essentially going and for the scaffold what
40:53we need is all the
40:55possible minutes or seconds in this whole
40:57entire experiment along one side and for
41:00each of those
41:01slots of time I think minutes will make
41:04better sense seconds is just too
41:06much fidelity
41:08what we want to do is to capture the
41:10changes in each of those minutes and
41:13persist the change if
41:14no change is made so it's essentially like
41:17a fill down in excel if there's nothing in
41:19the row below
41:20then fill it down until we get to another
41:22change for that specific coordinate and we're
41:24doing that
41:25on a coordinate level so we take each
41:26coordinate and we basically say hey what
41:28was the last value
41:29for this coordinate if there is no new
41:31change from any user then we're going to
41:33persist that value
41:34and keep working our way down and that's
41:36actually quite an intensive thing in
41:37Alteryx like I said
41:38earlier on I think the way I do that is
41:40using a macro but in Tableau Prep there's
41:42probably a good
41:43way of doing it it's going to massively
41:45increase the I think the length of time
41:47that this flow will
41:48run but we're going to keep persisting and
41:50so we'll call that maybe part two maybe in
41:52a week's
41:52time when I have a bit more time to dig
41:54into this I will be able to do that or
41:55maybe you have time
41:56to do that over this week to sort of take
41:58on the rest of this challenge and by all
42:00means grab the
42:01data set grab the flow reuse it change it
42:03do whatever you want with it publish your
42:06viz on
42:06Tableau Public if you get to it before me I'd
42:08love to see sort of what people build out
42:09of this
42:10I think it's just great to showcase that
42:12Tableau is probably one of the few tools
42:13that can pick
42:14up this data set and immediately start to
42:16show results in what is effectively a very
42:18short amount
42:18of time okay thanks for watching I'll catch
42:20you in the next video we'll catch you in
42:22part two I'm
42:23going to carry on pushing through my 22.1
42:26feature updates and then we're doing
42:29something really
42:30exciting at the moment I can't talk too
42:32much about it and last year I said I was
42:33going to do a big
42:34push towards new technologies and that has
42:37kicked off in earnest and we're doing
42:39content around
42:40Tableau Server which will be coming out
42:42soon and I can't talk too much about some
42:44of the other stuff
42:45I just want you guys to see it when you see
42:47it and we'll get going but for
42:49now that's
42:50pretty much it from me I'm super excited to
42:52be you know in the space making content for
42:54you guys
42:54and I'll catch you in the next video
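The scaffold and fill-down idea described under next steps (from around 40:48) could be sketched roughly as below. This is one possible approach under stated assumptions, not the method used in the eventual part two: the sample data, the colour names and the function `fill_down` are all hypothetical, and it works in memory on a tiny example rather than in Tableau Prep or Alteryx.

```python
from datetime import datetime, timedelta

# Hypothetical change log: (timestamp, coordinate_id, colour).
changes = [
    (datetime(2022, 4, 1, 13, 0), 1, "red"),
    (datetime(2022, 4, 1, 13, 2), 2, "blue"),
    (datetime(2022, 4, 1, 13, 3), 1, "white"),
]

def fill_down(rows, start, end):
    """Build one row per minute per coordinate, carrying the last colour forward."""
    # Keep the latest colour for each (minute, coordinate) pair.
    latest_in_minute = {}
    for ts, coord, colour in sorted(rows):
        minute = ts.replace(second=0, microsecond=0)
        latest_in_minute[(minute, coord)] = colour  # later edits in the same minute win

    coords = sorted({coord for _, coord, _ in rows})
    current = {}   # last known colour per coordinate
    scaffold = []
    minute = start
    while minute <= end:
        for coord in coords:
            if (minute, coord) in latest_in_minute:
                current[coord] = latest_in_minute[(minute, coord)]
            if coord in current:  # skip coordinates nobody has touched yet
                scaffold.append((minute, coord, current[coord]))
        minute += timedelta(minutes=1)
    return scaffold

scaffold = fill_down(changes, datetime(2022, 4, 1, 13, 0), datetime(2022, 4, 1, 13, 4))
for row in scaffold:
    print(row[0].strftime("%H:%M"), row[1], row[2])
```

The output has a row for every minute in which a coordinate holds a colour, so its size grows with the number of coordinates multiplied by the number of minutes. That growth is the reason the step is expected to be expensive on the full dataset.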
Future-proof your career https://n1d.io
We took the r/place data set from the 2022 edition and tried to see how far we could get with recreating the canvas.
The workbook, compressed dataset & flow: https://www.dropbox.com/sh/8glv542wrcskevw/AADq4SMeu0jq_-m-7XubXaLHa?dl=0
About r/place
“r/place has proven that Redditors are at their best when they collaborate to build something creative. In that spirit, we are excited to share with you the data from this global, shared experience.”
Timestamps
0:00 Intro
0:38 What is r/place
4:03 How Tableau Prep samples data
4:45 Planning out what we’ll do
6:48 Creating some better Identifiers
10:07 Bring the identifiers back into the data stream
11:58 Cleaning up the data set
13:54 What this data set actually shows
20:05 Outputting the file
21:54 Evaluating the output from prep
23:35 Opening the data set in Tableau Desktop
25:40 Building the canvas
37:09 Animating r/place in Tableau Desktop
40:41 Next steps
#place #tableau #analytics #data #rplace
Follow me on Twitter: https://twitter.com/TableauTim
My recording gear & what’s on my desk. https://kit.co/TableauTim/desk-setup
My website: https://www.tableautim.com/
My Screen Annotation Tool: https://j.mp/3HWc4Mj
My technology Channel: https://j.mp/3F0d28f
Share feedback and Suggestions: https://tableautim.canny.io/suggestions
Join this channel to get access to perks:
https://www.youtube.com/channel/UC7HYxRWmaNlJux-X7rNLZyw/join
(C) 2023 TN-Media LTD. No re-use, unauthorized use, or redistribution, of this video without prior permission.