0:00Hey, it's Tim here.
0:00In today's video,
0:01Tableau have added one of the most
0:03important features
0:04in Tableau Prep, Stratified Sampling.
0:06If you don't know why it's important,
0:08I'm gonna break down sampling in Tableau
0:10Prep in this video,
0:11and then we're gonna explain why Stratified
0:13Sampling
0:13is gonna take that to the next level
0:15and make Tableau Prep more performant and
0:17stable
0:18as you work through your data.
0:19To find out more, as ever, let's get stuck
0:21in.
0:22So when I first heard about Stratified
0:24Sampling,
0:24frankly, I didn't know what it meant.
0:25I'd never come across this term Stratified
0:28Sampling,
0:28and that maybe speaks to my experience.
0:30I'm used to working with all the data all
0:32the time.
0:33I've never really had to work with a
0:34dataset
0:35where I've had to sample it in the past.
0:37And so what I did is I reached out to
0:39Google
0:39and I came across this really nicely
0:41explained example
0:42on Kaggle. Now,
0:43this has been put together by Jardel
0:45Nascimento.
0:46I hope I've said that correctly.
0:47And he's doing it in the context of Python.
0:50And he actually has a really nice diagram
0:51here
0:52that explains the concept of Stratified
0:54Sampling.
0:54Essentially what you're doing
0:56is you take your population of data
0:58and then you choose a dimension
1:00that you're going to use to group that data
1:02with.
1:02In this particular case,
1:03if we look at this section over here, it's
1:06using gender.
1:06So he then splits his data into two groups.
1:09And then from those two groups,
1:11there's a random selection done.
1:12And then you get to the sample essentially.
1:14So you're sampling your data,
1:16but you're using some sort of grouping
1:17to make that sampling more representative
1:20of the challenge that you're looking at.
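
To make the idea concrete, here is a minimal Python sketch of the process described above: group the rows by one dimension, then draw a random selection from each group. It is not the code from the Kaggle example; the function name `stratified_sample`, the `fraction` parameter, and the toy population are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows, at random, from each group of `key`."""
    rng = random.Random(seed)
    # Step 1: split the population into groups using the chosen dimension.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Step 2: randomly select from each group, keeping at least one row per group.
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 rows in one gender group, 40 in the other.
population = [{"id": i, "gender": "F" if i < 60 else "M"} for i in range(100)]
sample = stratified_sample(population, key="gender", fraction=0.1)
counts = {g: sum(1 for r in sample if r["gender"] == g) for g in ("F", "M")}
print(counts)  # {'F': 6, 'M': 4}
```

Because every group contributes the same fraction, the sample keeps the 60/40 split of the population, whichever individual rows the random draw happens to pick.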
1:21So the context really matters here
1:23because in the context of Tableau Prep,
1:25what we actually care about isn't
1:27necessarily the output
1:28because Tableau Prep always runs all the
1:30data
1:30through the output.
1:31What we care about here is when we're doing
1:33data prep,
1:34being able to see what's going on with our
1:36data
1:37whilst using the sampling setup.
1:39So what I've done is I've taken this exact
1:41dataset
1:42that Jardel has used here,
1:43and I've put it inside a Tableau Prep.
1:45It's essentially information about credit.
1:47And one of the things I like to sort of
1:49contextualize here
1:50is that sampling is one of these features
1:52that you probably don't even realize
1:53is going on in the background
1:55until you work with a large enough dataset
1:57and you realize you're not seeing all the
1:58rows.
1:58So what do I mean?
1:59Well, if I click on the input here,
2:01you can see that I have some settings
2:04and a setting that maybe you're not
2:05familiar with
2:06if you're new to Tableau Prep is this option
2:08here,
2:09data sample.
2:10Now, when I go and click on that,
2:12this is split up into two halves.
2:14Firstly, you have the number of rows
2:15that are gonna be brought in,
2:17and then you have the method at the bottom.
2:19So the row selection is actually what's
2:22going to choose
2:22which rows come in.
2:23So the number of rows you bring in
2:25and the selection method are actually two
2:27disparate things
2:28and not the same thing.
2:30And so what Tableau have added here in 23.3
2:32is this option here, Stratified Sampling.
2:35And so how does this work?
2:37Well, at the moment, my dataset isn't
2:39actually that large.
2:39So if I go ahead and click on Clean 1 here,
2:42you'll see that I see all 2000 rows.
2:45And so you're probably thinking,
2:45well, why would I even bother using this
2:47feature
2:48in this particular set?
2:49Well, if we look at our data
2:51and Tableau Prep is fantastic for this,
2:52you can see that each of these sort of
2:54groups here
2:55have a pretty decent spread.
2:57But if I'm gonna work on, let's say,
2:59a multi-million row dataset inside of
3:02Tableau Prep,
3:02with this default column,
3:03I'm gonna want to make sure that I sample
3:05my data
3:06and I see enough representative examples
3:08of this default behavior.
3:09And you can see most people in this credit
3:12data
3:12don't default on their payments.
3:13And so it's actually a very small
3:15percentage, 14%.
3:17And so if I was working on, let's say,
3:19a 3 million or 10 million row dataset
3:21and 14% of my data was this default,
3:26in the sampling scenario,
3:27you might actually get an
3:29under-representation
3:29of the rows that are about defaults.
3:31And so if you're trying to analyze defaults
3:33in that setup,
3:34in the context of the whole dataset,
3:36you're going to be slightly blind
3:38because essentially it won't be sampling
3:39as many rows as you want.
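
A quick way to see this effect is to simulate it. The sketch below is a simplified stand-in, not anything Tableau Prep runs: it assumes a hypothetical population of 10,000 rows with 14% defaults, draws 200 plain random samples of 100 rows each, and reports how few and how many defaults a single sample can contain.

```python
import random

rng = random.Random(42)

# Hypothetical population: 10,000 rows, 14% of them flagged as default (1).
population = [1] * 1400 + [0] * 8600

# Draw 200 simple random samples of 100 rows and count the defaults in each.
default_counts = []
for _ in range(200):
    sample = rng.sample(population, 100)
    default_counts.append(sum(sample))

print("fewest defaults in a 100-row sample:", min(default_counts))
print("most defaults in a 100-row sample:", max(default_counts))
```

The average comes out close to 14 defaults per sample, but individual samples land above or below that, which is why a small unstratified sample can leave you with fewer rows of the rare group than you would like.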
3:41So how do you force Tableau Prep to sample
3:44a better spread of this
3:46particular attribute
3:48in the dataset?
3:48So let's go do two things.
3:50Firstly, I'm actually gonna go into this
3:52credit data,
3:53data sample, and I'm gonna limit this to
3:55just 100 rows.
3:57So let's go ahead and select 100.
3:58So you can just see the effect of what I
4:00was just saying.
4:01So we'll set it to 100.
4:03That would actually run in the background.
4:04You won't see anything happening.
4:05And then if I go back to Clean 1,
4:07you'll see that we get the same sort of
4:09spread.
4:09But now if I just hover over this
4:11particular record,
4:12I have 13 rows, so 13%.
4:14So we've gone from 14% in the entire sort
4:17of dataset
4:18down to 13% just because of the way it's
4:20sampling.
4:20And so if I'm gonna work on this dataset
4:23and I wanna look at these sort of 13
4:25default payments,
4:26I'm really only working with 13 rows.
4:28And in those 13 rows,
4:30I might not have the necessary dimensions
4:33and behaviors that I wanna see.
4:35So how do I get Tableau Prep to split this
4:37up
4:37in a more representative way?
4:39This is where stratified sampling comes
4:40in.
4:41So let's go back to the input here
4:43and we'll go back to the data sample.
4:45And now that we are here,
4:47what we want to do is make sure
4:49that we go to this option here that says
4:51stratified.
4:51Now, when I click on that,
4:53you'll see that it gives us this option
4:54here
4:55that says client ID.
4:56And what it's essentially asking is,
4:58hey, how are we going to use the specific
5:01columns
5:02in our data, the dimensions in our data,
5:03to understand how we're gonna spread this?
5:06And so we'll go ahead, click on client ID,
5:08and I'll say, I want you to use default
5:10as the method of stratification here.
5:13And so what that will do is it will say,
5:14okay, these are the two groups.
5:16I'm gonna go and select 100 records,
5:17but I'm gonna try and make sure that
5:19they're balanced,
5:19more balanced than they are at the moment
5:21given the spread of the data.
5:23So if we go back to Clean 1
5:25and we have a look at our spread,
5:26now look at that, you see we have zero and
5:29one,
5:29you have 50 rows from that set and 50 rows
5:31from that set.
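
The 50/50 result can be mimicked with an equal-allocation sample, where the row budget is divided evenly across the groups. The sketch below only reproduces the outcome seen on screen; how Tableau Prep actually picks the rows is not described in this video, and `balanced_sample`, the `client_id` and `default` field names, and the generated rows are assumptions.

```python
import random
from collections import defaultdict

def balanced_sample(rows, key, total, seed=0):
    """Split a row budget evenly across the groups of `key`, then sample each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    per_group = total // len(groups)
    sample = []
    for members in groups.values():
        # A group smaller than its share simply contributes all of its rows.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

# Hypothetical stand-in for the credit data: 2,000 rows, 14% defaults.
rows = [{"client_id": i, "default": 1 if i < 280 else 0} for i in range(2000)]
sample = balanced_sample(rows, key="default", total=100)
counts = {v: sum(1 for r in sample if r["default"] == v) for v in (0, 1)}
print(counts)  # {0: 50, 1: 50}
```

Compared with a plain random draw, the defaulting group goes from roughly 13 or 14 rows to 50, so there is far more of the rare behavior to inspect while building the flow.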
5:32So now in the context of data prep,
5:34I'm actually seeing a more balanced
5:36split
5:37of the options across this specific column
5:39because I'm trying to build something
5:41for the defaulting behavior in my dataset.
5:44Now, yes, it's not a representative sample,
5:46but then when I run this data through in
5:48its entirety,
5:49so when I go up here and I, for example,
5:51select run
5:52and I run the flow, if I was to add an
5:54output here,
5:54for example, and I was then to go and hit
5:57run,
5:57what this would actually do then
5:59is run the whole entire dataset
6:00even though my sample is set up slightly
6:02differently.
6:03So this feature is really just for working
6:05inside of this view and working in a fast
6:08way.
6:08Now, why would you not just run all the
6:11records
6:11through Tableau Prep?
6:12Well, if you've worked with Tableau Prep
6:14for a while,
6:14you'll know that sometimes you do get
6:16performance issues
6:16when you're working with really large
6:18amounts of data.
6:20I think it's just the way Tableau Prep is
6:21built.
6:22I know it's built off web technology in the
6:23background,
6:24something called Electron, but I'm not a
6:26technical person.
6:26I don't know too much about it.
6:27What I do know is that when you try
6:29and push too many records through,
6:30you start to get these sort of issues and
6:32bugs that pop up
6:33just because of the sheer amount of
6:35processing that's going on
6:36and depending on your resources and your
6:37computer
6:38and a bunch of other things, again, I have
6:39no idea about,
6:40and that can cause issues.
6:42So sampling allows you to do two things.
6:44It allows you to keep your row numbers down.
6:47So if you're working with many millions
6:49of rows,
6:49you can keep that right down,
6:51but also make sure that the spread
6:52of what you're getting through is
6:53representative
6:54so you can work with it.
6:55And now that data sampling can be done
6:57across lots of different columns.
6:59I would always recommend you maybe use this
7:01on the column
7:01that best represents the thing you're
7:04trying to analyze
7:05and its spread, if that makes sense.
7:07So if I was analyzing, let's say,
7:09more demographic information about default
7:11payments,
7:12I might choose to point this towards a
7:14demographic field
7:15so that I get a representative spread of
7:17certain age groups.
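
As a rough illustration of stratifying on a demographic field, the sketch below buckets ages into ten-year bands and then takes up to a fixed number of rows from each band. The `age` field, the band width, the `per_band` limit, and the generated rows are all hypothetical; this shows the idea in plain Python, not a Tableau Prep setting.

```python
import random
from collections import defaultdict

def age_band(age):
    """Map an age to a ten-year band label such as '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

rng = random.Random(1)

# Hypothetical rows: 2,000 clients with ages between 18 and 69.
rows = [{"client_id": i, "age": rng.randint(18, 69)} for i in range(2000)]

# Group the rows by age band, the dimension used for stratification here.
groups = defaultdict(list)
for row in rows:
    groups[age_band(row["age"])].append(row)

# Take up to `per_band` random rows from every band.
per_band = 10
sample = []
for band, members in sorted(groups.items()):
    sample.extend(rng.sample(members, min(per_band, len(members))))

band_counts = {band: sum(1 for r in sample if age_band(r["age"]) == band)
               for band in sorted(groups)}
print(band_counts)
```

Small bands, such as the one holding only ages 18 and 19, get the same row budget as large ones, so every age group stays visible in the sample.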
7:18All these sort of things are things to
7:19think about.
7:20But what you're really trying to do here
7:21is improve the performance
7:22and make sure that what you're seeing
7:24actually makes sense
7:25and allows you to do the data prep you need
7:28rather than changing the output.
7:29I'll keep repeating that
7:30'cause I think sometimes people think that
7:31when they sample,
7:32it's not gonna come through with the output.
7:34That's not what happens when you hit run.
7:35When you hit run, all the records come
7:37through.
7:38So that is stratified sampling in 23.3.
7:41Hopefully, this explanation has been clear
7:43enough.
7:44If there's anything you'd like to add,
7:45maybe I've got something wrong
7:46or you work in statistics and you wanna go
7:48nerd out
7:49on some of the context of what I've said,
7:51let me know in the comments below.
7:53I'm still learning, so if you can help with
7:54that,
7:54I'd really appreciate it.
7:56Thanks for watching and I'll see you in the
7:57next one.