Tableau 2020.2: Incremental refresh in Tableau Prep Builder
Why run your calculations twice on data that hasn't changed? Here's how incremental refresh in Tableau Prep saves you the trouble.
- Incremental refresh in Tableau Prep is configured in the input settings, but you must first build out the full workflow and create an output before it can be enabled
- You identify new rows using a field that always updates with new records, such as order date, and Prep matches this against a corresponding date field in the output
- On the output step you can independently control what a full refresh versus an incremental refresh does, including setting a full refresh to only append data for growing historical data sets
- Incremental refresh only processes new rows, so calculations like date differences are not recomputed for data that has already passed through the flow, which is vital for very large data sets
- The feature is especially powerful when running Tableau Prep Conductor on Server or Online, letting you process only incremental additions rather than the entire history each run
0:00In Tableau 2020.2, Tableau have added the
0:03capability to do incremental data refreshes
0:05inside of Tableau
0:06prep. Now the other benefit here is that
0:08you can also append data to data sources
0:10using the same
0:11capability. So let's take a look and see
0:13how that works. I've got two files here. I
0:16've got this
0:17sample superstore file which is essentially
0:20the full file that we get when you install
0:22Tableau.
0:22It's essentially the sample date that comes
0:24with Tableau desktop. Then I've got a
0:26slightly smaller
0:27file here which only contains the first two
0:29years of data, so 2016 till the end of 2017
0:32.
0:33Okay, so what we're going to do is we're
0:34going to first connect to this smaller file
0:36,
0:37and then we're going to do an incremental
0:38refresh to see how it handles the
0:39additional data. I'm
0:41just basically going to update this file by
0:43overwriting it. Okay, so let's go ahead and
0:46connect to this file in Excel. You'll see
0:48here that it opens up the folder that I was
0:50in already,
0:51so I've already got this pointed to that.
0:53We'll go ahead and open it, and as we
0:56connect to this
0:56file in Excel, you'll see that we get a
0:58list of tables. We're actually getting a
1:00bit of
1:00duplication here. This is because the table
1:03and the named ranges are essentially the
1:06same. So
1:06there's a table and a name range that are
1:08basically highlighting the same thing. So
1:11we're just going to take the table, the raw
1:13data that's inside of the Excel file,
1:14and you can see here the different fields.
1:17Now the incremental refresh setting is
1:20actually in here.
1:21It's in the input setting, and in order to
1:22get this to work properly, we're going to
1:24need a few
1:25things to set up. If I just zoom in here
1:28and hit enable, you'll see that I have
1:31another option that
1:32turns up when I tick that box, and I can
1:34identify new rows using a particular field.
1:37Now because
1:37I'm looking at orders, every time an order
1:40is made, the order date is updated. So each
1:42order
1:43has a new order date. You can never sort of
1:45go back and change the order date on a
1:46particular
1:47order. So that's going to be the field that
1:49I use to identify new rows of data in my
1:52data set.
1:53And you can see here that it wants an
1:54output, and when I select that drop down,
1:56I have no output, and that's because I
1:58simply haven't built out the workflow yet.
2:00So you're
2:00always going to want to make sure that in
2:02order to set up an incremental refresh, you
2:05build out
2:05the workflow. I'm just going to do a very
2:07basic build here. I'm just going to add one
2:09clean step
2:10in. It's going to give me an alert because
2:12I haven't actually sorted this setting out.
2:15So
2:15I'm going to untick that so that we can get
2:17rid of that notification. Click on the step
2:20just after.
2:21You'll see that alert disappears, and you
2:23can see here that I've got two years worth
2:24of data.
2:25Okay 2016 to 2017 and then 2017 to the very
2:29beginning of 2018 in the data set. So now
2:33that
2:33we've got this, I'm just going to create an
2:35output, a very simple output. I'm not even
2:38fussed where this goes to be honest. I just
2:40need the output to exist so that I can
2:42finish setting
2:42up the incremental refresh. I'm now going
2:45to go back into my input, go back to
2:47settings,
2:48enable the incremental refresh, and now you
2:51can see I can select an output. So this
2:53output name
2:54here, it's really important to make sure
2:55you name these correctly because this is
2:57what will come
2:57through over here. So I'm going to select
3:00that output and you're going to see that
3:02this is going
3:03to ask me for which field in the output it
3:05should be monitoring to see updated data.
3:09So you notice
3:11that Tableau here has not just chosen any
3:13two fields, it's chosen the two fields that
3:16are most
3:16similar to the order date, i.e. a date
3:18field. So it's either the ship date or the
3:20order date.
3:21And so I want to make sure these two marry
3:23up. It might be that you're working in two
3:25different
3:25databases and actually you have incremental
3:27data that comes in on one side but you need
3:30to actually
3:30use a different column in another data set.
3:32And so that's a potential use case for
3:34having different
3:35sort of named fields in this particular
3:37case. Okay so now that's all set up. If we
3:40go over to
3:41the output, you see that we have a few new
3:42options here. So you can see here that you
3:45have the
3:45ability to do a full refresh and when you
3:48do that you actually get the ability to add
3:50what happens
3:51when that runs. So you can actually say
3:53that when you do a full refresh it should
3:55only add data.
3:56This is useful for example where you want a
3:58data set to grow larger every single time
4:01you run the
4:01workflow, maybe because you're updating
4:03historical records and all you're doing is
4:06appending
4:06information to the data set or you're maybe
4:09trying to get a different shape to your
4:11data
4:11and you're trying to use that in some other
4:13analytical purpose. But I'll just keep
4:15these
4:15as their default. So a full refresh will
4:18refresh the whole data and an incremental
4:20refresh will
4:20add data. That's sort of the most logical
4:22way of doing this. You can swap those two
4:24around
4:24as I said for then that particular use case
4:27. And then you'll also see a small change
4:29here in the
4:30run flow button. When I click on that drop
4:32down you'll see here that a full refresh,
4:34you get the
4:35full description of what's going on here
4:37for both full refreshes and incremental
4:39refreshes. So let
4:40me zoom back out. We're actually going to
4:42have a go at simply just running this
4:43workflow. We've
4:44never run the workflow so it's just going
4:47to save this out to the default location.
4:50Unfortunately in
4:50this version it's decided to crash. Let me
4:52try and fix that. If we go to the alert you
4:54'll see here
4:55that it's actually giving me a good reason.
4:57I'm essentially trying to write to the same
4:59file that
4:59I'm reading from so I actually need to give
5:02this a slightly different set of locations.
5:04So let's
5:05go ahead and maybe just save this in a
5:07different location. Let's go away to my
5:10desktop if I can
5:12find it. Here we go. Let's go over here to
5:14this folder. This will do. I'll call this
5:16output update
5:18the hyper click accept and click accept
5:23again on that. And now this is all up to
5:28date. I should be
5:29able to run this flow successfully. So let
5:31's have another go at running it and you can
5:33see
5:33that successfully run this time around. So
5:36if I take a look at my data set you can see
5:39that
5:39everything is in there. Again we've only
5:41got these two order dates. Now what I'm
5:43going to do is I'm
5:44going to switch the files over. I'm going
5:47to simply take the data that is in this
5:49file. So this is the
5:50new file. I'm actually going to copy it. I
5:53'm on a mac so I'm going to hit hold alt and
5:55I'm going to
5:56drop it into that table and it's going to
5:57ask me what I want to do. I want to replace
5:59the file
6:00that's in there and now that's up to date
6:01you can see the file size has changed here.
6:03Now if I was to click on this again and
6:08then come through you'll notice what it's
6:12actually doing in
6:12the background is it's only running
6:14additional rows. Let me just refresh this
6:16connection to
6:17make sure it's pulling in a non-cached
6:19version of that data set and then we can
6:22see this update
6:23in front of us. There you go. So now you
6:27can see it's pulling in the full data set
6:31but the
6:31difference here is when we actually run
6:35this output if we run this and do an
6:37incremental refresh
6:39it's only going to run the flow for the new
6:41additional rows that have actually come
6:44through
6:44our data. So if you're doing any sort of
6:46calculation or anything else those are not
6:49going to be redone
6:50for data that had already been through the
6:52workflow if that makes sense. And this is
6:55fundamental because if you've got a really
6:56large data set let's say you're working
6:58with
6:58a multi-billion row data set you're not
7:01going to want to run all your calculations
7:05and computations
7:05again on data that has already been run
7:07through this process especially if it doesn
7:10't change.
7:10Let's say I'm calculating the the date the
7:13difference between the order date and the
7:15ship date you can see here once this is
7:17calculating an order why would I want to
7:20run that twice and so
7:21the incremental refresh is extremely useful
7:24in this particular case. You also get the
7:26ability
7:27to run them here so you can see that in
7:28this small drop down I get the ability to
7:30do an incremental
7:31refresh there as well so I can just hit the
7:34incremental refresh what it's actually
7:36doing
7:36is it's adding data to the data set and you
7:39notice this was much faster this time. It
7:41just
7:42ran the data right through. And so now I
7:44have all my data available to me in and so
7:49now you see I
7:50have all my data available to me in the
7:52extract. This is particularly useful if you
7:54're running
7:55Tableau prep conductor on the Tableau
7:57server or Tableau online. Essentially this
8:00allows you to run
8:01sort of data prep on incremental additions
8:04to databases that you might be working with
8:07. So let's
8:07say again you're looking at orders you don
8:09't need to run this process every single
8:11time on old
8:12orders otherwise your your process is going
8:15to just keep taking longer and longer and
8:17longer
8:17as you build more years of history into
8:19your data set. That's pretty much the
8:21feature in a
8:22nutshell if you've enjoyed this video hit
8:24like hit subscribe if there's any question
8:26you'd like
8:26answered drop them in the comments I'm
8:28pretty responsive on comments and if there
8:30's something
8:31that you don't see in this channel or you'd
8:32like to see more of drop a comment below
8:34let me know
8:34know what you'd like to see. I'll catch you
8:36in the next video.
8:37[ Silence ]
Be sure to check out out my professional home, @The Information Labfor more great Tableau, Alteryx and AWS content. Visit my blog https://tableautim.com Say hi on twitter: https://twitter.com/tableautim?l --- Files used in my videos for 2020.2 videos: https://j.mp/2YvaEot ----Join my Discord Server. https://discord.gg/shBuxXr it’s a little sparse at the moment but hang in there.