Tableau Prep Builder: Cleaning / Cleansing data (Part 3 of 7)
Excel asks you to trust that every manual edit landed correctly, but Tableau Prep's clean step shows you exactly what changed and lets you reorder it.
- The clean step's toolbar adapts to the column you select and its data type, offering split, filter and calculation options for text but a reduced set for integers and dates.
- Data roles such as city, postal code and country let you tag columns with geographic metadata that downstream tools pick up automatically for mapping and analysis.
- Changes within a clean step aren't strictly sequential, so you can delete, reorder or drag steps, though reordering can break later steps that depend on earlier ones.
- Group and Replace can cluster misspelled values automatically using pronunciation, common characters or spelling, and you can also build groups manually.
- The bottom preview is a live Tableau view of your data, letting you drill from the high-level aggregate summary down to individual problem rows.
0:00Hey, it's Tim here, and welcome back to
0:03this third video in this Tableau prep for
0:06Excel
0:07users series. In this video, I'm going to
0:10be focusing in on one of the first steps,
0:12which is the cleansing data step. To do
0:15that, I'm actually going to open the sample
0:17flow
0:18found here on the bottom left of Tableau
0:20prep. Everyone has this in their default
0:22install
0:22of Tableau prep. If you don't, something's
0:25wrong with your installation. So click on
0:27this flow, open it up, and this will be
0:31what we're working with today. We're not
0:34going
0:34to actually build anything, we're just
0:36going to go through this particular flow,
0:38and we're
0:38going to add some additional steps to it
0:41and do some sort of data prep here. Now
0:43before
0:44I move on, I just want to go back to Excel.
0:47Now, if you're working in Excel and you're
0:50doing any sort of data cleansing, what you
0:52will typically be doing is working with the
0:54data itself. So this is literally the data
0:56that I'm highlighting here. You have lots
0:59of different data types. You can just see
1:01that there's text and there's numbers,
1:04there's
1:04some ID fields, but very much it's up to
1:06the individual to understand what's going
1:08on here.
1:09And this is the only visual sort of element
1:11and summary that you get. There's no sort
1:13of aggregation happening in the background
1:15to give you an overview of what's happening
1:18.
1:18Second more, let's say that I want to
1:20change some of the columns. Let's say that
1:22I'd like
1:22to maybe change some of the column headers.
1:25If I just scroll to the left here and go up
1:27again, I can just go in here, rename order
1:31dates to order date new, for example. And
1:34there you go, that's the sort of column
1:37name changed. But of course I can make a
1:40typo there
1:40and it's very easy to sort of make a
1:42mistake, change a wrong column maybe
1:44because that selection
1:45is basically down to human error. There's a
1:47few bits in other bobs. So a lot of sort
1:49of advanced users of Excel get very good at
1:51not making mistakes, which is fine, but we
1:54're
1:54all human. And even as a consultant, I've
1:56seen Excel files handed to me which have
1:58been
1:58told to me as gospel and using advanced
2:01tools like Alteryx or even Tableau Prep, I
2:04've been
2:04able to find mistakes in the data because
2:07the data was manually prepared. Now, the
2:10difficult
2:11thing here is you don't get an overview of
2:13what's going on. Let's say I just want to
2:16check which columns here have text as data
2:19types. Well, of course you'd say that
2:22columns
2:22like state, city, country are all basically
2:25just text, but you might also think the
2:28date
2:29is text, but it's not. It's actually a date
2:31type here and it's very different in a
2:33database
2:34when you store dates versus storing text.
2:37Now, you and I would write this in exactly
2:40the same way, but a database actually
2:42physically perceive this in a different way
2:44. If I just
2:45choose general here and go back to the
2:46default, you'll see that it actually
2:48interprets this
2:49as a number. Okay? I think this is the
2:51number of days since the epoch timestamp
2:54essentially.
2:55And so if I just undo that step and put
2:57that back to a date, the other thing you'll
3:00notice
3:00is that I made that change for that one
3:03cell. Okay? Now, typically what you might
3:06do is
3:06you might want to apply this to the whole
3:08cell. So you'll do this, you'll select the
3:10column, you go back, you go back to format
3:12cells, and then you'll say general, you
3:14click
3:15okay, and there you go, you've made your
3:17data change. But you don't get to sort of
3:19see that
3:19visual thing. You have to just trust that
3:22all those changes have been made.
3:23Now, the other thing is the data roles in
3:26this dataset aren't apparent. For example,
3:30the country, city, and state, and postal
3:32code are geographical bits of information.
3:35In fact,
3:35a lot of good visualization tools can take
3:37this and help you do things like mapping or
3:39help you do things like figuring out
3:41regions and sort of demographic information
3:44that can
3:44be overlaid onto this. And so if you want
3:47to use this in a more interesting way, it's
3:49not good enough just to store this
3:51information as text. If I just go into this
3:54column, you'll
3:54see here that it's just got a general data
3:57type and it's just being stored simply
3:59there
4:00actually as a number, not even as text. So
4:02that's the actual text donation. So it's
4:05very
4:06difficult to work with this and actually
4:07trust what's going on in the background.
4:08Now, if I head over to Tableau Prep and
4:11head over to a step tool. Now, if you
4:13recall the
4:14previous video, the step tools are these
4:17flat lines here that you see. They kind of
4:19don't
4:20have any sort of shape to them. They're
4:21really difficult to describe actually. But
4:24the step
4:24tool I'm going to be focusing on is this
4:26particular clean step here. I'm just
4:28selecting it here
4:29in blue. When I select it, it will open up
4:32this summary window. Okay? And if you
4:34recall
4:34in the previous video, the summary window
4:37has context based on the type of tool you
4:39're
4:40using. So the join tool just before it has
4:42a very different summary window to the step
4:45tool. Okay? So it's important to get
4:47familiar with the kind of things you can do
4:49inside
4:50of the step tool. Now, the first thing to
4:52bear in mind is that you've got this sort
4:54of row across the top. This is essentially
4:56your toolbar. And you should be aware that
4:59this changes depending on the column that
5:01you're working on and the data type. So let
5:03's
5:03say if I select the approve a column, I get
5:06a whole bunch of sort of tools available to
5:09me like splitting, filtering, creating a
5:12calculation. If I go to this numerical one,
5:14it changes
5:15because I'm now working with an integer
5:17rather than text. So I lose the ability to
5:19split
5:20it like as I would do with text, but I keep
5:22some of the other capabilities. If I go to
5:25a date column, let's look at order date. It
5:31doesn't change. But the difference here is
5:33that Tableau is actually applying this as a
5:36date. So if you click on this icon here,
5:38you'll see that there are lots of different
5:39data types. So Tableau is also visually
5:40telling
5:41me the data type for each column. And for
5:43some of you, you might say, well, the
5:45summary
5:46view is a little bit sort of unfriendly. I
5:48'd still like to be able to see the data.
5:51Well,
5:51that's exactly what you're getting here
5:53below. This preview at the bottom is
5:54actually a Tableau
5:55view of how that data looks like in Excel.
5:58And the really powerful thing here is if I
6:01just select maybe the items that take about
6:04one day to ship, let's really narrow this
6:07down. I'll hold Command, and in this
6:10particular example, I'll see items with a
6:13zero with a
6:1410% discount. I've narrowed this down to
6:17just one row, and I actually get to see
6:19that particular
6:20transaction there. It's very easy just to
6:23see that and scroll across, analyze it, see
6:25what's going on. This is particularly
6:27useful for finding problems because this
6:29aggregate
6:30view here at the top gives me that high-
6:32level view. Imagine like an eagle flying
6:35over a
6:35landscape. From a very high level, they can
6:38figure out what's going on, where to go,
6:41which
6:41way to head, where to find resources. And
6:43then in this Tableau view, you can actually
6:46drill into the problem and maybe work out a
6:48data cleansing solution and go from there.
6:50Now, if I zoom into this icon here, this is
6:53the integer icon, you'll also see that you
6:56get another sort of type of data type, and
6:59Tableau calls them data roles. You have
7:02emails,
7:03URLs, and geographic data roles. If I just
7:06sort of scroll down here a little bit,
7:08sorry,
7:09zoom out, then I have to zoom out here a
7:12bit, then I can zoom back in. If I just
7:15sort of
7:16go up and down this list, you can see that
7:18Tableau understands lots of geographic sort
7:20of data types. So it's actually brought
7:22that here into Tableau Prep, and I can
7:24actually
7:24specifically assign those data roles to a
7:27particular column. This then means that
7:29when
7:29I push this data onto Tableau or other
7:31tools, they'll automatically pick that
7:33metadata up
7:34and start using that to help with
7:36visualization or whatever those tools do.
7:39Maybe it's an
7:39analytical solution, whatever. If they can
7:42use these kinds of metadata, then you can
7:44put them here inside of Tableau Prep. In
7:47Excel, I have to do that manual sort of job
7:48of going
7:49in and figuring out what's going on. Now,
7:52the advantage here is that Tableau doesn't
7:54just leave you hanging. If you remember
7:56here at the top, we've got our tool pane.
7:58There's
7:58this section which looks at recommendations
8:00. You know, these are things that Tableau
8:02Prep
8:03thinks you can do with your data, okay? And
8:05you can see here that it is actually
8:07suggesting
8:08certain data roles for certain columns. And
8:11if you look at these, you can see that it's
8:13suggesting the city data role for the
8:15column name city. There's no surprise as
8:17far as they're
8:18literally called the same thing. So I'm not
8:20sure that's even machine learning or AI. It
8:22's
8:22just being smart. Postal code, country,
8:25state, these are all geographical items.
8:27And there's
8:27actually a geographical sort of equivalent.
8:30And so Tableau is just suggesting that we
8:33do that. If I zoom back out and I click on
8:37the city, you'll see that it actually zooms
8:40in on the city column here in blue. And it
8:42shows that recommendation at the top. And
8:44if we zoom into that, it just says here, "
8:46Many values match this data role." I.e., it
8:49's looked
8:49inside of this column and has realized that
8:52many of the values actually match cities
8:54that
8:55Tableau is aware about. And so it's
8:57suggesting that change. And if you hit
8:59apply, nothing
9:00happens to that column. It stays exactly
9:02where it is. But you can see here that it
9:04now has
9:04this new data role which is city. And you
9:07can see that this icon has been applied.
9:10Now
9:10this icon is a change. And different
9:12changes have different icons. If I zoom
9:15back out,
9:17you'll see that this particular step
9:21already has four changes. If I click this
9:23triangle
9:24to expand it, you'll see that there have
9:26already been changes made. There were three
9:28changes
9:28actually prior to me making that change for
9:31the data role. Now the way I know that is
9:34if I look at the clean step up here and I
9:35zoom into that, you'll see that I actually
9:37have icons at the top. And if I zoom back
9:40out and then just move this up, you can see
9:43that it actually tells me all the steps. So
9:46I have two remove fields, removing order ID
9:49and product ID. Then I have one rename step
9:52. And then I change my role to city. And for
9:55every step that we add, those will appear
9:58here on the left hand side. And the number
10:00of changes will either change to four, or
10:03if you've got this open, you'll see them
10:04sort
10:05of accumulating here, almost like a shelf.
10:07Okay. Now the powerful thing here is you
10:10don't
10:10have to go with that. Let's say you want to
10:12remove a step. Well, these aren't
10:14sequential
10:15changes as such. You can sort of remove
10:17them and ad hoc. So let's say I didn't want
10:19to
10:19remove my product ID field. I can simply
10:21delete that step and Tableau will run the
10:23remaining
10:24steps in sequential order. Okay. If I go
10:27back one step and add that back in, any new
10:30step
10:30that I add is going to appear below here.
10:33Let's say maybe I want to do that
10:34differently.
10:35Let's say I want to change the order in
10:37which these steps run. I can simply drag
10:39that particular
10:39step further up and it changes the way that
10:42everything operates. Now that in some
10:44instances
10:45might meaningfully change the way the data
10:47works. So if that's a useful step, then it
10:49's
10:49a good thing to be aware of. And then you
10:51'll see what I did here is I clicked on the
10:53final
10:53step and Tableau has continued that data
10:56processing all the way through. Now, one of
11:00the last
11:00things to think about is you might want to
11:02remove certain rows. Maybe you want to
11:04remove
11:05certain null values. Maybe, for example, I
11:07'm only analyzing return cells. And so I'm
11:09only
11:09interested in looking at the cells which
11:12have return reasons. I can simply right
11:15click on
11:15null and select exclude. And those are
11:18removed from my data set. And you get the
11:21sort of
11:21visuals here updating these little bar
11:23charts that you get in each column visually
11:26update
11:26to show you how much more of the data is
11:28left in proportion of the sort of entire
11:30context.
11:31And don't forget that change that we made
11:33that's filtered here on the top right of
11:35this
11:35column is also here on our pane on the very
11:38left hand side. Now, the last thing I want
11:40to show you is some of the sort of unique
11:42aspects of Tableau Prep and the way it can
11:44help clean data. Okay? I want to find a
11:47column called approver. And the approver
11:50column is
11:50very interesting because if I zoom into it
11:52here, you'll see that at the very top, you
11:54've
11:54got two sets of names, E. Williams and C.
11:57Arnold. And they're spelled in every sort
12:00of different way. And one of the things you
12:02might do is group these together in some
12:04way
12:04or form. In Excel, this might be a manual
12:06job just going in and just, you know,
12:08selecting
12:09the records and pasting the full name just
12:11to get a nice clean data set. You might set
12:14an intern on this and get them to clean a
12:16whole bunch of data. Well, that's not the
12:18smart way to do it. Tableau Prep has a
12:20really cool feature here that allows you to
12:23group
12:23items that sound the same. And so as well
12:26as filtering, which we did before, you can
12:28also group and replace items based on
12:31different factors. So in this particular
12:33case, I could
12:34do it manually. But some of the more
12:36powerful thing is using pronunciation or
12:39characters
12:39that are common across the set or maybe
12:41even spelling. So let me just use
12:43pronunciation
12:44and see what happens. It opens up this new
12:47interface. And I get this slider that tells
12:50me on the strength of a particular grouping
12:52. And anywhere where Tableau has actually
12:55achieved
12:55a grouping, you'll see here that it
12:57actually has this paperclip icon to the
12:59very left.
13:00And it shows you how many rows have been
13:02thrown into that group. Okay? If I click on
13:05"See
13:05Arnold," you'll get to see all the
13:07subsequent items that fall within that
13:09grouping. And
13:10if you don't like something, you can just
13:12simply untick it to remove it and it will
13:14appear again. But if you want to, you can
13:17click it from the outside and group the
13:20values
13:20and that will bring it into sort of the
13:22grouping. And you'll see that it pops back
13:24in there.
13:25Okay? So the better way to do it is to
13:27absolutely use the pronunciation of the
13:29smart algorithms.
13:30But you can absolutely do a manual grouping
13:32as well. I'll just type in "See Arnold"
13:35here
13:35because I've done a manual group. Hit enter
13:38. And now that grouping has been carried out
13:40.
13:41You also get the bar charts updating to
13:43reflect that. If I click on "Al Jenkins,"
13:46you'll see
13:46here that there's not many sort of other
13:49items here. But "Al Jenkins," "Al Smith,"
13:52and some
13:53of the other groupings seem to not be
13:55needed. It's just those two names that were
13:57causing
13:58an issue. So when you're done, you can just
14:00hit "Done" and we are done. I can zoom back
14:02out. And now that step has appeared over
14:06here. The group in replace uses this paper
14:09clip icon
14:09that you see here on the left. Okay. The
14:14last thing to do is to show you some of the
14:16really
14:16cool functionality that TaboPrep has. I've
14:19selected the "Approver" column. And you can
14:20see, as I showed you in the previous video,
14:23that these lines reflect where this "Appro
14:25ver"
14:25column is coming from. It's coming from the
14:28returns data and where it's going to. Okay.
14:31So the "Approver" column won't be found in
14:33any of my other datasets. So you can sort
14:35of trace the history of this column, as it
14:38were, through your data operations. The
14:40other
14:41thing to bear in mind is that as you work
14:42through different data types, let's say I
14:44click on this column. This keeps changing.
14:47But more importantly, I can create
14:49calculations
14:50and start working with really interesting
14:52sort of logic. Okay. And the way to do that
14:55is simply to select the column. And I can
14:57just select the option here to create a
15:00calculated
15:00field. With most things Tableau, there's
15:02always more than one way to do it. So this
15:04is one
15:05option. I could also click on these three
15:07dots, create a filter, and then have a
15:09calculation
15:09filter based. I can also just go ahead and
15:12create a new calculation here on the right
15:14hand side, just by going ahead and
15:16selecting "Create Calculation." Okay. So I
15:19'll just
15:20create a calculated field, and it opens up
15:21this interface. Now in the next video, I'll
15:24dive deeper into formulas and calculations.
15:27And so I'm not going to do anything here.
15:29But just to be familiar, you get all the
15:31different groupings like you would do in
15:32Excel. So if
15:33I scroll down, you'll see all the functions
15:36that you're used to. Date diff, date trunc
15:40ation,
15:40date, lengths, you know, all the kind of
15:43interesting things. Now, you won't see
15:46things like VLOOKUP,
15:47because VLOOKUP is actually a different
15:49type of operation that I'll go into later
15:51on, which
15:51is more synonymous with database joins. And
15:54so that's how Tableau will group that item
15:56and sort of term that particular operation.
15:59Right. That's pretty much the step function
16:02.
16:02Every time you add a step, you get the full
16:04power of everything I've just described in
16:08the setup. So if you look at some of these
16:10other steps, it goes to this step here.
16:12Although
16:12we've only got four columns, as I click
16:14through these, you'll see those options
16:16appear.
16:17Again and again, if there's no
16:18recommendations, you won't see the
16:20recommendation window appearing
16:21here. But this is a really, really powerful
16:24way to have a look at your data and work
16:26with
16:26it. If you want to make this Tableau view
16:29fill this space, you can just click on the
16:30second icon, and you'll get a little bit
16:32more data. And if you want to just see the
16:35columns
16:35and their own without any data, you just
16:37click this final icon here on the right
16:39hand side.
16:40And you can actually see what's going on,
16:42along with any changes that have been made
16:44in this particular step. So if you're
16:46making changes to a column, this is an
16:49amazing place
16:49to see how that's being made and what
16:52exactly is going on. Now, one thing I'd
16:55like to do
16:56is just show you one other thing that's
16:58very, very easy to do here in Tableau Prep.
17:01Let's
17:01say I'm not happy with the order of these
17:03columns. I can just simply move them around
17:06.
17:06Job done. In Excel, that would be much
17:08harder. I mean, I'm not a great Excel user,
17:11but my
17:11understanding is you'd have to insert a row
17:13, move the column across, and then move the
17:16column into where you want it. So you have
17:18to create the space to make the space. It
17:20just gets a little bit complicated. In
17:21Tableau Prep, you can just click and drag,
17:23and there
17:24you go. Again, if you want to rename it,
17:26just click on the name of the column, order
17:28ID.
17:28I'll call this new. Hit enter, and there
17:31you go. We've just renamed the particular
17:33field.
17:33That's considered a step. Okay, so if you
17:36do that step last, it's not ideal. You
17:38might
17:39want to do it a little bit earlier on in
17:41the flow, but then again, that might change
17:43subsequent
17:44steps. You'll see here that it's actually
17:46broken this particular step because it's
17:48saying
17:48it can't find order ID one. So just be
17:51careful with the way that you do your
17:53operations.
17:54If you want to get rid of it, you
17:56absolutely can delete the step, and that
17:59error goes away.
18:00Okay, that's been the video. If you've
18:02enjoyed the content, if you enjoyed the
18:04video, let
18:05me know. These are a little bit longer than
18:07usual, and thank you for bearing with me as
18:08I sort of take you through the basics of
18:11Tableau Prep. If you would like a different
18:14kind of
18:14video about Tableau Prep or even Tableau,
18:16drop a comment below, let me know, give me
18:18some
18:18feedback. Otherwise, make sure you
18:20subscribe so you catch the latest content.
In this video we’re into the series and I’ll walk you through the basics of cleaning data in Tableau Prep Builder. -----Join my Discord Server. https://discord.gg/shBuxXr it’s a little sparse at the moment but hang in there.