When data is organized and defined to match business processes, it serves as the jet fuel to bring about actionable insight and a thriving self-service culture. Well-structured data models are the backbone that support rapid, accurate analytics. How strong are your data models?
Three common signs your data models may be letting you or your organization down:
- Excessively long report run times
- Difficulty finding the data you need from all sources
- Having to spend more time writing SQL code than doing analysis
In addition to addressing the importance of data modeling, we also discuss database basics and:
- Why a data warehouse and data marts are still important, even today
- Why what you call a data warehouse probably isn’t one really
- The impact of the source data on your reporting
Our discussion will focus on the data structures themselves and is independent of any front-end tool. This webinar is for everyone, regardless of what analytics platform you use.
A certified Microsoft business intelligence architect and developer with over 20 years of experience in software development, Andy has made regular appearances at the Power Platform World Tour events over the years. He also serves as assistant organizer of the NJ/NY branch of the Power BI User Groups.
Welcome to Senturus’s webinar series. Today’s soup du jour is Good Data Models Lead to Great BI.
You’ve got questions, we’ve got answers.
Please feel free to use the GoToWebinar control panel to make this session interactive. We are usually able to respond to your questions while the webinar is in progress, and if we don’t reply immediately, we’ll definitely cover it, either in the Q&A section at the end or via a written response document that we’ll post on the Senturus.com website.
The first question we usually get is, can I get a copy of this presentation, and unfortunately the answer is no.
Just kidding. You can absolutely get a copy. It is available on Senturus.com under the Resources tab, and then under the Knowledge Center and we just shared the link in the GoToWebinar control panel.
Today’s agenda, we’ll start off with some introductions.
We’ll talk about whether data modeling is still relevant.
Why is data modeling important? What is a data model? We’ll do some live demos. We’ll do a quick overview of Senturus and some additional resources that we have, and then we’ll end up with a Q and A on some of your questions and hopefully get answers to you.
Joining us today is Andy Kinnier.
Andy is a certified Microsoft Business Intelligence architect, and a developer with over 20 years of experience in software development.
Andy has made regular appearances at the Power Platform World Tour events over the years. He also serves as the assistant organizer of the New Jersey/New York branch of the Power BI User Groups.
My name is Todd Schuman and I’ll be the host today.
I run the installations, upgrades and optimization practice at Senturus. You may recognize me from other amazing webinars such as Cognos Analytics Performance Tuning Tips and Tips for Installing Cognos 11.
With that said, I’d like to turn it over to Andy, who will take you through today’s presentation.
Thanks, Todd appreciate it. Hello, everyone. What we’ll talk about today is whether or not data modeling is still relevant.
And let’s dive right in. New tech replaced modeling, right? That’s kind of the big question out there: with big data technologies, you take your stuff and throw it into a data lake. A lot of people are out there saying, oh, you just need to store your data in a data lake, apply a third-party tool to it, and you can do analytics right away and get the answers you need.
And there’s a lot to be said for that, with big data technologies.
One of the reasons it came about was that it was considered too hard to load data into a data warehouse. We’re going to talk a little more about data warehouses, but the data warehousing process essentially takes a dimensional model and loads it.
What we’re going to compare and contrast a little today is whether or not big data technologies have discarded the need for a good dimensional model.
And there are claims that that’s the case.
But very recently in the big data world, they’ve actually created some technologies where you can do some of those things, like inserts, updates, and deletes on files that are in a data warehouse, excuse me, in a data lake, and then have the same type of analytics as a dimensional model.
So it’s interesting: for a while, people were saying that the dimensional model is no longer needed, that you don’t have to spend your time doing that.
But now that the technology is there to do it, they’ve come up with what they call a data lakehouse, which is one way of achieving both big data technologies and a dimensional model in a data lake environment.
So it’s an interesting topic at this point.
What does the data lakehouse attempt to do? A data warehouse does this set of things, a data lake does that set, and the data lakehouse kind of combines them. And then, once again, we’re back to our data model, the dimensional model that we build, and it works in both a data lakehouse and a data warehouse.
These are big data technologies, and these are high-minded concepts from an architectural perspective in an organization.
How do we go about getting the data out to our people, so that they can make good business decisions? And is the data model process the right way to go, or is it even still useful in today’s world?
You can all see the convergence, and that’s what the prior slide had just shown.
Big data processing is now heading towards dimensional modeling with the data lakehouse concept, and data warehousing has had big data processing for a while.
Very large companies with very large amounts of data coming in asked: how do we load a data warehouse overnight? If it takes 30 hours for me to load my data warehouse, with so much data, you couldn’t do it. Amazon couldn’t do that. Facebook couldn’t do that.
All these companies had tremendous amounts of data coming in, so the data warehousing concept had to be put to the side.
And big data concepts came into play.
So, they came up with techniques like the MapReduce technique, which evolved into a number of tools, so that you can scale out processing power and memory and then do analytics on very large amounts of data.
And with the data warehousing tools that were out there, if you do have the ability to load a data warehouse overnight, or on a certain cadence, say every hour or so, you still have very large amounts of data.
And we need to be able to scale that out. That’s the technology of massively parallel processing, or MPP, appliances that were created some 15 to 20 years ago in the data warehousing world, and are now part of the cloud platforms out there.
So, data warehousing has evolved into big data in terms of scaling out to multiple processors and memory.
So, why is data modeling important? What does it do for us? And this is just a great question, from a business standpoint.
I’ve had many conversations with people, especially on the big data side, good big data architects who are very good at the processes they handle.
And many times they just don’t understand why we’d want a data model, or why we’d put the effort into building it, and there is effort! There’s upfront effort, that’s important.
But it’s always important to show the end result of it, and that’s one of the things we’ll do in this webinar: show some demos at the end as to why a data model is still important and relevant, as we said before.
From a bullet-point perspective: ease of use. It’s easy to understand.
We’re used to seeing dimensions and fact tables. The dimensions represent the nouns in our organization; for nouns you would say product, company, date.
These are things that we understand from a company perspective. I want information about a customer?
It makes sense to have a dimension called DimCustomer or DimProduct.
And then fact tables go with them. A fact table is an event in an organization: I made a sale.
That sale gets recorded as a transaction in a database.
And we put those into fact tables in a dimensional model.
And from that perspective, we know that a fact table has an event associated with it.
So if I’m looking at FactSalesOrders, I know what that is: it’s an order that came in, and I can have a certain date filter applied to it.
And if I want to see my customers broken down into their orders, I have a relationship between fact and dimension, and it’s easy enough to drag and drop
in a third-party tool that’s made for it.
It’s also very performant. There’s a process here we won’t go into too much at this point.
But with denormalization, when we take something like a customer table, we make it into a very wide table with many columns. It’s not going to be a table that relates to another table, say an address table, which relates then to a state table, which then relates to, let’s say, a country table.
In this case, we have one customer table, we move all those fields into that customer table, and we just have one relationship back to the fact table.
And that makes this highly performant, especially with the column store indexing that’s come out in the last 10 years or so; it’s made for that sort of process to be very fast.
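To make the denormalization idea concrete, here is a minimal sketch that flattens a hypothetical customer → address → state → country chain into one wide dimension row. All table and field names are made up for illustration; they are not from the demo:

```python
# Hypothetical normalized source rows, keyed by surrogate ids.
addresses = {10: {"street": "1 Main St", "state_id": 100}}
states = {100: {"state": "CA", "country_id": 1000}}
countries = {1000: {"country": "USA"}}
customers = [{"customer_bk": "CUST-1", "name": "Ann", "address_id": 10}]

def denormalize(cust):
    """Flatten the customer -> address -> state -> country chain into one wide row."""
    addr = addresses[cust["address_id"]]
    st = states[addr["state_id"]]
    co = countries[st["country_id"]]
    return {
        "customer_bk": cust["customer_bk"],
        "name": cust["name"],
        "street": addr["street"],
        "state": st["state"],
        "country": co["country"],
    }

dim_customer = [denormalize(c) for c in customers]
print(dim_customer[0])
```

After this step, a fact table needs a single join key back to the customer dimension instead of walking four related tables at query time.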
Tracking history: that’s a technique we’ll show later on, where we can track history within a dimension and tie that history into a fact table, so that you don’t have to know what’s taking place behind the scenes.
These models are ready for ad hoc analysis. A lot of the tools are built for that dimension and fact table approach, and they bring that model in.
And then you do a lot of drag and drop, using third-party tools such as Power BI, Cognos, Qlik, and Tableau, amongst others.
The first reason for having a model goes back to the nineties, when people said, I want to make a report on something.
Well, let’s say an application like Amazon when it first came out: it’s got a transaction database behind the scenes for when I go to place an order.
It had a very efficient database that took into account my order, the list of products, and my information, and created an order in their system. That’s called transactional processing in the database world.
And it’s great for feeding an application. So if I call up and say, where’s my order? It’s not shipped yet. Then someone pulls up a form on screen and says, I want to see this order number for Mr. Kinnier, and there it is. Those databases were built to make that process very efficient.
Now, the problem came about when some analyst said, well, I want to see what this product has done over the last six months, and then I want to compare it to the same six-month period from the prior year.
And so someone would write a very complicated piece of code, and then they would run it on the system. And in very large systems,
it might take a half hour to run, and during that half hour, when people are trying to place orders, or customer service is trying to service requests, the system is now bogged down by that reporting requirement, that query that’s running in the background.
And so that kind of reporting became taboo on the application side,
so that it doesn’t interfere with people, basically, giving you money, right? Which is the idea behind the transactional database:
you want to spend money, use the application.
Cost is an interesting part of this. Cost has always been what a lot of people think is a downside to the data warehousing and dimensional modeling process. And there is upfront cost. Yes, there is.
But over time, the reusability and the scalability of the model reduce the amount of time spent on query and report development, and that’s an important part.
I can’t count the situations where, as an analyst at a company, you spend a lot of time pulling data into Excel and mashing it up with lookups, putting in calculations. And every week, I have to do this. Now, I have to pull this new data in.
It takes me time to do that. I’ve got to parcel it out to different people in the company so that they can see it. That’s all X amount of time from people who should be spending their time on business decisions. And that’s the purpose of analytics.
Given the data, I look at it, and it tells me something; say, we need to create a campaign now to help one product line.
And now I go over and make a presentation to the marketing department.
That’s what analytics should do for a company: make the company better in some fashion or another.
And we want analysts to be spending their time thinking about that, making those decisions, rather than working on bringing all this data together and mashing it up, which a lot of analysts do spend a lot of time on. So the model does handle that. And even though there is upfront cost, in the end you should be a much more efficient company because of the process.
And you should be a much more profitable company, because you will make better, more timely business decisions, and that ends up in more profit. So what is the cost of dimensional modeling, and how does that all work together?
So, without a dimensional model, you wouldn’t be able to do some of this historical change tracking, and we use these terms:
SCD 1, SCD 2. SCD stands for slowly changing dimension, which tracks some of the historical changes, and there are different types of it.
By formatting dimensions and facts, the user doesn’t need to know about the programming.
Once it’s done, it gets handled overnight, and the fact tables and dimensions are all tied together. The user just has to drag and drop. Now, if you are doing that with files from a data lake, or just cobbling data together, you typically don’t have the ability to do that.
This is one of the big advantages of having a dimensional model: the ability to track historical changes over time.
Shared dimensions are another big part of the process: having one dimension that hooks into many different business processes.
Fact tables are typically business processes.
A sales order, a purchase order, inventory: these are real business processes that hook into a single product dimension and a single date dimension.
And because of that, we can report on those things over time, and in the same report.
And it’s all, once again, the whole idea of providing this to an analyst,
so that it becomes drag-and-drop, filter-and-slice functionality.
Then the date dimension. And this is something that a lot of third-party tools now have down pretty solid.
They have a way where, I believe in Tableau, you can right-click a date and just say, let me see the last six months.
So everything’s relative to whatever you’re trying to do with a single date, and a date dimension in the model is something that gives you a lot of flexibility. I have one date on my fact table.
And I have it connected to the date dimension.
And so if I want to see analytics over the last four years, then I can choose calendar year and take a look at it.
And then, to see it by month, I just grab the month field, pull that in, and off it goes. So it’s very flexible. And a date dimension can have many, many columns. The other part of it is, you know, it’s all connected with one key.
You can have many fields that are normally not otherwise available, unless your business has a particular reason to build them; fiscal calendars are a good example.
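As a rough sketch of what’s behind that flexibility: a date dimension is just one row per calendar day with precomputed attributes to slice by. The column names below are illustrative assumptions, not a fixed standard:

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Generate one row per calendar day, with the attributes analysts slice by."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),  # integer surrogate key, e.g. 20200131
            "full_date": d,
            "calendar_year": d.year,
            "month_number": d.month,
            "day_of_week": d.isoweekday(),          # 1 = Monday ... 7 = Sunday
            "quarter": (d.month - 1) // 3 + 1,
        })
        d += timedelta(days=1)
    return rows

dim_date = build_date_dimension(date(2020, 1, 1), date(2020, 12, 31))
print(len(dim_date))  # 366 rows, since 2020 is a leap year
```

A fact table then carries only the integer `date_key`, and every year/quarter/month rollup comes from joining to this one table.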
What is a data model? What are we talking about? I’ve thrown some terms around here, and I hope everyone can follow.
But let me take a moment to work through some of that.
The data model versus the data warehouse.
So, a data model could be anything, in the sense of what I mentioned before with Excel: if you pull different sources into Excel, mash them up, and create some calculations, that is a data model.
It is a simplistic one,
done ad hoc, but it’s still considered a data model.
A dimensional model takes it a step further.
It’s a very specific construct of dimensions and facts. There are different types of dimensional modeling, but the standard way, and the one we’ll demo today, is the Kimball-method data model, which is comprised of dimensions and facts. A data warehouse is a place where we store those dimensions and facts in the form of tables in a database.
To get loaded into the data warehouse, the data goes through many processes, ETL or ELT, which stands for either extract, transform, and load, or, the other model, extract, load,
and transform. In the end, you end up with tables that are either dimensions or facts in a dimensional model, and they are database tables.
Or, in the new data lakehouse experience, they could be files in a data lake.
But essentially, when I say tables or files, we’re looking at columns and rows, all interrelated.
What a data warehouse is not. This is interesting. I’m sure many of you have been at a company where they say, oh yeah, we have our data in a data warehouse, we have a data warehouse for everything, go take a look.
And you look and say, well, it’s not really a data warehouse; it’s something we would call an operational data store. That is, a replicated version of the application database, moved off onto its own server.
That way it doesn’t interfere with the application side, and we addressed that earlier: you don’t want to run reports on the application database.
And so the first thing people thought was, let’s get that over to a different server. There are replication technologies in databases, so when an order gets placed into one system, it just does a second insert into the other system.
It’s a very efficient way of keeping the systems in sync, and then the reporting people just report against the operational data store and don’t interfere with the application database.
And that’s very effective for removing the load from the application side.
But a transactional database is not set up in the right way to handle analytics. Take the example I mentioned before: show me this product over the last six months and compare it to the same six-month period last year. That’s not an easy thing. You have to write some complicated SQL, and do a lot of different things, in order to do it. And it can take some time to run as well.
With the dimensional model, we take that transactional data and move it into tables, dimensions and facts.
So that particular process, the six months over six months, is much easier to do.
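The six-months-over-six-months comparison can be sketched against a toy fact table keyed by an integer date key. This is a minimal illustration using SQLite; the table and column names are assumptions, not from the demo:

```python
import sqlite3

# A tiny fact table keyed by YYYYMMDD integer date keys.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(20200115, 1, 100.0), (20200320, 1, 50.0), (20190110, 1, 80.0), (20190605, 1, 20.0)],
)

# Same six-month window this year vs. last year: with date keys, each window
# is just a range predicate, not a complicated transactional query.
row = con.execute("""
    SELECT
      SUM(CASE WHEN date_key BETWEEN 20200101 AND 20200630 THEN amount END) AS this_year,
      SUM(CASE WHEN date_key BETWEEN 20190101 AND 20190630 THEN amount END) AS prior_year
    FROM fact_sales
""").fetchone()
print(row)  # (150.0, 100.0)
```

Against the normalized application schema, the same question would mean joining and filtering several operational tables; here it reduces to two range filters over one fact table.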
A place to store tables built to service one report: this, again, is what a data warehouse is not.
I’ve seen people who have what they call data warehouses.
Basically, take that example of the last six months over the prior six months, and you say, all right, it’s complicated SQL and takes about a half hour to run.
I’m going to run that overnight and then store the results in the database.
And that table will then service the report. So when someone runs it the next day, it’s a quick and easy report reading directly off the table; it doesn’t have to involve the complicated SQL.
So that’s also a good step in the right direction, from an operational data store perspective.
But what quickly happens is everyone tries to service all of their requirements that way, and so you get many, many tables with duplicated data all over the place, and it grows into an unmaintainable mess.
So, again, a data warehouse is not a place where you’re just constantly storing temporary tables to service one report.
And a data warehouse is not a data lake.
The data lake has a very specific use case.
And it’s a great thing for us in the data world that it has come about: we have a place to store any kind of data, whether it’s structured or unstructured. It could be videos or images or documents, things that don’t have a tabular structure, and we throw it into the data lake. And there are techniques built around that so you can do analytics off of those data types.
So the data lake has a wonderful use, but a data warehouse is not a data lake.
I’ll go through this kind of quickly: Inmon versus Kimball, the data warehouse methodologies.
These two were the first to come up with data warehousing in the mid-nineties, and Kimball has pretty much emerged as the one that’s used most often; we’re going to follow that today in our presentation.
I’d suggest, if you want a little more understanding, just Google Inmon versus Kimball, and you’ll see plenty of good articles out there on the differences between the two. Data Vault is also out there; it’s not quite data warehousing.
But it is a construct that has been made for larger organizations to make scalability easier, so if you want to look into the Data Vault technologies as well, please do.
For this, we’re going to look at the Kimball-method star schema approach. That’s the idea of having a fact table with one relationship to each of the dimensions. That, again, reduces the relationships in the query.
One last word here on how complicated a query can be: it’s now much less complicated, and simplified, when you only have the one relationship to worry about. And you can have many fact tables. Even though a star schema, as shown here, has one point in the middle that reaches out, which is what gives it the star shape, for the most part in a data warehouse you have many fact tables that can connect to the same dimensions. And that gives you a good way to combine the different business processes and report upon them across shared dimensions.
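The pattern of two fact tables sharing the same dimensions can be sketched in a few lines of SQLite. Everything here, the table names, keys, and figures, is made-up illustration of the star schema idea, not the demo data:

```python
import sqlite3

# A minimal star schema: each fact row carries one foreign key per dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_year INTEGER);
CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, sales_amount REAL);
CREATE TABLE fact_purchases (date_key INTEGER, product_key INTEGER, purchase_amount REAL);
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "Book A"), (2, "Book B")])
con.executemany("INSERT INTO dim_date VALUES (?, ?)", [(20190101, 2019), (20200101, 2020)])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(20190101, 1, 30.0), (20200101, 1, 40.0), (20200101, 2, 25.0)])
con.executemany("INSERT INTO fact_purchases VALUES (?, ?, ?)",
                [(20200101, 1, 12.0), (20200101, 2, 9.0)])

# Two business processes reported side by side across the shared product dimension.
rows = con.execute("""
    SELECT p.product_name,
           (SELECT SUM(s.sales_amount) FROM fact_sales s WHERE s.product_key = p.product_key),
           (SELECT SUM(b.purchase_amount) FROM fact_purchases b WHERE b.product_key = p.product_key)
    FROM dim_product p ORDER BY p.product_key
""").fetchall()
print(rows)  # [('Book A', 70.0, 12.0), ('Book B', 25.0, 9.0)]
```

This is exactly what a BI tool does behind the drag-and-drop: one relationship per fact-to-dimension pair, and the shared dimension is what lets the two processes land in the same report.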
OK, so the Kimball methodology. Dimensions are the nouns of the organization.
You have customer, product, employee; these are all nouns, and within the nouns are attributes. Now, a dimension is nothing more than a table in a database. We’re calling them dimensions because it’s a table constructed a certain way. And an attribute is nothing more than a field within that table. But the idea is that an attribute describes the dimension.
So, as you can see here, first name is an attribute of the employee dimension; it describes the employee.
Color is an attribute of the product dimension.
That is how we do dimensional modeling: we try to identify the nouns within the company that the reporting uses, and what the company considers its key performance indicators, how we can measure the success of our company, or the different parts of our company.
And dimensions and attributes are a big part of that construction, which allows us to look at the different aspects of the fact tables, because they are connected to the activities.
Now, the fact tables are events. So, what is going on in an organization?
As an event, a sales order gets placed on a certain date and time.
It has a certain amount.
(I’ll talk about measures after this.) Purchase orders are the same thing:
something that gets placed at a certain time and goes into a database with a timestamp on it.
Inventory: how much inventory do I have for a certain product on this day?
Hiring and attrition are HR department measures: are we hiring more people, or letting more people go?
And how do we measure those over time?
The measures: in a fact table, what we want is keys that point to the dimensions they’re associated with, but we also have the measures themselves.
So sales amount would be a critical piece to measure.
Cost would be another one: how much does this cost us, and how much are we selling it for?
And then there are the metrics we can build off of those: quantity on hand for inventory, and headcount for HR.
These are the measures we determine inside a company to tell whether or not we’re doing well.
When you do a dimensional model, you take the many fact tables with the shared dimensions and list them out in this bus matrix; it’s a key part of the Kimball methodology.
One of the deliverables for every contract I’ve ever been on is this bus matrix, and it’s simply Excel.
Down one side are the fact tables, or business processes, listed across all your dimensions here.
And where they interact with one another, they get an X. So a sales order has a date to it.
It has an item, and it has a customer. It doesn’t have a vendor, because vendors are who we buy items from, to make our products.
All right, so vendor is not going to be associated with sales order. But down here at bill of materials, you’ll see vendor will be there, because that’s where the vendor comes into play. So we’re not going to connect it to our sales order, but you can see there are a number of places where there are shared dimensions: item is in every one here, and we should have an X here for date in terms of the cost of an item.
We have a fact table based on that.
But that bus matrix is a key part of the dimensional modeling process.
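The grid the bus matrix captures is simple enough to sketch as plain data. The process and dimension names below are made up to mirror the example; the point is that the intersection of two rows tells you which dimensions a combined report can slice by:

```python
# A bus matrix as a dict of sets: business process -> dimensions it touches.
bus_matrix = {
    "sales_order":       {"date", "item", "customer"},
    "purchase_order":    {"date", "item", "vendor"},
    "bill_of_materials": {"item", "vendor"},
}

def shared_dimensions(matrix, *processes):
    """Dimensions every listed business process has in common --
    the ones you can slice a combined report by."""
    return set.intersection(*(matrix[p] for p in processes))

print(sorted(shared_dimensions(bus_matrix, "sales_order", "purchase_order")))
# ['date', 'item']
```

In practice the matrix stays in Excel as a design deliverable, but this is the relationship it documents.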
The bus matrix and dimensional model allow that. Once you have those shared dimensions across the fact tables, you pull them into, let’s say, a third-party tool, and you create a model out of it, and that model has relationships from the fact tables to the dimensions.
And analytics tools are typically equipped to handle this dimensional model.
They’ll identify that a foreign key on a fact table is the same as a primary key on a dimension.
And they’ll create the relationship for you.
And once you have that relationship, the third-party tool should just be drag and drop.
Filter and slice. That makes analytics much more efficient on the reporting side, on the analytics side, so that people can do more analytics instead of building out the model itself.
Tracking history: this is one of the things we’re going to demo. I’m going to show you what a slowly changing dimension looks like.
We’re going to show you how a customer changed over time, and how that got built into the process.
And once it gets built into the process, the fact table pulls the right keys to point to that dimension.
And once that’s done, the user doesn’t need to know about it. It’s just in there, and it’s working.
And that’s where a dimensional model adds a ton of value for the analyst, because, again, they don’t have to worry about a complicated query to write, or having someone write it for them.
Now, once the model is all tied together, we’re going to show in the demo how that adds a ton of value.
Alright, so, shared dimensions and type two dimensions.
I’m using Power BI as the third-party tool here.
It could be any of the tools that are out there, as I mentioned: Cognos, Tableau, anything that you’d want to use with a dimensional model.
And you can see here, I’ve built my own very small dimensional model.
Just to show how the sales and purchase orders are tied together: they both have a tie-in to the date dimension.
When a purchase order is placed, it’s on a date. Same thing with a sale: it’s on a date. So I can track that now.
If I have this common date dimension, I should be able to see the metrics that are associated with each fact table, and then slice by date, or, in this case, slice by product.
So we have product here; it’s part of both my sales and my purchase orders.
And with that, I want to understand how to go about reporting metrics on it.
Now that I have my shared dimensional model all set to go, let me go to this dashboard here.
And you can see, I have a list of products. They are books.
And this is all dummy data, so don’t look to find the source on this; I just made it up. But what it can show is that if I want to see fact sales on those books,
I have an amount here; this is my measure.
It’s on the sales fact table, and I just drag it on.
Now, notice: I’m the analyst, and I want to see what’s going on here. As the analyst, I didn’t do all the work behind the scenes. That’s our data warehouse.
Our data warehouse populated the dimensions, our data warehouse populated the fact tables, and we built a model, maybe a cube in this case. And now the analyst comes in and drags and drops; that’s all they do.
And you can see the breakdown here. It’s very quick, it’s very performant, and you can see the amount for each one.
And if I wanted to see a separate business process, my purchase orders, and see how that’s doing, I just come over and I drag it on again.
Now, if I had to write a SQL query behind this, or if I pulled, let’s say, a file from my data lake into a third-party tool, what would I have to do to set that up in order to produce something like this?
And that’s where a data model becomes very important.
We’ve given the analyst a playground to just drag and drop on.
Now you can see, with purchase amount and sales amount, I’ve broken it down by product, because the relationship from product is on both of those fact tables.
I also have date here, so if I want to click on 2020 and just see 2020, you can see the whole report here reacting and interacting with that selection.
So if I click on 2019, these are my sales and my purchases for 2019, for each of these.
That’s all from what we did in the data warehouse to load the dimensional model, the data model.
But the analyst now doesn’t have to worry about it.
And this is where the cost savings come in. Again, if it’s a three-month or six-month project, and there’s a lot of cost associated with that, the cost starts to come back to you over time, because your analysts are spending time doing good analytics instead of cobbling together different data sources.
All right, so that’s the benefit of shared dimensions. And I’m going to move right into type one and type two.
Now, this is tracking changes in history.
And we’re going to show a customer table.
Obviously, there aren’t too many people in our customer table at this point, but it’s going to show a very important concept.
When we load our dimensions overnight, we have a business key that’s associated with, in this case, myself.
And every order that gets placed has this business key for the customer; it’s my ID in a certain system. All right? So with that order comes that business key.
And we always mark the business key in the data warehouse to make sure that people know what it is. But as you can see, I’m in here three times.
And this is a primary key over here; it’s just an integer, but it’s unique for every record.
And in data warehousing speak, we call that a surrogate key, which is why it has this _SK at the end of it.
And when I load in, we’re trying to track changes.
So the system, every night when it loads and the customer dimension says this is the business key for AC Kinney, Hastens City of San Diego and the State of California.
And every night, it checks to see as this changed. And if it changes, what happens? So, in a Type two, I should mention, I should go back to step back and mention, Type one Attributes or CD one.
Type 1 is when you just overwrite something and you lose the history on it; someone's phone number is a prime example. If I had one phone number in San Diego and a different one in New Jersey, let's say, there's no real need to keep track of that change. No one's going to do analytics on when he had this phone number as opposed to that one.
So if I had a phone number column here, when I moved to New Jersey I would want all three of my records to have that new phone number. Because I don't want someone reporting on 2018 data to try to contact me and still get that old phone number. That's not useful. So for a Type 1 attribute, you just overwrite the history, and it's gone. You can't report on the earlier value. In this case, there's no need to, so we overwrite. That's a Type 1.
A Type 1 slowly changing dimension attribute, we should call it.
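A Type 1 overwrite can be sketched in a few lines. The table and column names here (dim_customer, customer_bk, phone) are made up for illustration, not the actual demo schema:

```python
# Minimal sketch of a Type 1 slowly changing dimension update: the new
# value simply overwrites the old one on every row for that business key,
# so no history is kept.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,
    customer_bk TEXT,
    phone TEXT)""")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "CUST001", "619-555-0100"),
     (2, "CUST001", "619-555-0100"),
     (3, "CUST001", "619-555-0100")])

# Type 1: overwrite the phone number on *all* rows for the business key,
# so no one ever picks up a stale value from an older record.
conn.execute(
    "UPDATE dim_customer SET phone = ? WHERE customer_bk = ?",
    ("973-555-0199", "CUST001"))

phones = [row[0] for row in conn.execute(
    "SELECT phone FROM dim_customer WHERE customer_bk = 'CUST001'")]
print(phones)  # every record now shows the new phone number
```

The point is that the UPDATE hits every record for the business key, which is exactly why the earlier value can never be reported on again.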
Now, city and state, that's important. I want to know what type of books someone is buying when they're in a particular state. If you're marketing and selling books, you might say, well, someone in San Diego might want a whole different set of books than someone living in New Jersey. And so I want to track those numbers for when he was living in New Jersey, as compared to when he was living in San Diego.
And that's a Type 2 change. That's what this process is here, and it's one of the biggest value adds when you're doing a dimensional model.
So overnight, when it sees I live in San Diego, California, it has a start date of 2018-05-01. And then all of a sudden, on 2019-07-15, I moved to Los Angeles.
And so, on that night, when it's processing, it says: take the business key, AC Kinney, and where does he live? Oh wait, we've got a change here. So instead of having one record and just overwriting San Diego with Los Angeles, I'm adding a new record in, with a new start date for me. And when this first happened, I would have an end date of 9999-12-31, an unreachable date, just to show that it's the current record.
We also have a flag here that says whether it's current, true or false.
And then, again, you can see on 2021-04-30 I moved to Randolph, New Jersey. Now I've got a new record, a third record, that comes in with that start date and gets the unreachable end date, while the prior record's end date gets updated. And its current flag gets set to true, and that's how my history gets kept in here.
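The nightly expire-and-insert logic can be sketched like this. The structures and column names are assumptions for illustration, not the exact warehouse code behind the demo:

```python
# Sketch of the nightly Type 2 logic: each row carries a start date, an
# end date, and a current flag. On a change we expire the current row and
# insert a new one; we never overwrite.
UNREACHABLE = "9999-12-31"

def apply_type2(rows, business_key, new_city, new_state, change_date):
    """rows: list of dicts representing the customer dimension."""
    current = next(r for r in rows
                   if r["customer_bk"] == business_key and r["is_current"])
    if (current["city"], current["state"]) == (new_city, new_state):
        return rows  # nothing changed tonight
    current["end_date"] = change_date   # expire the old record
    current["is_current"] = False
    rows.append({                       # add the new current record
        "customer_sk": max(r["customer_sk"] for r in rows) + 1,
        "customer_bk": business_key,
        "city": new_city, "state": new_state,
        "start_date": change_date, "end_date": UNREACHABLE,
        "is_current": True})
    return rows

dim = [{"customer_sk": 1, "customer_bk": "CUST001",
        "city": "San Diego", "state": "CA",
        "start_date": "2018-05-01", "end_date": UNREACHABLE,
        "is_current": True}]
apply_type2(dim, "CUST001", "Los Angeles", "CA", "2019-07-15")
apply_type2(dim, "CUST001", "Randolph", "NJ", "2021-04-30")
print(len(dim))  # three records: the full history for one customer
```

After the two moves, the dimension holds three rows for one business key, with only the newest flagged as current, which is the table shown on the slide.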
The one thing you might say is: ah, now for doing analytics on this, I have to understand what the date is and pick the right record, so I've got to query for that. Well, that gets handled again in the data warehouse, during fact table processing.
So in my fact table, I want to show that; let me go back here for a second. I'm looking at fact tables here.
I’ve got my Facts: Sales Data.
And these are my sales right now. You can see there's an ID, a primary key, here, but there's my customer surrogate key here.
And, if you remember, my keys are actually 1, 2, and 3.
So my sales are all the ones and twos and threes, and there are other sales from other people.
But when I built this fact table, I have an overnight process: my dimensions are loaded, and then my fact table processing goes. It says, well, I have this business key, AC Kinney, and I have a date of the sale. And that date falls in between the dimension's start and end dates.
So let me just grab that first date. The first date here is 2018-07-03. And if I go back to that customer table, when I'm building my fact table, it says: OK, I know it's this business key, but where does the date fall in here? The first record here starts in 2018, and the date is greater than this start date and less than that end date. So I'm going to use this surrogate key on the fact table.
So that's important: the fact table gets built at night, and it brings in the right keys, the surrogate keys, from the dimensions. So here I have two sales from when I lived in San Diego, because the timestamp fell between those dates.
Then I've grabbed a different surrogate key, from when I lived in Los Angeles, based on the timestamp and my business key. And all the threes down here are these two purchases from when I lived in New Jersey.
Our fact table process picks up on that. So let's say I have a source view for my fact table: Sales Order Source. In that view I'm doing some transformation, some cleansing, different things like that, but in the end, it's the records that come out.
And then I join it to my customer dimension on the business key. My business key has _BK at the end.
I'm joining my dimension on that business key, but I'm also joining on the transaction date being greater than or equal to the start date and less than the end date. And with that, in the select statement, I get the right surrogate key.
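That lookup join can be sketched like so. The table and column names (stg_sales, dim_customer) are hypothetical, but the join condition is the one described: business key plus date range, so each sale gets the surrogate key that was in effect at the time:

```python
# Sketch of the fact-load surrogate key lookup: join the staged source
# rows to the customer dimension on the business key AND on the sale date
# falling between start_date and end_date.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_sk INTEGER, customer_bk TEXT,
    city TEXT, start_date TEXT, end_date TEXT);
CREATE TABLE stg_sales (customer_bk TEXT, sale_date TEXT, amount REAL);
""")
conn.executemany("INSERT INTO dim_customer VALUES (?,?,?,?,?)", [
    (1, "CUST001", "San Diego",   "2018-05-01", "2019-07-15"),
    (2, "CUST001", "Los Angeles", "2019-07-15", "2021-04-30"),
    (3, "CUST001", "Randolph",    "2021-04-30", "9999-12-31")])
conn.executemany("INSERT INTO stg_sales VALUES (?,?,?)", [
    ("CUST001", "2018-07-03", 25.0),
    ("CUST001", "2020-01-10", 40.0),
    ("CUST001", "2021-06-01", 15.0)])

# ISO date strings compare correctly as text, so the range join works.
keys = [row for row in conn.execute("""
    SELECT d.customer_sk, s.sale_date, s.amount
    FROM stg_sales s
    JOIN dim_customer d
      ON s.customer_bk = d.customer_bk
     AND s.sale_date >= d.start_date
     AND s.sale_date <  d.end_date
    ORDER BY s.sale_date""")]
print(keys)  # [(1, '2018-07-03', 25.0), (2, '2020-01-10', 40.0), (3, '2021-06-01', 15.0)]
```

Each source row lands on exactly one dimension row, so the fact table ends up with the historically correct surrogate key, with no work required from the analyst.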
An important part of this: it happens overnight, so you don't need to know anything about it. You don't need to know how the tables got populated. So once it's done, what can I do with it? How does that help me?
So I'm looking at this table here, and it's just the names of customers and the amounts. And if I come over to my customer dimension, now I want to see the breakdown by state as well.
Now, if I had overwritten the state value, I wouldn't have kept that history, and all of my sales would show up in New Jersey. That's not good for analytics; I want to know what the state was at the time.
So if I drop it in, you can already see my amounts from when I lived in California versus when I lived in New Jersey, because I'm tracking state as a Type 2 attribute. And that's built in. Again, I just dragged and dropped, and I got accurate results. I didn't need to know what took place behind the scenes.
That's a huge value from a dimensional model perspective. I can feel confident that my development team has done what they should do, it's been QA'd and set to production, and now I've got this wonderful data at my fingertips. I don't even need to know how it does it; it's just doing it correctly. And say I want to take that next step and break it down by city: you can see I have purchases in both Los Angeles and San Diego.
And these can all be looked at over time too, because we have a date dimension here. If I put in a date key, now I can see the dates of each of my purchases, and those of the other people as well.
For all of these dates, anything from Los Angeles should be in that 2019 to 2021 timeframe. You can see the San Diego ones are the early 2019 and 2018 stuff. And then in 2021, I made purchases living in Randolph, New Jersey.
So you can see that that relationship has given me the history, and it allows me to report on it in a number of different ways.
So it's a huge value add for a dimensional model. People need to see what this can do when they start a project. They need that example of: wow, if we do it like this, then we get drag and drop functionality.
I'm going to go to the next slide and talk about one more piece of the dimensional model: the very important date dimension I mentioned earlier. If you have a date dimension hooked in, you have the one key from the fact into the date dimension table, and then you can build out a ton of different things that work for you. You can have a lot of different columns that your business wants to see: maybe week ending date, certainly the calendar year and the fiscal year you want to build in, and workdays and holidays. You've got flags for workday and holiday; since I know my holiday schedule, say I'm off this Monday. So if that's a holiday, it's set to Y, and then I can exclude holidays when I'm comparing my numbers against, say, a prior month or something to that effect. Anything like that you can build into the one date dimension.
Then there are relative date columns; a lot of the third party tools we look at have that date intelligence built in. So you can do those types of things.
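A date dimension like this can be generated programmatically. Here is a minimal sketch, where the holiday list and the April fiscal year start are assumptions for illustration only:

```python
# Rough sketch of building date dimension rows: each calendar date gets
# derived columns (calendar year, year-month, fiscal year, workday and
# holiday flags) so report authors never have to compute them in SQL.
from datetime import date, timedelta

HOLIDAYS = {date(2021, 7, 5)}  # hypothetical company holiday calendar

def date_dim_row(d):
    fiscal_year = d.year if d.month >= 4 else d.year - 1  # FY starts in April
    return {
        "date_key": d.isoformat(),
        "calendar_year": d.year,
        "calendar_year_month": f"{d.year}-{d.month:02d}",
        "fiscal_year": f"{fiscal_year}-{(fiscal_year + 1) % 100:02d}",
        "is_workday": "N" if (d.weekday() >= 5 or d in HOLIDAYS) else "Y",
        "is_holiday": "Y" if d in HOLIDAYS else "N",
    }

start = date(2021, 7, 1)
dim = [date_dim_row(start + timedelta(days=i)) for i in range(7)]
print(dim[4]["fiscal_year"], dim[4]["is_holiday"])
```

Once rows like these are loaded, excluding holidays or grouping by fiscal year is a filter or a drag and drop, not a SQL function call in every report.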
But I'm going to show a quick demo on date and see how it works. So this is our date dimension here. You can see it's one table with calendar fields: month, quarter, year, all associated with the date field. That's my key right there. And with that, instead of having to write a SQL function that says give me the year of the date field, which is a fairly easy one to do in SQL but you still have to write it out, I can just drag it across and see it.
You can see I’ve got these years here.
And if I want to see the numbers across here, I can just drag them out, and in these tools you can then see it, maybe as a bar chart. And since I have shared dimensions in there, my date dimension is shared with my purchase orders as well, so if I want to see purchase order amount, I just drop it in.
I don't know if that's the best visual; maybe something like this. All right, so I can see right now a lot more purchases than sales. 2018 was tough, but you can see things got better, a little dip in 2020, and then 2021 was good. And then we've started off a little slow here; maybe we should do something in terms of marketing and promotion. So that's the type of thing you can do once the data is already in there.
And you can also say: well, I don't want to look at it by calendar year; we do all of our budgeting based on the fiscal calendar. So let's take the year out, and I'm going to put the fiscal year in. Financial year, that's it; the fiscal calendar is called financial year here. I've put that in; oops, that's the wrong spot, it goes on the X axis.
So now I'm seeing it as 20-21, 21-22, because their fiscal calendar starts in April. And I'll just pull month across for this up here. You can see that the month name starts in April and goes to March, so the fiscal calendar is different; it crosses over calendar years, which is why it's listed as 21-22 or 19-20.
And so you have that ability. Also, sort ascending might not be the right way to sort this; I want that April-to-March ordering, so you can build the sort order into your model too, so that it sorts correctly.
What else with date? Let me just show you a piece here. We're going to add a field to the date dimension, just to show it. This is a pretty simplistic date dimension.
What does it take to add something to it? Adding it to a Power BI model, it really should be added to your data warehouse, and it will be, but we'll add it here as a new column on the date dimension, using Power BI's way of doing things.
But say I wanted to add a calendar year-month. Maybe I want to see it by year and then slice down to month; but what if, over the four years that I have, I want to see the year and the month together? So I derive a new field and add it to my date dimension.
It looks like I've already put it in there, so once you do that, let's demo it. I'll add it as a different field.
And so now, instead of using my financial year here, I'm going to use calendar year month, drop it on here, and now I've got it from August 2018 across all the months of the purchases and sales.
And again, that's a value add. Date dimensions often have hundreds of columns, even, because there's so much you can do: weekend dates, prior day flags, and so on. For the prior day: if my date is today, my prior day is yesterday, and that flag can change every night in your data warehouse, so you always have the ability to just filter on the prior day. There are a lot of things, depending on the business and how they use data, that you can put into the date dimension. And it becomes drag and drop; just like in this example, I added that in quickly, drag and drop, and it just works.
OK, just to finish off here, there are other pieces too that we're not going to demo. There are different types of fact tables that you can have. A snapshot fact table is very useful for things like inventory, and headcount in HR: things where the question is, what is my value for a particular day. If I have a quantity of 20 for an item one day, then 18 the next day because I sold two, and then I bought 20 so now I've got 38, and every day I'm inserting those amounts, those amounts don't add up over time to give me an accurate picture. 20 plus 18 plus 38 is not going to give me the right number for how much I have on hand. But snapshot tables work in conjunction with semi-additive measures, and Power BI and other tools can handle semi-additive measures. I'm saying: don't add it up over time, but do add it up over, let's say, product. I want to be able to say how many products were on hand on this day.
But at the end of the month, if I want to see what my inventory was on a monthly basis, then I just want to see the last day of each month.
And those semi-additive measures allow for that. There's a lot of functionality that can be done with different types of fact tables; an accumulating snapshot is another.
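The last-day-of-month behavior of a semi-additive measure can be sketched like this, with made-up snapshot numbers:

```python
# Sketch of a semi-additive measure over a daily inventory snapshot:
# on-hand quantities sum across products on a given day, but across time
# you take the last snapshot, not the sum of all days.
snapshots = [  # (date, product, on_hand)
    ("2021-01-30", "widget", 20),
    ("2021-01-31", "widget", 38),
    ("2021-01-30", "gadget", 5),
    ("2021-01-31", "gadget", 7),
]

def on_hand_last_day(rows, month):
    """Month-end on-hand: last snapshot date in the month, summed over product."""
    month_rows = [r for r in rows if r[0].startswith(month)]
    last_day = max(r[0] for r in month_rows)
    return sum(r[2] for r in month_rows if r[0] == last_day)

print(on_hand_last_day(snapshots, "2021-01"))  # 45: additive over product, last-value over time
```

Summing every daily row would double-count stock that sat on the shelf; taking the last day's snapshot gives the month-end position, which is what BI tools' semi-additive aggregations do for you.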
You'll want to look that up; it handles different updates for each entry. Then there are many-to-many relationships. We showed you the star schema, which is essentially a one-to-many relationship from the fact table to the dimensions, but there are certain situations that require a many-to-many relationship to be built in and handled, and there are dimensional modeling processes for that as well.
So that's just touching on a few of the other pieces that add to the value of having a dimensional model.
OK, and with that, I'm going to hand it back to you at this time.
OK, great. Thank you so much, Andy.
As is the scope of today's webinar, we have some offerings from Senturus to help you with your BI. If you need any help with business requirements, data modeling, cloud and data architecture, both modernizations and migrations, data governance and administration, or reports and visualizations, we can help with all of those. So please reach out; we'll have the contact information on a slide at the end.
Additional resources: have you seen our website? We recently updated it, and we provide hundreds of free resources there. Go to the Knowledge Center on the website to get product reviews, tips, insider views, demos, and to learn more about upcoming events like today's.
Speaking of upcoming events, we have a great one coming up in a few weeks.
We'll talk about pairing Cognos with Power BI and/or Tableau. That's on July 14th, so please register on our website or use the link shown right there.
Just stick with me for a couple of seconds as we kind of go through a brief Senturus overview.
If you don’t know who we are, we concentrate on BI modernizations and migrations across the entire BI stack.
We provide a full spectrum of BI services, training in Power BI, Cognos, and Tableau, and proprietary software to accelerate bimodal BI and migrations. We particularly shine in hybrid BI environments, with some of the tools we are using today.
And finally, this isn't our first rodeo. We have been focused exclusively on business analytics for over 20 years. Our team is large enough to meet all your business analytics needs, yet small enough to provide personal attention.
And then, we are hiring. If you are interested in joining us, we're always looking for skilled people for a number of positions, specifically a managing consultant. For the description of this position, as well as other openings, check out the Careers link on our website.
Now, the fun stuff: Q and A. We've got a couple of questions from throughout the webinar today, and I'm going to go through those.
What are some ways to connect multiple fact tables in a single model?
Yeah, so, going back to that piece of the dimensional model I showed: it works when you have common dimensions, something like customer, or even product, right? Product is involved with almost everything. Maybe not for HR, but for most things.
If you're in a manufacturing or retail environment, it's essential, when you build your data warehouse, that you connect as many fact tables to it as you can. So obviously purchase orders, sales orders, inventory: you have a separate fact table for each of those, and each has a one-to-many relationship to the product dimension.
So when you create your data warehouse, you're defining and loading your dimensions first, and then the fact tables check against those dimensions, much like it is here, to get the business key. I'm trying to highlight this piece here: so that's my join to the customer dimension, we're talking customer here, and it gives me the business key.
And if I'm doing something else for customers, say invoices, I'll have a customer on the invoice, and I'll do the same thing: a left join from the invoice source gives me the customer key, and then I join on the date to my invoice dimension in the same statement here, and I come up with an invoice key.
All right, so many dimensions can be handled, with multiple fact tables joining to the same dimensions. I should restate that: invoice would be a separate fact table that joins to the customer dimension as well, so you would have two fact tables, fact sales order and fact invoice, and both of those would join back to the customer dimension.
Once that construct is in place, you're doing just what you have here in this demo, showing the amounts from each, and this is product here. So product is connected to both fact sales order and fact purchase order, and you can see this breakdown once that construct is made.
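That two-fact, shared-dimension construct can be sketched with a toy schema (all names hypothetical):

```python
# Sketch of two fact tables sharing one product dimension: each fact joins
# one-to-many to dim_product, so sales and purchases can be broken down
# side by side off the same dimension.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_sk INTEGER, name TEXT);
CREATE TABLE fact_sales (product_sk INTEGER, amount REAL);
CREATE TABLE fact_purchases (product_sk INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 30.0);
INSERT INTO fact_purchases VALUES (1, 60.0), (2, 20.0);
""")

# One row per product, with each fact aggregated through its own join path.
rows = list(conn.execute("""
    SELECT p.name,
           (SELECT SUM(amount) FROM fact_sales s
             WHERE s.product_sk = p.product_sk) AS sales,
           (SELECT SUM(amount) FROM fact_purchases b
             WHERE b.product_sk = p.product_sk) AS purchases
    FROM dim_product p ORDER BY p.product_sk"""))
print(rows)  # [('widget', 150.0, 60.0), ('gadget', 30.0, 20.0)]
```

Aggregating each fact separately before combining them on the shared dimension is what keeps the totals right; a naive single join of both facts at once would fan out the rows and inflate the sums.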
Perfect. Thank you.
Another question about fact tables: do you rebuild the fact table completely on each load, or load only the changes? And what about cancellations, or how do you show retroactive changes in a customer dimension?
Lots of questions in one there. Yeah, those are good data warehouse ETL questions. For the fact tables, you look to do the inserts, and you look to handle deletes as you bring in the data. Typically, the way I've done it in the past, you handle the inserts differently than you handle updates and deletions. And commonly there aren't too many deletions.
There was one client where I had a fact table where even the same order could come in with a different amount. So it's a change in the order, not necessarily a deletion, and we handled that by adding a new row for that change. So if the amount was 100, and the order changed before it shipped, they changed it from 100 to 60, we put in a new row with the change in it. That's how the company wanted it reported.
February had a quantity of 100 of a particular item. They wanted that reported in February, and they wanted negative 40 reported in March, because February was now short and closed, and we had to handle it in the March numbers: we had to subtract 40, because the customer changed the order before it shipped.
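The delta-row approach described here can be sketched with made-up order data:

```python
# Sketch of the delta-row technique: instead of updating or deleting the
# original fact row, an order change is recorded as a new signed row, so
# each month's reported total reflects when the change happened.
fact_rows = [("2021-02-10", "ORD-1", 100)]   # February: ordered 100

def record_change(rows, change_date, order_id, new_qty):
    booked = sum(q for _, o, q in rows if o == order_id)
    rows.append((change_date, order_id, new_qty - booked))  # signed delta

record_change(fact_rows, "2021-03-05", "ORD-1", 60)  # order cut to 60

# Monthly totals: February keeps its closed 100, March absorbs the -40.
by_month = {}
for d, _, q in fact_rows:
    by_month[d[:7]] = by_month.get(d[:7], 0) + q
print(by_month)  # {'2021-02': 100, '2021-03': -40}
```

The net across all rows for the order is still the true quantity of 60, but the closed February period is never restated, which was the reporting requirement in that example.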
So there are different techniques to handle it, depending on the business processes and how the company works. But yes, inserts, updates, deletes, they all need to be handled, and they need to be handled efficiently, which can be challenging at times, depending on how much control you have over the source systems. So it is a good question. I think you have to come up with the techniques that handle it for your specific situation.
The last question is kind of a unique one: are there different modeling approaches based on whether the database is a row-based or a columnar database? That is, is there any difference in the way you model based on whether it's a standard row-based database like SQL Server versus a columnar database like Vertica or Redshift?
Yes. Power BI is also a columnar database. By that I mean the old way of indexing is row-based indexing; that's the standard way, and it's been done for a while. Technologies in the last 15 years or so have shifted to columnar indexes. Say I have two billion rows in a fact table.
But I only have 200 customers, right? They're buying lots and lots of stuff. And with 200 customers, you can index those 200 values much more efficiently across two billion rows. So instead of an index built across two billion rows, the index is built off the distinct values in the column.
Now, dimensional models are part of the reason this form of indexing came about, at least the newer technology of 15 years ago. The row-based indexing that was created for transactional or application databases was very efficient and worked very well for that. But when you change how you use the database, not for transactions but for reporting, say these six months compared to the prior six months, a columnar index is going to be much more efficient. And in part, the Kimball method was a big factor in that coming about.
So yes, especially with each of these things. I could have 300,000 rows in a product dimension, but if there are six colors available across those products, it's going to index on those six values in that column, and it's much more efficient to handle.
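The low-cardinality idea behind columnar indexing can be illustrated with a toy dictionary encoding; this is synthetic data, and real engines do far more than this:

```python
# Toy illustration of why columnar storage loves low-cardinality columns:
# a color column over many rows dictionary-encodes down to a tiny lookup
# table plus one small integer code per row.
colors = ["red", "blue", "green", "red", "blue", "red"] * 50_000  # 300k rows

dictionary = sorted(set(colors))          # the few distinct values
code_of = {c: i for i, c in enumerate(dictionary)}
encoded = [code_of[c] for c in colors]    # one small int per row

print(len(dictionary), len(encoded))      # 3 300000
```

Scanning or filtering on color now only has to deal with three distinct values and a stream of tiny codes, which is the intuition behind why columnar engines answer "these six months versus the prior six" style reporting queries so quickly.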
Great, Thank you.
So we're at the top of the hour. I want to thank Andy for his time and presentation today, and I want to thank everyone for joining us and tuning in. I'm going to leave the Q and A panel open a little longer in case you have any questions that just came up, or you want to put in one more; we will answer those and post them on the website at a later date. So if you have any questions, go ahead and put them in over the next couple of minutes, and then I'll end the session, and we'll post those, like I said, in a couple of days. Thank you.