Reflections from TDWI’s Data Science Boot Camp
I have been intrigued by the topic of data science for the past couple of years, ever since our world of traditional business intelligence and analytics has been disrupted by the convergence of classic data analysis and the worlds of advanced analytics, statistics, mathematics, data mining and predictive modeling.
While these are all fascinating concepts, I am skeptical of the potential value of any “shiny new object” that is receiving so much attention. As with the “big data” hype of the past several years, the first layer to peel away is that of the terminology itself. What the heck is data science? How is it different from what we’ve been doing with data all along?
The opportunity to answer some of these questions came with the Data Science Bootcamp offered at the TDWI Conference this August. I headed off to see what I could learn and I was not disappointed.
How Data Prep Figures in Data Science
The first of the Bootcamp’s four half-day sessions was led by a bona fide data scientist, Dean Abbott. He delivered valuable insights throughout the first overview session and in his follow-up session on the topic of Preparing Data for Predictive Modeling.
This data preparation topic is the one that we traditional data geeks get the most jazzed about. It is fascinating to see an entirely new world of data engineering that goes in such a different direction from the old-school concepts of star schemas and EDWs.
As we know across the spectrum of data analysis, preparing the data is hugely important and is typically a significant proportion of the overall effort. I would say that data prep is even more prevalent in the data science and predictive modeling domains. My goal here isn’t to dig deep into the technical minutiae, so I won’t go beyond that. Let’s just say that preparing data for these activities does not follow any of our traditional models, and is not typically an activity that is handled very well, if at all, by any of our traditional software tools.
Okay…Data Science is a Form of Science
As I stated earlier, I’m skeptical of any buzzword that gets as much media attention as data science has over the past couple of years. However, seeing a room full of a couple hundred professionals who proudly wore that title on their business cards, I kept an open mind.
What I quickly took away from the deep and repetitive dives into the process (check out CRISP-DM if you want to know more) is that what this whole world really centers around is the scientific method. I found this fact comforting. What predictive modeling, data mining and data science (or whatever it ends up being called) all had in common was an iterative process of observation, measurement and experimentation – all resulting in the formulation, testing and modification of hypotheses.
I remembered much of this methodology from my days of studying statistics and economic theory in college. Attending this seminar on data science brought me back to the college days, perhaps just because the majority of practitioners in this space are recent grads. But there was a level of energy, excitement, uncertainty and emotion in this space the likes of which I haven’t experienced since the glory days of the late ’90s business intelligence explosion.
So, feeling a bit more grounded that what was going on here was truly, and accurately, a form of science, I settled back in to hopefully get my entire head around at least the broad strokes of how it all worked.
Breaking Down Terminology
Let’s go back for a minute to breaking down the dangerously misinterpreted and loaded terminology that can be thrown around. Hopefully my above breakdown of the term data science is clear enough. However, this field is also referred to as data mining (DM), predictive analytics, advanced analytics, machine learning (ML) and others.
Think of data science as the encompassing field and the activities of mining data (via machine learning algorithms), building predictive models and performing advanced analytics as the specializations within this field. It is sort of like the field of medicine with its many areas of specialization. An M.D. can be a primary care physician, a research fellow, a surgeon or whatever. Hope that helps with some of the vocab.
Open Source Solutions
From a software perspective, we know that there are a lot of new tools, and new features being added to existing tools, that help facilitate these activities. Of course the software vendors see this as a potential bonanza since the world of traditional BI has really reached its maturity. However, a lot of the practitioners leverage open-source solutions. I’ve seen a lot of demos of the brand-name solutions. I was impressed that the open-source platforms had similar GUI-based workbenches and available extensions to support hand-coding in R and Python.
The best one I’ve seen so far is KNIME. If you’re itching to get started, it has an easy (and free!) download link on its site, along with some cool resources. If you want to take things to the next level, it has enhanced offerings for which you can pay. Another cool resource for novices like myself is Kaggle. Kaggle is best known for its data science competitions. It also offers a great resource for real-life data sets on which you can start to build your data science chops.
Data Visualization and Data Science
Coming from the traditional BI space, I’m familiar with the value of visualization. However, I was not as comfortable with how visualizations might have practical applications in the field of data science. Dr. Deanne Larson helped me see the light in the afternoon of day 2.
The first place we typically see the use of visualization in data science is in the process of assessing the quality and “fitness” of the data sources and individual features (think of features as columns of data) that we intend to ingest into our models. How are the values distributed? Do they fit a normal distribution? What are the outliers? This exercise cannot be effectively completed without visual displays of the underlying data. This process is often referred to as data profiling. Think of histograms, box plots and scatter diagrams.
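To make the profiling idea a bit more concrete, here is a minimal sketch in plain Python. It is my own illustration, not something presented at the Bootcamp. The function names, the sample values and the 1.5 x IQR outlier rule (the same convention a box plot typically uses for its whiskers) are all assumptions chosen for the example.

```python
import statistics

def profile_feature(values):
    """Summarize one numeric feature and flag outliers.

    Assumption: outliers are values outside 1.5 * IQR of the quartiles,
    the usual box-plot convention. Other rules are equally valid.
    """
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }

def text_histogram(values, bins=5):
    """Count how many values fall into each of `bins` equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical feature: order quantities with one suspicious entry.
order_qty = [10, 12, 11, 13, 12, 11, 10, 12, 95]
summary = profile_feature(order_qty)
print(summary["outliers"])
for bucket, count in enumerate(text_histogram(order_qty)):
    print(bucket, "#" * count)
```

Even this crude text histogram makes the lopsided shape of the data obvious, which is the point of profiling before any modeling begins. A real workbench would draw a proper histogram or box plot instead.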
Visualization can also play a valuable role in data preparation, the next step in the process. Assessing the profile of data before and after our various data preparation tasks assists with feature selection, data quality and overall model design. Data quality scorecards and dashboards are common applications at this stage.
As we move into the really sexy part of the process, the modeling, visualization starts to play an even more crucial role. Classification, clustering, decision models, association, forecasting and prediction are just a few of the techniques used at this stage, and none of their output is really meaningful without advanced visualization. Think of decision trees, chord and network diagrams, geospatial displays, line charts and area charts as ways to express the outcomes of various models.
Finally, as scientists, it is our duty to assess the appropriateness of the analytical approach we are taking. Evaluating the performance of a model, or how good it is at predicting results based on actual observations, is another area in which data visualization is a huge help. Each type of model approach has its own optimal assessment criteria. For example, if you are applying a classification model to the data, gain (or lift) charts are a useful analytical tool.
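For the curious, the numbers behind a gain or lift chart are simple to compute. The sketch below is my own illustration in plain Python, under a few stated assumptions: a binary outcome coded as 1 or 0, a model score where higher means "more likely", and a function name and sample data that are purely hypothetical. Cumulative gain is the share of all actual positives captured in the top-scored slice of records, and lift is that gain divided by the share of records contacted.

```python
def cumulative_gain_and_lift(y_true, scores, n_bins=10):
    """Return (fraction_of_records, cumulative_gain, lift) for each bin.

    Assumes y_true holds 1 for a positive outcome and 0 otherwise,
    and that a higher score means the model expects a positive.
    """
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    if len(y_true) < n_bins:
        raise ValueError("need at least one record per bin")
    # Rank the actual outcomes from highest to lowest model score.
    ranked = [label for _, label in
              sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)]
    total_pos = sum(ranked)
    if total_pos == 0:
        raise ValueError("need at least one positive outcome")
    n = len(ranked)
    rows = []
    for b in range(1, n_bins + 1):
        cutoff = round(n * b / n_bins)      # records in the top b bins
        gain = sum(ranked[:cutoff]) / total_pos
        frac = cutoff / n
        rows.append((frac, gain, gain / frac))
    return rows

# Hypothetical scored records: did the customer respond (1) or not (0)?
actual = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
score = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
for frac, gain, lift in cumulative_gain_and_lift(actual, score, n_bins=5):
    print(f"top {frac:.0%}: gain {gain:.2f}, lift {lift:.2f}")
```

Plotting gain against the fraction of records gives the gain chart, and plotting lift gives the lift chart. A model no better than random would show a lift of about 1 at every slice, so the further the early slices sit above 1, the more useful the model is for prioritizing.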
Wrapping It Up
The concepts explored in this post are broad and each of them deserves a much deeper dive. After reading this post, I hope you were able to take away a better understanding of data science, and you can benefit from some of the same clarity I enjoyed upon graduating from the TDWI Data Science Bootcamp (yes, I’ve included a pic of my certificate).
Thank you to Albert Valdez, our VP of Learning Solutions, for this blog. Albert has more than 17 years of experience in business intelligence education and technical training. Albert founded and runs the Senturus training division and also serves in various roles in the company including senior consultant and solutions architect. He has overseen the growth of the Senturus training practice from a few Cognos authoring classes to dozens of courses covering the breadth of Cognos Analytics and Tableau.