There is a broad and fast-growing interest in data science and machine learning. It is fueled by an explosion in business applications that rely on automated detection of patterns and behaviors hidden in data. Software can find these patterns and exploit them to dramatically improve the way we market and sell products, optimize inventory and supply chains, detect fraud, and support customers. In short, data science and machine learning improve how we make decisions in a wide range of situations based on patterns found in data.
For decades, mathematical modeling in business belonged to an obscure area at the intersection of business and IT. Now it is moving into the mainstream and the rush is on: Where do we find data scientists, how do we train them, and what tools do we give them? Is there a way we can scale analytics and data science to the point where they become a normal aspect of any software development project?
This series of blog posts is addressed to software engineers and technology managers who want to understand, in simple terms, how data science and machine learning are used to solve common business challenges.
Learning data science with sentiment analysis and opinion mining
In thinking about the best ways to expose a large number of programmers to the basics of data science and machine learning, we took the same approach that helped introduce Java Spring to millions of developers: the Pet Clinic, a teaching-oriented demo application that is intuitive enough that any developer can relate to its business goals, complex enough to represent real-world requirements, and simple enough to keep the developer from being overwhelmed by complexities found in real-world business applications.
“Social Movie Reviews” is what we’re calling our “Pet Clinic for data science and machine learning,” and here is how we are going to use it to expose you to the world of data science:
- Take a common business application for real-time analytics. We chose an automated public sentiment analysis of Twitter feeds about a selected group of the latest movies. Movie reviews — specifically comments by Twitter users about these movies — are good case study subjects because everyone can relate to the idea, and all the necessary ingredients (dictionaries, training sets, models, and APIs) are freely available as open source.
- Create an end-to-end demo application and open source it. People learn best when they can relate to a business problem, then see a complete solution to that problem end-to-end; play with the resulting business app; examine the technology that makes that application work; then zoom in “under the hood” to understand the relationship between its various parts. To make data science concepts accessible for teaching purposes, we built a simple web application to visualize Twitter analytics data using only open source components. Then we opened up all of the code we used to create it.
- Provide a complete “cloud lab” to run the application and play with it. People learn best by interacting with the system they are trying to understand: running it, testing it, modifying it, making it fail. Yet one of the biggest barriers to entering the field of data science is the sheer number and complexity of the tools we use to collect the data, store it, model it, implement the models, and finally run the models at full scale. We remove much of that barrier by packaging the entire “data science lab” so that it deploys on the cloud with a single click, pre-assembles a powerful event processing infrastructure, and gives you a web client application that acts as a controller and visualizer of the analytics results.
- Finally, we expose and document the “data science toolkit” behind the product: how we went about building the system; what components were chosen and why; what model training approaches were used and why; and what happened as the end result. This data science toolkit is captured in a series of blog posts — this being the first one.
This data science guide explains how we built our Twitter sentiment analysis application in three parts: First, we discuss the data science process and key machine learning terminology. Second, we explain how to understand and process the raw data using dictionaries, machine learning, and test data sets. Third, the guide reviews how to tune the model and visualize insights derived from it.
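To give a first feel for what processing raw data "using dictionaries" can mean, here is a minimal Python sketch of dictionary-based sentiment scoring. It is an illustration only, not the code behind Social Movie Reviews: the tiny `SENTIMENT_LEXICON`, the function names, and the simple "sum the word scores" rule are all assumptions made for this example.

```python
import re

# Hypothetical toy lexicon, assumed for illustration only.
# A real sentiment dictionary would contain thousands of scored words.
SENTIMENT_LEXICON = {
    "great": 1, "love": 1, "amazing": 1, "fun": 1,
    "boring": -1, "awful": -1, "hate": -1, "terrible": -1,
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_tweet(text, lexicon=SENTIMENT_LEXICON):
    """Sum the lexicon scores of all tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))


def classify_tweet(text):
    """Map the numeric score to a coarse sentiment label."""
    score = score_tweet(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in [
        "I love this movie, it was amazing!",
        "What a boring, awful film",
        "Saw a movie today",
    ]:
        print(classify_tweet(tweet), "<-", tweet)
```

A lookup like this ignores negation, sarcasm, and context, which is one reason a dictionary is usually combined with a trained model rather than used alone.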
This blog series is also a logical companion to our series of blog posts on In-Stream Processing, a popular approach to building a computational platform for mathematical analysis and machine learning. We use our In-Stream Processing Service Blueprint as the computational platform for this data science tutorial.