June 13th, 2013 by Chance Coble

This video walks through a more agile approach to building data products. Remember, in agile development, working features are valued over comprehensive documentation. In business intelligence, the key to a lean analytics development effort is to start with a minimal investment of time and resources and then improve the solution through incremental changes.

To illustrate this point, I focus on some telecommunications data from a client, with identifying data removed for the demonstration. Telecommunications companies use switches that often write every single call they handle into a log file, creating a big data repository. Companies bound to traditional, smaller-data tools see detailed investigation of these records as prohibitively expensive, so the records are typically archived for regulatory reasons and only gross aggregates are stored. The limitation is that creating a data warehousing solution that can effectively load, aggregate and clean all of this data strains development capacity. While call detail mediation platforms exist, they often cost $10,000-$100,000 in licensing and then require additional customization.

The dashboard I present in the video uses features that come right out of the box in Yellowfin, so additional development time is replaced by some rapid report assembly and data linking. This analysis, which provides information on each trunk group, was assembled in just a few hours (including loading the data from the log files). The Yellowfin features we are taking advantage of include linked filtering, which allows us to analyze a single connection group; report summaries, which provide dense presentations of dimensions and metrics; drill-through, which lets us get to individual records through search and click functionality; and brushing and linking, which tie together analysis across these reports so we can understand patterns and trends for each connection group.
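To make the idea of a per-trunk-group summary with drill-through concrete, here is a minimal sketch in Python of the kind of calculation such reports perform. It is not how Yellowfin implements these features, and the field names `trunk_group` and `duration_seconds` are hypothetical; the client data's real field names are not shown in this post.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def summarize_by_trunk_group(calls: Iterable[Mapping[str, object]]) -> Dict[str, Dict[str, float]]:
    """Count calls and total minutes per trunk group.

    Assumes each record has the hypothetical fields 'trunk_group' and
    'duration_seconds'; records missing them fall back to defaults.
    """
    totals = defaultdict(lambda: {"calls": 0, "minutes": 0.0})
    for call in calls:
        group = str(call.get("trunk_group", "unknown"))
        # Raw log values may arrive as strings, so convert defensively.
        seconds = float(call.get("duration_seconds", 0) or 0)
        totals[group]["calls"] += 1
        totals[group]["minutes"] += seconds / 60.0
    return dict(totals)


def drill_through(calls: Iterable[Mapping[str, object]], trunk_group: str) -> List[Mapping[str, object]]:
    """Return the individual records behind one trunk group's summary row."""
    return [c for c in calls if str(c.get("trunk_group", "unknown")) == trunk_group]
```

The summary function plays the role of a report summary, and `drill_through` mimics clicking from an aggregate down to the individual call records behind it.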

The reasons for not engaging in this kind of development are common across many industries today. Machine-generated data is high volume and high velocity, and a development cycle could include expensive tooling and many development hours to design a data warehouse and process the data into it. However, statistics like the ones you see here are quite high value, as is the ability to search for individual records with techniques like drill-through.

The lean approach for data product construction is to invest as little as possible until we know we are going to get some value out of a data product. In this case, rather than going through a lengthy extract, transform and load (ETL) process into a fully designed data warehouse solution, we are simply loading the log files directly into MongoDB. MongoDB is an open source NoSQL database that is fast and easy to scale to terabytes of data. Because of the features of MongoDB, we are able to easily load billions of minutes of raw call data into it. The video shows a few simple collections that have been created by the load process, and that the raw call data is quite nasty. It contains 170 interdependent fields that include custom formatted time values and even snippets of XML. All of these would typically require further processing for real information retrieval. Remember, the idea here is to use a lean approach of simply starting with the minimum, and then to grow our solution’s features incrementally as we learn and validate our work at each step.
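As a rough illustration of what "simply loading the log files" can look like, here is a minimal sketch in Python. It is not the loader used for the video. It assumes the log lines are comma-delimited, that the field names are supplied separately, and that the third-party `pymongo` driver is installed; the database name `telecom`, the collection name `call_detail` and the connection URI are all hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator, List


def parse_call_log(lines: Iterable[str], field_names: List[str]) -> Iterator[Dict[str, str]]:
    """Turn delimited log lines into one dict per call, keeping raw string values.

    Assumption: the log is comma-delimited. The real switch format is not
    described in this post, so adjust the parsing to match your files.
    """
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        # Lean approach: no cleaning or type conversion yet, just name the fields.
        # Extra or missing values are silently truncated by zip().
        yield dict(zip(field_names, row))


def load_into_mongodb(docs: Iterable[Dict[str, str]],
                      uri: str = "mongodb://localhost:27017",  # hypothetical server
                      db: str = "telecom",                     # hypothetical name
                      collection: str = "call_detail",         # hypothetical name
                      batch_size: int = 1000) -> int:
    """Insert documents in batches and return how many were written."""
    from pymongo import MongoClient  # third-party; imported here so parsing works without it

    coll = MongoClient(uri)[db][collection]
    batch, total = [], 0
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            coll.insert_many(batch)
            total += len(batch)
            batch = []
    if batch:  # write the final partial batch
        coll.insert_many(batch)
        total += len(batch)
    return total
```

Note that the values are stored exactly as they appear in the log, including any custom time formats and XML snippets. Parsing those is deliberately deferred until the data has proven its value.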

Given that the raw log files were simply loaded into MongoDB, we need to create a connection to it from Yellowfin. We use DataCurrent as a virtual database to connect Yellowfin directly to our loaded call records in MongoDB, and we test our connection to verify the fields coming through. The video illustrates our connection string using DataCurrent, and testing it reveals the collections available for interrogation. The view builder also shows the call detail table and all of the fields, which came through automatically as metadata. This simple connection directly to a high performance repository of the raw log data provides simple analysis so that we can begin to see whether this is useful information to report on.
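The same kind of field check can be done outside Yellowfin. The sketch below is a minimal Python example, not part of DataCurrent or Yellowfin, that reports which field names appear in a sample of documents and how often. It assumes the documents are plain dictionaries, which is what `pymongo` returns from a query such as `collection.find().limit(1000)`.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def field_coverage(docs: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Return, for each field name, the fraction of sampled documents containing it."""
    counts = Counter()
    total = 0
    for doc in docs:
        total += 1
        counts.update(doc.keys())  # count each field once per document
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}
```

A field with low coverage is a hint that it is optional or only written for certain call types, which is worth knowing before building a report on it.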

The entire dashboard shown in the video represents a set of data product features that were assembled in just a few hours, using a lean development philosophy and lightweight, open source tools. The smallest investment was made first to see whether the data contains useful information. Next, we can incrementally build and verify the solution with additional features, business rules and, in time, even predictive calculations.