Everyone is talking about Big Data, Cloud and Data Science, and as a result Data Scientists are in high demand. At managetopia it is no different: we have seen an increase in client requests for support on Data Science projects. In the past, clients relied on SQL databases and Excel for basic analysis, but today that no longer satisfies them. They need something more powerful.
Although managetopia has watched the boom in Data Science over the years, we were still surprised when a long-term client approached us and asked us to build a predictive model. The goal was to conduct a feasibility study on whether a forecasting model could support investment decisions. The question raised: can conclusions about the success of companies today be drawn from various KPIs from the past?
Apart from the time needed to build the model, a large share of the workload went into data cleansing. Most of the data sources were Excel files (approx. 100 files containing information about different companies) and, as is typical for Excel, the data was spread across a multitude of sheets with differing structures. At first it therefore seemed clear that the solution would be developed with Microsoft Office tools: Excel and VBA. After drafting the specifications, however, it became clear that an alternative was needed, or at least a tool to supplement Excel and VBA.
The forecasting model was to be built using machine learning techniques: decision trees or, if necessary, alternative methods such as logistic regression or support vector machines. In addition, it had to be possible to optimize the models through automated tests of different model parameters. Based on these requirements, we decided to use Python, because it could cover the whole process: cleansing and analyzing the data, building the forecasting model and visualizing the results.
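To make the idea of automated parameter tests concrete, here is a minimal sketch using a decision tree and scikit-learn's `GridSearchCV`. The synthetic data set, the parameter grid and the accuracy metric are assumptions chosen for illustration only; they do not reflect the client's data or the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the KPI table: 300 "companies" with 8 numeric KPIs
# and a binary label (successful / not successful). Purely illustrative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid of model parameters to test automatically.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

# Every parameter combination is evaluated with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
# Accuracy of the best tree on data it has not seen during the search.
test_accuracy = search.score(X_test, y_test)

print("Best parameters:", best_params)
print("Held-out accuracy:", round(test_accuracy, 3))
```

Swapping `DecisionTreeClassifier` for `LogisticRegression` or `SVC` (with a matching parameter grid) would follow the same pattern, which is what makes it practical to compare the alternative methods mentioned above.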
One of Python's main advantages is the support of its large data science community and the countless extensions available. Two extensions in particular proved invaluable during the project: Pandas and Scikit-learn. Pandas provides predefined data structures and functions for data cleansing and analysis. Scikit-learn was used for the forecasting model and includes functionality for optimizing it. Compared with using VBA, this approach saved considerable time and cost and shortened the period needed to realize the model.
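As an illustration of the kind of cleansing involved, the sketch below brings sheets with differing structures into one table using Pandas. The sheet names, column names and values are invented for the example. In a real workflow the dictionary of sheets would typically come from `pd.read_excel(path, sheet_name=None)`, which returns one DataFrame per sheet; here it is built in memory so that the example runs without any Excel file.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(path, sheet_name=None):
# a dict mapping sheet names to DataFrames with inconsistent headers.
sheets = {
    "2015": pd.DataFrame({"Company": ["A", "B"], "Revenue": [100.0, 200.0]}),
    "2016": pd.DataFrame({"company ": ["A", "B"], "revenue": [110.0, None]}),
}


def normalize(frame: pd.DataFrame, sheet_name: str) -> pd.DataFrame:
    """Harmonize column names and record which sheet each row came from."""
    out = frame.rename(columns=lambda c: str(c).strip().lower())
    out["year"] = int(sheet_name)
    return out


# Stack all sheets into a single table with a uniform structure.
combined = pd.concat(
    [normalize(df, name) for name, df in sheets.items()],
    ignore_index=True,
)

# Drop rows where the KPI of interest is missing.
cleaned = combined.dropna(subset=["revenue"])

print(cleaned)
```

The same `normalize` step could be applied in a loop over all files, so that roughly 100 workbooks end up in a single DataFrame ready for analysis.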
The decision on the outcome of the feasibility study for the forecasting model is still pending, and it is not yet clear whether the approach will be pursued. Python, together with its extensions, has nevertheless proved its worth: as the utilization of corporate data increases, managetopia will use it for other upcoming analytics projects.