(Credits to Steely Dan for the title)
Corios was hired by a rapidly growing bank to build the newest release of their prospect acquisition scorecard model, not once but twice: first in SAS 9.4 (their production environment), and then in a hybrid SAS Viya / open source approach that leveraged Python, Dask and Spark. The reason for the second modeling effort was to explore what an innovative, modern cloud-focused analytic environment could and should look like to support predictive model lifecycle management: authoring, champion/challenger experimentation, validation, version management, cloud deployment, drift analysis and ongoing refresh.
Building the first, traditional model pipeline was familiar territory for us because we had built several models for this client that had been put into production over the past few years. The greatest challenge was the very high bar set for the model’s mathematical performance: we had to beat the current version, which was constructed effectively and exhibited strong performance.
The second model broke a lot of new ground for the client. Major elements included: Amazon Web Services clustered compute, storage, code development and management; Python, Dask and Spark as open source frameworks for analytic pipeline development; side-by-side comparisons for analytics assets built in familiar territory (SAS) and unfamiliar territory (open source frameworks on cloud services); and novel analytics techniques (and their potential performance contributions) that the open source frameworks made available to the bank for the first time in a native, business-critical context.
Anatomy of an Analytic Workload
To help the client think about the new tools and capabilities, we built a simple framework called the “Anatomy of an Analytic Workload” with 11 stages, such as “Extract”, “Filter”, “Explore” and “Deploy”, each consisting of analytic design patterns that are familiar to seasoned analytics professionals. Then we overlaid both the traditional SAS model and the hybrid open source + SAS Viya model on this Anatomy framework and compared the conventional code in the model pipeline with the open source code, stage by stage. This went a long way towards demystifying what our team built in both cases. Such a framework won’t turn a new analyst fresh out of school into a seasoned expert, and it won’t teach you how to write the proper code for each part of the project, but it will give you a roadmap for linking the major building blocks together.
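As a rough illustration of the overlay idea, the sketch below models a workload as an ordered list of named stages, each with a swappable implementation. It is an assumption-laden toy, not the code from the engagement: the `Stage` class, the `run_pipeline` helper, the stage functions and the sample records are all hypothetical, and only three of the stage names mentioned above are used because the full list of 11 is not given here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Stage:
    """One stage of the workload (hypothetical structure for illustration)."""
    name: str                  # stage label, e.g. "Extract"
    run: Callable[[Any], Any]  # implementation for one toolchain


def run_pipeline(stages: List[Stage], payload: Any = None) -> Dict[str, Any]:
    """Run the stages in order and keep each stage's output by stage name."""
    trace: Dict[str, Any] = {}
    for stage in stages:
        payload = stage.run(payload)
        trace[stage.name] = payload
    return trace


# Invented toy records standing in for prospect data.
RECORDS = [
    {"id": 1, "income": 52000, "responded": 1},
    {"id": 2, "income": None, "responded": 0},
    {"id": 3, "income": 87000, "responded": 1},
]


def extract(_: Any) -> List[dict]:
    # A real pipeline would read from a database or object storage here.
    return list(RECORDS)


def filter_rows(rows: List[dict]) -> List[dict]:
    # Drop records with a missing income value.
    return [r for r in rows if r["income"] is not None]


def explore(rows: List[dict]) -> Dict[str, float]:
    # Summarize the filtered data: row count and observed response rate.
    n = len(rows)
    return {"n": n, "response_rate": sum(r["responded"] for r in rows) / n}


stages = [
    Stage("Extract", extract),
    Stage("Filter", filter_rows),
    Stage("Explore", explore),
]
trace = run_pipeline(stages)
for name, output in trace.items():
    print(name, "->", output)
```

Because each stage is just a name plus a callable, a second list with the same stage names and different implementations could be built for another toolchain, and the two traces compared stage by stage.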
Second, when working with open source frameworks, you cannot separate model pipeline development from architecture. Instead, the Data Scientist role needs to work very closely with the DevOps Engineer and the Data Engineer roles to make changes to the architecture environment along the way, not because you can, but often because you have to. The Data Scientist will need to understand design patterns, architecture, scale, APIs, and code and object interactions, and ask the right questions of their peers to adjust their working environment as they encounter challenges with the open source libraries they’re using, the scale of the data they’re working with, and the storage and network implications of working with large data. In contrast, with conventional systems like SAS 9.4, many of the architecture choices are made once (potentially years ago) and are not easily changed, but at least the software platform abstracts many of the architecture, storage and security concerns away from the Data Engineer and the Data Scientist roles.
Advantages of the cloud route
The choice of Amazon Web Services as the cloud infrastructure and services layer gave our team a lot of power and flexibility to make these adaptations along the way. We added new open source libraries to our toolset, changed our data formats and storage patterns, expanded from single compute hosts to large clusters of machines running Spark and Dask workloads, and moved our data back and forth between the open source frameworks and the SAS Viya environment fairly easily.
An important challenge to the data science industry, as we continue to embrace and use open source frameworks, is how to address the needs of model validation and transparency in a regulated business context like banking and financial services.
- Not nearly enough development effort has gone into exposing the internals of the machine learning model’s inference pipeline, whether in open source frameworks like SparkML, DaskML and XGBoost or in vendor-driven platforms like SAS Viya.
- Nearly every fitted model in a pipeline is constructed as a binary object, and only in some cases can the data scientist inspect the properties of those objects at the level to which seasoned professionals are accustomed. I personally opt for the “trust and verify” approach when validating a model over the “trust blindly” approach that is required when working with these newer frameworks.
- There are some useful developments going on in the area of Shapley model profiling and model drift, but it’s our view that some very valuable model performance and validation criteria that have been established for decades are now being re-invented in the open source world, without sufficient attention to ground that was broken long ago.
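To make the last point concrete, here is a minimal sketch of two long-standing scorecard measures: the Kolmogorov-Smirnov (KS) separation statistic and the Population Stability Index (PSI) for drift. Choosing these two is an assumption made for illustration, since no specific criteria are named above. The function names and toy inputs are hypothetical, and the code uses only the Python standard library rather than any of the frameworks discussed.

```python
import math
from typing import Sequence


def ks_statistic(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Largest gap between the cumulative score distributions of
    events (label 1) and non-events (label 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")
    pairs = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume all records tied at the same score before comparing.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_pos += 1
            else:
                cum_neg += 1
            j += 1
        ks = max(ks, abs(cum_pos / n_pos - cum_neg / n_neg))
        i = j
    return ks


def psi(expected_counts: Sequence[int], actual_counts: Sequence[int],
        eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned counts.
    `eps` floors empty bins so the logarithm stays defined."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


# Toy data: scores that separate the two classes perfectly.
ks_value = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
# Toy bin counts: development sample vs. a later scoring sample.
psi_value = psi([50, 50], [40, 60])
print(f"KS = {ks_value:.3f}, PSI = {psi_value:.4f}")
```

Both measures work on nothing more than scores, outcomes and bin counts, so they can be computed on the output of any fitted model, however opaque the model object itself is.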
Interested readers can turn to Chapter 8, “Analytic Model Deployment”, in my 2017 book, “Skate Where the Puck’s Headed”, for more detail.