Software Engineering for Data Scientists

Machine Learning Vs Software Engineering Differences

Machine learning is stochastic by nature, which is what lets us build systems that keep working on data they have not seen before. This is the key difference from typical deterministic software systems, where the logic is defined up front. In machine learning, the logic is learned, and it can evolve and be modified as we learn more.

Testing and Logging Code

This is a real advantage when the logic is complex. A model lets us codify, and later modify, very complicated logical steps that would be very difficult to express as typical software business logic. The same benefit becomes a huge challenge once machine learning models run in production, because it makes testing very difficult. This is a great blog and resource on different types of tests and how to structure them here. This is another thread on working with 200K worth of spaghetti code.

  • Unlike typical unit tests, tests for a model need to accept stochastic outputs, for example by asserting on a tolerance range rather than an exact value. Even the logs need to be designed and thought through differently.
  • Testing and logging are the two key aspects of maintaining a large-scale, enterprise-quality project.
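
As a rough illustration of the two bullets above, here is a minimal sketch of a tolerance-based test with distribution-aware logging. The `noisy_model` function, the seed, the number of runs, and the tolerance bands are all assumptions made up for this example; a real model and real thresholds would replace them.

```python
import logging
import random
import statistics

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("model_tests")


def noisy_model(x, rng):
    """Hypothetical stand-in for a model: the signal 2*x plus Gaussian noise."""
    return 2.0 * x + rng.gauss(0.0, 0.5)


def mean_prediction(x, n_runs=500, seed=42):
    """Summarise many predictions; the fixed seed keeps the test repeatable."""
    rng = random.Random(seed)
    outputs = [noisy_model(x, rng) for _ in range(n_runs)]
    mean = statistics.fmean(outputs)
    spread = statistics.stdev(outputs)
    # Log a summary of the output distribution, not just a single value.
    logger.info("x=%s n_runs=%d seed=%d mean=%.3f stdev=%.3f", x, n_runs, seed, mean, spread)
    return mean, spread


def test_prediction_is_within_tolerance():
    mean, spread = mean_prediction(3.0)
    # Assert on a tolerance band instead of exact equality.
    assert abs(mean - 6.0) < 0.1
    assert 0.4 < spread < 0.6


test_prediction_is_within_tolerance()
```

Fixing the seed makes a failing run reproducible, while the tolerance band keeps the test from breaking on harmless noise.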

Legacy Systems

By my definition, a legacy system is any piece of software that has been running for a long time and is responsible for a key profitable unit of the business. It would not be maintained if it were not making money, and most of its major developers are either unavailable or can't be reached easily.

Working Effectively with Legacy Code

The suggestions below are some common approaches. Depending on the requirements, you can instead fall back on the plain old cowboy style, where the end customer becomes the tester for the product.

  • Unit Tests: Unit tests tell us what the expected output of the core functionality is.
  • Refactoring: With unit tests in place, we can refactor the code as needed, because we know what the output should look like after each change.
  • Break large classes into smaller ones as part of the refactoring.
  • Reuse similar pieces of code and make the whole thing more readable.
  • Document the stack flow and the logic.
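
The test-then-refactor loop above can be sketched roughly as follows. The pricing function, its discount rules, and the test cases are hypothetical and exist only to show the pattern: pin down what the legacy code does today, then check that the refactored version returns the same values.

```python
def legacy_price(quantity, unit_price, customer_type):
    """Hypothetical legacy function with all rules packed into one block."""
    total = quantity * unit_price
    if customer_type == "gold":
        total = total - total * 0.2
    elif customer_type == "silver":
        total = total - total * 0.1
    if quantity > 100:
        total = total - 50
    return round(total, 2)


# Refactored version: the same rules, split into small named pieces.
DISCOUNTS = {"gold": 0.2, "silver": 0.1}
BULK_THRESHOLD = 100
BULK_REBATE = 50


def customer_discount(total, customer_type):
    """Discount amount for the customer type (zero for unknown types)."""
    return total * DISCOUNTS.get(customer_type, 0.0)


def bulk_rebate(quantity):
    """Flat rebate applied to orders above the bulk threshold."""
    return BULK_REBATE if quantity > BULK_THRESHOLD else 0


def refactored_price(quantity, unit_price, customer_type):
    total = quantity * unit_price
    total -= customer_discount(total, customer_type)
    total -= bulk_rebate(quantity)
    return round(total, 2)


# Inputs chosen to exercise every branch of the legacy code.
CASES = [(1, 9.99, "gold"), (10, 5.0, "silver"), (150, 2.5, "bronze"), (200, 1.0, "gold")]


def test_refactor_preserves_behaviour():
    for case in CASES:
        # The legacy output is treated as the expected value.
        assert refactored_price(*case) == legacy_price(*case), case


test_refactor_preserves_behaviour()
```

The legacy function stays in place until the test passes, so the old behaviour remains the reference throughout the refactoring.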

Production Machine Learning 101

The basic requirement to get started on the production side is a specific problem statement along with a model that is ready. That means we know:

  • The features to use
  • The model to use
  • The metric we are using to measure the model

Databases and Pipelines

  • Databases with data pipelines and scheduled updates.
  • Ensure the tables are indexed and have keys (primary and foreign).
  • A specific modelling table with the necessary features. This allows us to pull the data into RAM (a pandas DataFrame) and start modelling.
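
A minimal sketch of these points, using an in-memory SQLite database as a stand-in for the real database. The table names, columns, and rows are invented for illustration. In practice the final query could be loaded into a DataFrame with `pandas.read_sql_query`; plain dictionaries are used here to keep the sketch dependency-free.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical source tables with primary keys, a foreign key, and an index.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    segment     TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "retail"), (2, "enterprise")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 20.0), (11, 1, 30.0), (12, 2, 500.0)],
)

# One modelling table holding exactly the features the model needs.
conn.execute("""
CREATE TABLE modelling_features AS
SELECT c.customer_id,
       c.segment,
       COUNT(o.order_id) AS n_orders,
       SUM(o.amount)     AS total_amount
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment
""")

# Pull the modelling table into memory, one dict per row.
conn.row_factory = sqlite3.Row
rows = [dict(r) for r in conn.execute("SELECT * FROM modelling_features ORDER BY customer_id")]
print(rows)
```

A scheduled pipeline would rebuild or refresh `modelling_features`, so the modelling code only ever reads from one table.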

Servers & Frameworks

In this context, the server is an EC2 instance, with RDS (OLAP) for the databases. The Cookie Cutter framework is useful for structuring the workflows. It splits the entire flow into:

  • Data: Temporary data sets can be saved as CSV files.
  • Features: On-the-fly feature generation.
  • Modelling: The modelling modules go here, including any unsupervised steps such as clustering.
  • Visualization: An extra module to produce graphs and any other outputs from the generated reports.
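
The four-way split can be sketched as a small scaffolding script. The directory names below simply mirror the list above and are an assumption for illustration; they are not necessarily the exact layout the framework generates.

```python
import tempfile
from pathlib import Path

# Hypothetical stage folders, named after the four parts of the flow.
STAGES = {
    "data": "temporary data sets saved as CSV files",
    "features": "on-the-fly feature generation",
    "modelling": "model training, including unsupervised steps such as clustering",
    "visualization": "graphs and other outputs built from the reports",
}


def scaffold(root):
    """Create one folder per stage, each with a short note on its purpose."""
    root = Path(root)
    for name, purpose in STAGES.items():
        stage_dir = root / name
        stage_dir.mkdir(parents=True, exist_ok=True)
        (stage_dir / "README.txt").write_text(purpose + "\n")
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Build the skeleton in a throwaway directory for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    created = scaffold(tmp)

print(created)
```

Keeping each stage in its own folder makes it easier to rerun one step, such as feature generation, without touching the others.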

Machine Learning Operations (MLOps)

This is everything to do with the machine learning system once it’s live and in production. MLOps can be defined as processes, tools and code architecture designed to maintain and improve a machine learning system.

  • The outputs can vary. How do you test for this?
  • How do you catch drift in the data and the need for new learning?
  • How often do you retrain and update the model parameters?
  • How do you monitor model performance?
  • How do you quickly diagnose issues in system performance?
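
One possible way to approach the drift question is the Population Stability Index (PSI), which compares the distribution of a feature at training time with what the live system sees. This is a minimal sketch, not the only option: the bin count, the synthetic data, and the 0.1 and 0.25 thresholds (a commonly quoted rule of thumb) are assumptions for illustration.

```python
import math
import random


def psi(reference, current, n_bins=10):
    """Population Stability Index between two numeric samples.

    Bins are equal-width and derived from the reference sample.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Synthetic data: one live sample like the training data, one with a shifted mean.
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
live_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
live_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

RETRAIN_THRESHOLD = 0.25  # assumed cut-off; tune per feature and per system

for name, sample in [("same", live_same), ("shifted", live_shifted)]:
    score = psi(training, sample)
    print(f"{name}: psi={score:.3f} retrain={score > RETRAIN_THRESHOLD}")
```

Running a check like this on a schedule gives one signal for when to retrain, alongside direct monitoring of the model metric.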
