I've been thinking a lot recently about how best to approach SRE work for machine learning systems. Many of the key SLO/SLA processes just don't work well in an ML space; the main issue is how to deal with rolling out models. As part of this, I've been thinking about STPA.
System-Theoretic Process Analysis (STPA) is a hazard analysis technique for identifying and preventing hazards in complex systems. It can be a useful tool for making model rollouts in ML systems more reliable. Here's how you could use STPA to help with this:
- Identify Hazards: Use STPA to identify potential hazards that could arise while rolling out new models, such as model performance degradation, data pipeline failures, or incorrect model configurations.
- Identify Control Actions: Identify control actions that could mitigate or prevent the identified hazards. These control actions could include automated testing, canary releases, or monitoring tools.
- Evaluate Control Actions: Evaluate the effectiveness of the identified control actions in mitigating or preventing the identified hazards. Use STPA to analyze the interactions between the control actions and the system to ensure that they are effective.
- Implement Control Actions: Implement the identified control actions and monitor their effectiveness. Use STPA to continuously evaluate and improve the control actions to ensure that they are effective in preventing hazards.
- Train Personnel: Train personnel on the identified control actions and on their role in preventing hazards during model rollouts. Use STPA to check that the training is adequate and that what people actually do lines up with the identified control actions.
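
To make the first few steps a bit more concrete, here's a rough sketch in Python of how you might record hazards and control actions and then check which hazards have nothing mapped to them. Everything here (the `ControlAction` and `RolloutAnalysis` classes, the `H1`..`H3` ids, the mapping itself) is made up for illustration; it isn't part of STPA or any library, just one possible way to keep the bookkeeping honest.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ControlAction:
    """A mitigation applied during rollout (hypothetical structure)."""
    name: str
    mitigates: frozenset  # ids of the hazards this action is meant to address


@dataclass
class RolloutAnalysis:
    """Illustrative bookkeeping for hazards and their control actions."""
    hazards: dict = field(default_factory=dict)    # hazard id -> description
    controls: list = field(default_factory=list)   # list of ControlAction

    def uncovered_hazards(self):
        """Return the ids of hazards that no control action claims to mitigate."""
        covered = set()
        for control in self.controls:
            covered |= control.mitigates
        return sorted(set(self.hazards) - covered)


# Assumed example data, using the hazards and control actions listed above.
analysis = RolloutAnalysis(
    hazards={
        "H1": "model performance degradation",
        "H2": "data pipeline failure",
        "H3": "incorrect model configuration",
    },
    controls=[
        ControlAction("canary release", frozenset({"H1"})),
        ControlAction("automated testing", frozenset({"H1", "H3"})),
    ],
)

print(analysis.uncovered_hazards())  # ['H2']
```

In this made-up example the data pipeline hazard has no control action yet, which is the kind of gap the evaluation step is supposed to surface. A check like this only tells you a control exists on paper, not that it works, so it complements rather than replaces the evaluation and monitoring steps.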
By using STPA to identify and prevent hazards during model rollouts, you can improve the reliability of ML systems and reduce the risk of performance degradation or other issues.