AI Engineering, Open Source Adoption, Risk, Reward and Challenges
Executive Summary
AI engineering and product development have become the new focus of entrepreneurship and innovation in organizations. Compared to the earlier Data Science waves, tapping into the benefits of this rapidly evolving AI phase comes with many more moving parts to take care of. Data, models, security, reputation management, and ethics take the front seat alongside engineering activities. It is high time AI engineering, research, and product owners collaborated with legal, IT, and cybersecurity teams to build products and services that are safe, secure, and free of reputation damage. This article presents case studies and exercises covering some prominent enterprise patterns.
Introduction
Over the past few weeks, we have witnessed free and open-source Large Language Models (LLMs) outperform their commercial counterparts. Open Source has played a vital role since the early days of Data Science and Machine Learning, when the field first drew widespread adoption. Everything, from software frameworks to models and data, is available under one Open-Source license or another. When large enterprises invest in or adopt Open Source, both risks and rewards exist. The rewards include driving new business and cost optimization, while the extreme risks can damage reputation and lead to total loss. In this article, we discuss some of these challenges through use-case scenarios.
Open-Source Models and Risk – A YOLO Model Case Study
YOLO (You Only Look Once) is one of the most famous vision models[1], introduced in 2015. One of the most popular implementations is the Ultralytics YOLO model. Let's dive into the case study.
An organization decided to enhance its product capability with AI, specifically computer vision. The team built a new model using a custom dataset the company owned, curated, and annotated. The model's performance was satisfactory for an alpha version. The team verified and validated the software supply chain to strengthen license and security compliance. To the team's surprise, the legal team flagged the YOLO model as a risk.
The cause was the license: the base implementation of the model and library is released under a strong copyleft license (AGPL), and the terms for commercial use are different, involving a paid subscription from the vendor. This led to a new challenge: establish a partnership or build an alternative model. In both cases, time, effort, and cost are involved at different scales. The pitfall was the assumption about the "open" part of 'Open Source', looking only at the benefits of the model and not at its license terms. As an exercise, think through your organization's process: how could you navigate such a scenario and empower your team to avoid it?
Takeaways from the case study:
- Educate the AI/Data Science team about DevSecOps and legal aspects of the model and data.
- Establish process controls in the CI/CD pipeline for early detection of license and security risks associated with open-source models and frameworks (see the sketch after this list).
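The second takeaway can start as a simple gate in the build pipeline. Below is a minimal sketch, assuming a Python dependency tree; the copyleft markers and the decision to fail the build are illustrative policy choices, not a complete supply chain scanner such as the commercial tools discussed later.

```python
# ci_license_gate.py -- a minimal sketch of a CI/CD license gate.
# The copyleft marker list and exit behavior are illustrative policy choices.
from importlib.metadata import distributions
import sys

# License keywords that should trigger a legal review before release.
COPYLEFT_MARKERS = ("GPL", "AGPL", "LGPL")

def flagged_packages():
    """Yield (package, license) pairs whose license metadata looks copyleft."""
    for dist in distributions():
        license_text = dist.metadata.get("License", "") or ""
        classifiers = " ".join(
            value for key, value in dist.metadata.items() if key == "Classifier"
        )
        combined = f"{license_text} {classifiers}".upper()
        if any(marker in combined for marker in COPYLEFT_MARKERS):
            yield dist.metadata.get("Name", "unknown"), license_text

if __name__ == "__main__":
    hits = list(flagged_packages())
    for name, lic in hits:
        print(f"REVIEW REQUIRED: {name} ({lic})")
    # Fail the pipeline so legal and security can review before release.
    sys.exit(1 if hits else 0)
```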
Open Source Data for AI/ML and Risk – Case Study
In the early days of AI/ML research, open-access datasets were scarce and published by only a handful of institutions. As the field became popular, individuals and organizations started publishing datasets. Let's examine a case study involving an open dataset.
An enterprise team conceived the idea of a layout-aware information extraction system. The objective was to employ computer vision to detect the layout of web pages in various settings. Considering the timeframe and the availability of data, they decided to find an open dataset to prove the idea. The team found one hosted on GitHub.
The first iteration of models looked promising, and the team presented the initial findings to the product owner. The product owner collected the details of the data and shared them with the legal team. The legal team raised a red flag: there was not enough information about how the images were collected, and a significant number of them carried strict copyright notices.
Takeaways from the case study:
- It is fortunate when we have access to open data that matches our use case. However, how the data was collected, the source of the data, and the license terms are just as important.
- The team lacked a process and system to assess data risk in the early stages (a lightweight check is sketched after this list).
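One lightweight control is to require a minimal "data card" before any external dataset enters training. The sketch below assumes a simple JSON metadata file per dataset; the field names, file path, and license allowlist are hypothetical and would need to reflect your legal team's guidance.

```python
# dataset_intake_check.py -- a minimal sketch of a dataset intake check.
# The metadata schema, file path, and license allowlist are hypothetical.
import json
from pathlib import Path

REQUIRED_FIELDS = ("source_url", "collection_method", "license", "copyright_review")
ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0", "ODC-BY-1.0"}  # example allowlist

def check_data_card(card_path: str) -> list[str]:
    """Return a list of problems; an empty list means the dataset may proceed."""
    card = json.loads(Path(card_path).read_text())
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in card]
    if card.get("license") not in ALLOWED_LICENSES:
        problems.append(f"license not on allowlist: {card.get('license')}")
    if card.get("copyright_review") != "approved":
        problems.append("copyright review not approved by legal")
    return problems

if __name__ == "__main__":
    issues = check_data_card("web_layout_dataset/data_card.json")
    for issue in issues:
        print("BLOCKED:", issue)
```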
Open Source Model Adoption and Security Risk – A Case Study
The availability of open source models is a real revolution in AI. Enterprises, academia, not-for-profit organizations, and individuals release these models. While we can enjoy the benefits of AI from such a large volume of contributions, a significant amount of risk comes with it. Let's examine a case study.
An enterprise team decided to adopt an LLM for domain-specific synthetic data generation. Since the commercial off-the-shelf models' results were not promising, the team explored open LLMs. One model, hosted by an individual, was very promising, and the evaluation results were convincing. The team proceeded to host the model as a containerized service in the production environment.
While the DevOps team performed the load test in the test environment, the Enterprise Security team isolated the resources belonging to the application and its security groups. The team was shocked to learn that the service was sending malicious payloads to a server outside the network. Scanning the model revealed malicious code embedded inside the LLM artifact. It was not a documented CVE, but it was discoverable by observing the network while the model was in action. As the environment was not business critical and held no sensitive data, it likely caused no reputation damage to the organization. However, had it gone undetected, it could have been a disaster.
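This class of attack often rides on serialized model files, for example Python pickle payloads that execute code when the model is loaded. The sketch below is a minimal, assumption-laden illustration of a static pre-load check that lists suspicious import opcodes in a pickle-based checkpoint; the module watchlist is illustrative, and purpose-built scanners plus sandboxed network observation remain the more complete answer.

```python
# pickle_opcode_scan.py -- a minimal sketch: statically list suspicious imports
# in a pickle-based model file *without* loading it. Not a complete scanner.
import pickletools

# Top-level modules whose presence in a model checkpoint deserves a closer look.
SUSPICIOUS_MODULES = {"os", "subprocess", "socket", "builtins", "posix", "http"}

def suspicious_imports(path: str) -> list[str]:
    """Return "module callable" strings referenced by the pickle stream."""
    with open(path, "rb") as f:
        data = f.read()
    findings, recent_strings = [], []
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            recent_strings.append(arg)
        elif opcode.name == "GLOBAL":
            findings.append(str(arg))                        # "module callable"
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            findings.append(" ".join(recent_strings[-2:]))   # module + callable
    return [f for f in findings
            if f.split(" ")[0].split(".")[0] in SUSPICIOUS_MODULES]

if __name__ == "__main__":
    for hit in suspicious_imports("downloaded_model.pkl"):
        print("SUSPICIOUS IMPORT:", hit)
```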
Open Source Libraries as Dependencies – A Case Study
For the last decade, we have witnessed a trend of more Free and Open-Source libraries being developed to solve AI/ML use cases. The software's license is one of the key factors to consider when adopting such a library for product or solution development. Let's examine another case study.
An enterprise hosted an offline hackathon to foster adoption and innovation in the AI/ML space. After three months of hard work, the teams submitted models, applications, and frameworks to the jury. The jury picked the best three in each category. A senior engineering leader took an interest in one of the applications, and the team behind it was invited to present it so it could be taken forward.
The solution was designed to help support engineers, who could upload machine fault data to the application to generate graphs and what-if scenarios. Users typically run unsupervised learning on such data, and the application covered that use case too. It was estimated to improve support staff productivity by 30% and cut five to eight work hours. The engineers were happy that their needs were addressed in one place, and they provided feedback for new features and enhancements.
The team secured funding to make it a full-fledged product. The new product owner established DevSecOps and software supply chain analysis as a first step. While the team made steady progress, they discovered that three libraries used in the product were under LGPL. The team worked with legal on the options and found that one library was a hard (statically linked) dependency, while the others were only dynamically linked. They replaced the library in question with a permissively licensed alternative and saved the project. In this case study, the process and the synergy between teams were key to resolving the issue and protecting the team and the company.
Synthetic Data, LLM and AI Risk – An Exercise
LLMs can generate realistic data to supplement AI model development and fine-tuning. However, technical feasibility and practical feasibility are two different aspects. Let’s simulate a case here.
Assume your organization evaluated various LLMs for content generation. The team decided to go with a vendor and to develop its own model later as a go-to-market strategy. The challenge that surfaced later was data availability. The team proposed generating synthetic data with the vendor's LLM, curating it with the support of SMEs, and fine-tuning an open-source LLM, and placed a separate budget request for the activity. If you were in a leadership position as an AI engineering leader, how would you approach a situation like this?
This is an open-ended question, and to answer it, we should go back to the processes and practices of our own organizations.
Enabling the AI Engineering Team to Succeed with Open Source
The benefits of Open Source to AI Engineering are undeniable. However, the risks associated with adoption should not be treated lightly, or they may cause reputation damage. AI will be a highly regulated industry segment within the next five years, if we are not there already. As engineering leaders, we are responsible for assessing and re-evaluating the Open Source AI landscape, along with our people, processes, and technology, as a first step.
Strengthening your DevSecOps against threats evolving from the AI software and model landscape is a critical step in this journey. Software supply chain analysis with traditional systems such as Synopsys Black Duck is a good starting point. There is also a growing landscape of disruptive start-up vendors focusing on AI/ML security; it is worth evaluating and partnering with such organizations to strengthen your security posture.
Data availability for fine-tuning and synthetic data generation are two areas that need focus. Software supply chain analysis tools may not cover the training data and its license aspects. This is a process area where we must work with our legal team to craft a framework that equips and educates our teams. LLMs empower us to generate synthetic data, yet the terms of those LLMs may prevent us from leveraging that data for AI training. It further ties back to our organization's Responsible AI and AI Ethics frameworks and policies.
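One practical step is to record the provenance of every synthetic record so that legal and ethics reviews can trace its origin. Below is a minimal sketch, assuming a generic `generate` callable stands in for whichever vendor or open-source LLM is used; the metadata fields and example values are illustrative.

```python
# synthetic_provenance.py -- a minimal sketch: attach provenance metadata to
# every synthetic record. The `generate` callable and fields are illustrative.
import json
import time
from typing import Callable

def generate_with_provenance(
    generate: Callable[[str], str],   # stand-in for the actual LLM call
    prompt: str,
    model_name: str,
    model_terms_url: str,
    allowed_for_training: bool,
) -> dict:
    """Return the synthetic record together with the facts legal will ask about."""
    return {
        "text": generate(prompt),
        "provenance": {
            "generator_model": model_name,
            "terms_of_service": model_terms_url,
            "allowed_for_training": allowed_for_training,  # per vendor terms review
            "prompt": prompt,
            "generated_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        },
    }

if __name__ == "__main__":
    fake_llm = lambda p: f"synthetic reply to: {p}"   # placeholder generator
    record = generate_with_provenance(
        fake_llm,
        "Summarize a machine fault log.",
        model_name="example-vendor-llm",
        model_terms_url="https://example.com/terms",
        allowed_for_training=False,
    )
    print(json.dumps(record, indent=2))
```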
The key to preventing risk is catching it early. Educating team members with specialized training material, digestible in the context of their day-to-day work, will empower them. For a developer, a course filled with legal jargon may not be interesting; however, an interactive and easy-to-follow course presented from a developer's perspective will be effective. Continuous communication about the AI threat landscape to the developer community in our organization will help build discipline and prevent risk in the early stages.
Cross-organizational collaboration is a key factor in strengthening our AI Engineering. If we treat AI engineering as the highly intellectual work and IT and IT security as mere support organizations, we remain reactive to risk. When IT and IT security are partners in AI engineering and development, proactive risk mitigation becomes possible. Successful organizations provide curated and trusted AI software artifact repositories to support their AI engineers.
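One building block of such a trusted repository is verifying every model artifact against an internal allowlist before it is loaded. The sketch below assumes a simple JSON registry of approved SHA-256 digests; the registry format, file names, and paths are hypothetical.

```python
# verify_artifact.py -- a minimal sketch: check a downloaded model artifact
# against an internal allowlist of approved SHA-256 digests before loading it.
# The registry file name, format, and artifact path are hypothetical.
import hashlib
import json
from pathlib import Path

def sha256_of(path: str) -> str:
    """Stream the file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_approved(path: str, registry_path: str = "approved_artifacts.json") -> bool:
    """True only if the artifact's digest appears in the curated registry."""
    registry = json.loads(Path(registry_path).read_text())
    return sha256_of(path) in set(registry.get("approved_sha256", []))

if __name__ == "__main__":
    artifact = "models/open_llm.safetensors"
    if not is_approved(artifact):
        raise SystemExit(f"{artifact} is not in the curated artifact registry")
    print("artifact verified; safe to load")
```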
Bonus Content
Here are some interesting notes and blogs about the topics discussed here.
HuggingFace models security risk - https://jfrog.com/blog/data-scientists-targeted-by-malicious-hugging-face-ml-models-with-silent-backdoor/
Ultralytics YOLOV8 pricing discussion - https://github.com/orgs/ultralytics/discussions/7440
Reference:
[1] YOLO Model - https://arxiv.org/abs/1506.02640
