Hugging Face has introduced SafeCoder, an enterprise-focused code assistant that aims to improve software development efficiency through a secure, self-hosted pair programming solution. SafeCoder is positioned as a comprehensive, security-driven commercial offering, ensuring code remains within the customer's VPC throughout training and inference. Its customer-centric design enables on-premises deployment and ownership of the Code Large Language Model, much like a personalised GitHub Copilot.

Additionally, HuggingFace has partnered with VMware to offer SafeCoder on the VMware Cloud platform. VMware is currently using SafeCoder internally and sharing a blueprint for swift deployment on their infrastructure, ensuring quick time-to-value.

But Why Is SafeCoder Needed?

Code assistants like GitHub Copilot, built on OpenAI Codex, boost productivity. Enterprises can go further by customising LLMs with their own code, as seen with Google's internally trained completion model, which reported a 25-34% acceptance rate for its suggestions. However, using closed-source LLMs for in-house assistants poses security risks, both during training (exposing sensitive code) and inference (potential code leakage).

Hugging Face’s SafeCoder addresses this, allowing proprietary LLMs built on open models, fine-tuned on internal code, without external sharing. It also offers secure, on-premises deployment for code privacy.

From StarCoder to SafeCoder

Back in May, Hugging Face and ServiceNow partnered to develop an open-source language model named StarCoder, designed specifically for code. This initiative, part of the BigCode project, produced StarCoder as an upgraded version of the StarCoderBase model, further trained on 35 billion Python tokens.

In various assessments, StarCoder demonstrated its capabilities on benchmarks, including the HumanEval benchmark for Python. The model's performance surpassed that of larger models like PaLM, LaMDA, and LLaMA. It also proved comparable to, or even better than, closed models such as OpenAI's code-cushman-001, the original Codex model that powered the initial iterations of GitHub Copilot.

The model, with a staggering 15.5 billion parameters, was trained on more than 1 trillion tokens and utilises a context window spanning 8192 tokens. Data from GitHub formed the basis of its creation, encompassing data from over 80 programming languages, Git commits, GitHub issues, and Jupyter notebooks. 
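StarCoder models are trained to support fill-in-the-middle (FIM) completion as well as plain left-to-right generation, which is what lets an assistant complete code between an existing prefix and suffix. A minimal sketch of assembling such a prompt, using the FIM special tokens from the published StarCoder tokenizer (the helper name itself is illustrative):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt from StarCoder's FIM tokens.

    The model is expected to generate the missing middle span
    after the <fim_middle> token.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# Ask the model to fill in the body between the signature and the blank line.
prompt = build_fim_prompt("def add(a, b):\n    return ", "\n")
```

The resulting string would then be passed to the model as an ordinary prompt; the completion, read up to the end-of-text token, is the suggested middle.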

The SafeCoder solution is centred around the StarCoder Code LLMs, developed through the collaborative efforts of Hugging Face, ServiceNow, and the open-source community as part of the BigCode project. These StarCoder models are particularly well-suited for enterprise self-hosting thanks to their strong code completion performance, evidenced by benchmark results and a multilingual code evaluation leaderboard.

These models are optimised for efficient inference, combining a 15B parameter design with code optimisations, Multi-Query Attention to minimise memory usage, and Flash Attention to enable context expansion to 8,192 tokens. They were trained on The Stack, an ethically sourced open dataset containing only commercially permissible licensed code. The dataset included a developer opt-out mechanism from the beginning and underwent thorough PII removal and deduplication.
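Multi-Query Attention reduces memory by sharing a single key/value head across all query heads, so the KV cache shrinks roughly by the number of attention heads. A back-of-the-envelope sketch, using illustrative StarCoder-scale numbers (40 layers, 48 query heads, head dimension 128, fp16) that should be treated as assumptions rather than exact published figures:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size: one K and one V tensor per layer,
    per KV head, per sequence position."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes


# Standard multi-head attention keeps a K/V head per query head (48 here);
# Multi-Query Attention keeps just one shared K/V head.
mha = kv_cache_bytes(layers=40, kv_heads=48, head_dim=128, seq_len=8192)
mqa = kv_cache_bytes(layers=40, kv_heads=1, head_dim=128, seq_len=8192)

print(mha // mqa)  # cache shrinks by the query-head count, 48x here
```

Under these assumed dimensions the full-context MQA cache is on the order of 160 MB versus several gigabytes for the multi-head equivalent, which is what makes long 8,192-token contexts practical at inference time.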

While the initial version of SafeCoder draws from the StarCoder model, a key advantage of building upon open-source models is the adaptability to newer models. SafeCoder might incorporate other similarly permissible open-source models in the future, all built on transparent and ethically sourced datasets for fine-tuning purposes.

Training Method

The SafeCoder model, initially trained on over 80 programming languages and known for its top-tier performance across various benchmarks, customises its code suggestions for customers through an optional training phase. During this phase, the Hugging Face team works directly with the customer to assemble a training code dataset and refine a code generation model through fine-tuning. This is done without exposing proprietary code or sensitive information to external parties or the internet.

The outcome is a model tailored to the customer’s code languages, conventions, and methods. Through this iterative process, SafeCoder customers acquire knowledge about model creation and upkeep, establishing a self-sustaining method that avoids vendor dependency and maintains authority over their AI capabilities.

SafeCoder’s inference capability encompasses diverse hardware selections, including NVIDIA Ampere GPUs, AMD Instinct GPUs, Habana Gaudi2, AWS Inferentia 2, Intel Xeon Sapphire Rapids CPUs, and other options, providing customers with a broad spectrum of choices.

Read more: The Peaks and Pits of Open-Source with Hugging Face

The post After StarCoder, HuggingFace Launched Enterprise Code Assistant SafeCoder appeared first on Analytics India Magazine.