What is Petals?

Petals introduces a collaborative approach to running large language models (LLMs). It allows users to operate demanding models such as Llama 3.1 (up to 405B parameters), Mixtral (8x22B), Falcon (40B+), and BLOOM (176B) without requiring high-end enterprise hardware. The system operates in a distributed, peer-to-peer manner, similar to BitTorrent. Users load a segment of the desired model onto their machine (compatible with consumer-grade GPUs or Google Colab) and connect to a network where other participants host the remaining parts.
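The idea of splitting a model across participants can be illustrated with a small toy simulation. This is a minimal sketch under stated assumptions, not how Petals is implemented: the names `Peer`, `make_block`, `split_across_peers`, and `run_chain` are hypothetical, each "block" is a trivial function rather than a transformer layer, and real concerns such as networking, peer discovery, and fault tolerance are left out. It only shows that passing hidden states through contiguous slices hosted by different peers gives the same result as running every block on one machine.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical toy model: each "block" is just a function on a list of floats.
Block = Callable[[List[float]], List[float]]


def make_block(index: int) -> Block:
    """Return a stand-in for model block `index` (adds index + 1 to every value)."""
    def block(hidden: List[float]) -> List[float]:
        return [h + index + 1 for h in hidden]
    return block


@dataclass
class Peer:
    """A participant that hosts a contiguous slice of the model's blocks."""
    name: str
    start: int            # index of the first hosted block (inclusive)
    blocks: List[Block]

    @property
    def end(self) -> int:
        """Index one past the last hosted block (exclusive)."""
        return self.start + len(self.blocks)

    def forward(self, hidden: List[float]) -> List[float]:
        """Run the hidden states through this peer's blocks in order."""
        for block in self.blocks:
            hidden = block(hidden)
        return hidden


def split_across_peers(blocks: List[Block], num_peers: int) -> List[Peer]:
    """Assign contiguous, near-equal slices of `blocks` to `num_peers` peers."""
    base, extra = divmod(len(blocks), num_peers)
    peers, start = [], 0
    for i in range(num_peers):
        size = base + (1 if i < extra else 0)
        peers.append(Peer(f"peer-{i}", start, blocks[start:start + size]))
        start += size
    return peers


def run_chain(peers: List[Peer], hidden: List[float]) -> List[float]:
    """Pass hidden states through the peers in block order."""
    for peer in sorted(peers, key=lambda p: p.start):
        hidden = peer.forward(hidden)
    return hidden


all_blocks = [make_block(i) for i in range(8)]
peers = split_across_peers(all_blocks, num_peers=3)
distributed = run_chain(peers, [0.0, 1.0])

# Reference: run every block locally on a single machine.
local = [0.0, 1.0]
for block in all_blocks:
    local = block(local)
```

In this toy setup the eight blocks are divided into slices of three, three, and two, and `distributed` equals `local`. Only the hidden states move between peers, which is what lets each participant hold just a fraction of the model.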

This distributed structure delivers inference speeds suitable for interactive applications such as chatbots, reaching up to 6 tokens per second for Llama 2 (70B). Beyond standard inference, Petals is more flexible than a typical LLM API: it supports various fine-tuning methods and custom sampling techniques, and it lets users execute custom computational paths through the model or inspect its hidden states. Because Petals integrates with PyTorch and 🤗 Transformers, it combines API-like convenience with deep model access and control.
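As an illustration of the Transformers-style interface, the sketch below wraps text generation in a helper function. It assumes the `petals` and `transformers` packages are installed, that the machine has network access, and that a swarm is currently serving the chosen model. The helper name `generate_with_petals` and the default model name are assumptions made for this example; substitute whichever supported model is available to you.

```python
# Assumed default for illustration only; availability on a swarm may vary.
DEFAULT_MODEL = "petals-team/StableBeluga2"


def generate_with_petals(
    prompt: str,
    model_name: str = DEFAULT_MODEL,
    max_new_tokens: int = 5,
) -> str:
    """Generate a short continuation of `prompt` using a distributed model.

    Hypothetical helper: the imports are deferred so that defining this
    function does not require petals to be installed.
    """
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Only a small part of the model is loaded locally; the remaining
    # blocks are served by other participants in the network.
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0])
```

Calling `generate_with_petals("A cat sat")` would return the prompt followed by a few generated tokens. The exact text depends on the model and the sampling settings, so no expected output is shown here.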

Features

Distributed LLM Execution: Runs large models across a network of user devices.
Support for Major LLMs: Compatible with Llama 3.1, Mixtral, Falcon, BLOOM, and others.
Consumer Hardware Compatibility: Operates on consumer-grade GPUs or Google Colab.
Interactive Inference Speed: Delivers speeds suitable for chatbots and interactive apps (e.g., up to 6 tokens/sec for Llama 2 70B).
Advanced Model Control: Allows fine-tuning, custom sampling, custom execution paths, and access to hidden states.
PyTorch & Transformers Integration: Offers flexibility through integration with popular ML frameworks.