⚡ Local Inference & Deployment Engineering
- llama.cpp (ggerganov/llama.cpp): A legendary open-source project that reimplemented LLM inference in plain C/C++. It lets large GenAI models run locally on standard hardware (consumer laptops, MacBooks, and even Raspberry Pis) by quantizing model weights to low-bit integer formats such as 4-bit and 8-bit.
  👉 ggerganov/llama.cpp GitHub Repository [1, 2]
- vLLM (vllm-project/vllm): A high-throughput, memory-efficient LLM serving engine built for data centers. It implements PagedAttention (inspired by virtual memory paging in operating systems), which stores the attention key-value cache in fixed-size blocks allocated on demand, cutting memory fragmentation and raising serving throughput.
  👉 vllm-project/vllm GitHub Repository [1, 2, 3]
- Ollama (ollama/ollama): A user-friendly wrapper built on top of llama.cpp that packages open-source LLMs into lightweight, portable bundles. It provides a simple setup process and a local REST API for developers.
  👉 ollama/ollama GitHub Repository [1]
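To make the quantization idea behind llama.cpp more concrete, here is a minimal sketch of block-wise symmetric 8-bit quantization in plain Python. It is an illustration only, not llama.cpp's actual code: the block length of 32, the function names, and the list-based storage are all assumptions chosen for readability.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block length for this sketch


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Map one block of floats to int8 codes plus one float scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Scale so the largest magnitude maps to 127; avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Recover approximate floats from the int8 codes."""
    return [scale * c for c in codes]


def quantize(weights: List[float]) -> List[Tuple[float, List[int]]]:
    """Split the weights into blocks and quantize each one independently."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks: List[Tuple[float, List[int]]]) -> List[float]:
    """Concatenate the dequantized blocks back into one list."""
    restored: List[float] = []
    for scale, codes in blocks:
        restored.extend(dequantize_block(scale, codes))
    return restored
```

Storing one scale per small block rather than one per tensor keeps the rounding error tied to the local range of the weights, which is the general reason block-wise schemes tolerate low bit widths.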
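The paging analogy behind vLLM's PagedAttention can be illustrated with a toy block allocator. This is a rough sketch of the bookkeeping idea only, not vLLM's implementation: the class `PagedKVAllocator`, its methods, and the block sizes are hypothetical, and no attention computation is performed.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy allocator: each sequence maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # seq id -> physical block ids
        self.lengths: Dict[str, int] = {}             # seq id -> tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve space for one more token; return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are handed out only as a sequence grows, at most one partially filled block per sequence is ever unused, instead of a large buffer reserved up front for the maximum possible length.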
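As an illustration of the local REST API that Ollama exposes, the following sketch sends a prompt using only the Python standard library. It assumes an Ollama server is running at its default address (`http://localhost:11434`) and that a model has already been pulled; the helper names and the model name used in the usage note are placeholders, not part of Ollama.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for a single, non-streamed completion."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With a server running, a call such as `generate("llama3", "Why is the sky blue?")` would return the model's answer as a string, where `llama3` stands in for whichever model is installed locally.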