🏛️ Scalable Distributed Computing & Big Data Research
- Ray (
ray-project/ray): Developed out of the UC Berkeley RISELab, Ray is an open-source unified compute framework. It simplifies scaling data science workloads and training advanced AI models across multi-node clusters.
👉 ray-project/ray GitHub Repository [1, 2] - Dask (
dask/dask): A flexible library for parallel computing in Python. It natively scales standard data science tools like NumPy, Pandas, and Scikit-Learn to multi-core machines or large clusters with minimal code changes.
👉 dask/dask GitHub Repository [1, 2, 3] - Apache Spark (
apache/spark): The industry-standard unified analytics engine for large-scale data processing. It provides high-level APIs in Python (PySpark), SQL, and Scala, featuring a heavily researched query optimizer for massive data pipelines.
👉 apache/spark GitHub Repository [1, 2, 3, 4, 5]