🏛️ Scalable Distributed Computing & Big Data Research

  • Ray (ray-project/ray): Developed out of the UC Berkeley RISELab, Ray is an open-source unified compute framework. It simplifies scaling data science workloads and training advanced AI models across multi-node clusters.
    👉 ray-project/ray GitHub Repository [12]
  • Dask (dask/dask): A flexible library for parallel computing in Python. Its array and DataFrame collections mirror the NumPy and pandas APIs, and it integrates with scikit-learn, so existing code can scale from multi-core machines to large clusters with minimal changes.
    👉 dask/dask GitHub Repository [123]
  • Apache Spark (apache/spark): A widely used unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python (PySpark), R, and SQL, and its query optimizer (Catalyst) plans and optimizes large data pipelines.
    👉 apache/spark GitHub Repository [12345]
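
All three frameworks above build on the same basic idea: split a dataset into partitions, run independent tasks on each partition, and combine the partial results. The sketch below illustrates that pattern using only the Python standard library. It does not use the Ray, Dask, or Spark APIs, and the function names (`chunked`, `partial_sum_of_squares`, `parallel_sum_of_squares`) are hypothetical, chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal parts."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_of_squares(chunk):
    """Task that runs independently on one partition of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_workers=4):
    """Scatter partitions to workers, then combine the partial results."""
    chunks = chunked(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

A thread pool on one machine stands in here for a cluster scheduler. The frameworks listed above take on the parts this sketch leaves out, such as distributing partitions across nodes, recovering from worker failures, and optimizing the task graph.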
