Big Data Analytics (BDA) Platforms and Tools

A curated directory of 17 Big Data Analytics platforms and tools — covering distributed processing engines, real-time streaming frameworks, SQL query engines, and open table formats for the modern data lakehouse.

  1. Apache Spark: A unified analytics engine for large-scale data processing, supporting batch, streaming, SQL, ML, and graph workloads on a single distributed platform.
  2. Apache Storm: A free and open-source distributed real-time computation system that reliably processes unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.
  3. Trino: A fast distributed SQL query engine for big data analytics that queries data where it lives, across multiple sources, at petabyte scale.
  4. Apache Hadoop: A framework for the distributed processing of large data sets across clusters of computers, and the foundational open-source platform for scalable big data storage and processing.
  5. Apache Samza: A framework for building stateful applications that process data in real time from multiple sources, including Apache Kafka, with fault tolerance, state management, and at-least-once processing guarantees.
  6. Apache Airflow: A community-created platform to programmatically author, schedule, and monitor workflows, widely used to orchestrate complex data pipelines at scale.
  7. HPCC Systems: A data lake platform for combining different types of data more quickly and easily, offering an integrated, end-to-end solution for big data processing and analytics.
  8. Delta Lake: An open-source storage framework that enables lakehouse architecture, bringing ACID transactions, scalable metadata handling, and unified streaming and batch data processing to the data lake.
  9. Apache Drill: A schema-free SQL query engine for Hadoop, NoSQL, and cloud storage that enables interactive analysis across diverse data sources without requiring upfront schema definitions.
  10. Apache Druid: A real-time analytics database for powering analytics applications at scale, delivering sub-second queries on streaming and batch data for high-concurrency dashboards and APIs.
  11. Apache Flink: A framework for stateful computations over data streams, supporting distributed stream and batch processing with event-time semantics and exactly-once state consistency.
  12. Apache Hive: Data warehouse software, built on top of Apache Hadoop, that facilitates reading, writing, and managing large datasets in distributed storage using familiar SQL.
  13. Apache Hudi: A data lake platform that brings transactions, record-level updates and deletes, and change streams to data lakes, enabling incremental processing and near real-time analytics on lakehouse storage.
  14. Apache Iceberg: A high-performance open table format for huge analytic tables that brings the reliability and simplicity of SQL tables to big data, with schema evolution and time travel.
  15. Apache Kafka: An open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical real-time applications.
  16. Apache Kylin: An open-source distributed analytical data warehouse for big data, providing OLAP capability with sub-second query response over trillions of rows.
  17. Presto: An open-source distributed SQL query engine for interactive analytic queries against data sources of all sizes, from gigabytes to petabytes, without moving data.