- Apache Spark A unified analytics engine for large-scale data processing — supporting batch, streaming, SQL, ML, and graph workloads on a single distributed platform.
- Apache Storm Free and open source distributed realtime computation system — reliably processing unbounded streams of data, doing for realtime processing what Hadoop did for batch.
- Trino Fast distributed SQL query engine for big data analytics, querying data where it lives across multiple sources at petabyte scale.
- Apache Hadoop A framework for the distributed processing of large data sets across clusters of computers — the foundational open source platform for scalable big data storage and processing.
- Apache Samza Build stateful applications that process data in real time from multiple sources, including Apache Kafka, with fault tolerance, managed local state, and at-least-once processing guarantees.
- Apache Airflow A platform created by the community to programmatically author, schedule, and monitor workflows; widely used for orchestrating complex data pipelines at scale.
- HPCC Systems A data lake platform for combining different types of data more quickly and easily, offering an integrated, end-to-end solution for big data processing and analytics.
- Delta Lake Open-source storage framework enabling Lakehouse architecture — brings ACID transactions, scalable metadata handling, and unified streaming/batch data processing to the data lake.
- Apache Drill Schema-free SQL query engine for Hadoop, NoSQL, and cloud storage — enabling interactive analysis across diverse data sources without requiring upfront schema definition.
- Apache Druid A real-time database to power modern analytics applications at scale — delivering sub-second queries on streaming and batch data for high-concurrency dashboards and APIs.
- Apache Flink Stateful computations over data streams — a framework for distributed stream and batch data processing with event-time semantics and exactly-once state consistency.
- Apache Hive Data warehouse software that facilitates reading, writing, and managing large datasets in distributed storage using familiar SQL — built on top of Apache Hadoop.
- Apache Hudi Brings transactions, record-level updates/deletes, and change streams to data lakes — enabling incremental processing and near real-time analytics on lakehouse storage.
- Apache Iceberg High-performance open table format for huge analytic tables — brings the reliability and simplicity of SQL tables to big data with schema evolution and time travel.
- Apache Kafka Open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration, and mission-critical real-time applications.
- Apache Kylin Open source distributed analytical data warehouse for big data, providing OLAP capability with sub-second query response over very large datasets through precomputed multidimensional cubes.
- Presto Open source distributed SQL query engine for interactive analytic queries against data sources of all sizes — from gigabytes to petabytes, without moving data.