Tools: AWS Athena Federation, Starburst, Trino

Exploring the Modern Query Federation Stack: AWS Athena Federation, Starburst, and Trino

As data infrastructures grow increasingly decentralized, the ability to query data across diverse sources has become a cornerstone of modern analytics. Tools like AWS Athena Federation, Starburst, and Trino enable teams to perform federated queries, analyzing data without moving it. This post explores how these systems compare, integrate, and evolve within the broader data engineering landscape of 2025.

1. The Rise of Federated Query Engines

In a world where data resides across lakes, warehouses, SaaS platforms, and operational databases, traditional ETL approaches struggle to keep pace. Federated query engines bridge that gap by providing a single SQL interface across multiple backends. Instead of centralizing data, they bring computation to where the data lives.

Key benefits of federation include:

  • Reduced data movement: Minimize data duplication and cost of transfers.
  • Unified access: Analysts can query across S3, PostgreSQL, Snowflake, and even APIs from one endpoint.
  • Governance-friendly: Data stays in its source domain, aligning with modern data mesh principles.

In 2025, the federation stack has consolidated around a few powerful engines: Trino (and Starburst, the commercial distribution built on it) and cloud-native integrations like AWS Athena Federation.

2. AWS Athena Federation Overview

AWS Athena started as a serverless SQL query service for Amazon S3, originally built on Presto; current engine versions are based on Trino. Federation expanded Athena's scope to query data across multiple AWS and external systems without ETL. Using Athena connectors, engineers can query data from sources like:

  • Amazon RDS and Aurora (MySQL, PostgreSQL)
  • Amazon Redshift
  • DynamoDB
  • Google BigQuery and Snowflake (via custom connectors)
  • On-premises JDBC-compatible databases

Architecture Diagram (Textual)

┌──────────────────────────────────────────┐
│            AWS Athena Client             │
│   (SQL query via console, API, or SDK)   │
└──────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│      Athena Federated Query Engine       │
│      (Presto/Trino runtime on AWS)       │
└──────────────────────────────────────────┘
                     │
                     ▼
┌──────────────────────────────────────────┐
│ AWS Lambda Connectors (Federation Layer) │
│   • S3  • RDS  • Redshift  • DynamoDB    │
│       • Snowflake  • API Endpoints       │
└──────────────────────────────────────────┘
                     │
                     ▼
       Data Sources Queried in Place

The Lambda-based connector model is particularly elegant: each data source has a small, stateless connector deployed as a Lambda function. When a query runs, Athena invokes these functions in parallel, federating results back to the engine. This model offers scalability and low operational overhead.
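The parallel invocation described above can be illustrated with a small, local sketch. It is not the Athena connector SDK: the in-memory `SOURCES` dictionary and the `invoke_connector` and `federated_scan` helpers are hypothetical stand-ins, and threads take the place of Lambda invocations.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "sources"; a real connector would read from RDS, DynamoDB, etc.
SOURCES = {
    "rds": [{"customer_id": 1, "region": "eu"}, {"customer_id": 2, "region": "us"}],
    "dynamodb": [{"customer_id": 1, "clicks": 14}, {"customer_id": 3, "clicks": 5}],
}

def invoke_connector(source_name, predicate):
    """Stand-in for one stateless connector call: read rows and apply the filter."""
    return [row for row in SOURCES[source_name] if predicate(row)]

def federated_scan(requests):
    """Invoke one connector per source in parallel and collect results by source."""
    with ThreadPoolExecutor(max_workers=len(requests)) as pool:
        futures = {
            name: pool.submit(invoke_connector, name, predicate)
            for name, predicate in requests.items()
        }
        return {name: future.result() for name, future in futures.items()}

results = federated_scan({
    "rds": lambda row: row["region"] == "eu",
    "dynamodb": lambda row: row["clicks"] > 10,
})
print(results)
```

Because each call keeps no state between invocations, the engine is free to fan out as many calls as the query needs and to retry any one of them independently.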

Strengths

  • Completely serverless (no cluster management).
  • Tight integration with AWS IAM and Glue Data Catalog.
  • Supports custom connectors for third-party data sources.

Limitations

  • Limited optimization for cross-source joins compared to native Trino clusters.
  • Connector cold-start latency (due to AWS Lambda).
  • Less control over execution tuning (since AWS manages runtime).

3. Trino: The Open Engine Behind It All

Trino is the open-source distributed SQL query engine formerly known as PrestoSQL, a fork of the original Presto project that was renamed Trino in late 2020. It allows querying data from multiple systems using connectors and executes queries across clusters using massively parallel processing (MPP). Its architecture is designed for high-performance analytics over federated and large-scale datasets.

Core Architecture

┌────────────────────────┐
│         Client         │
│   (CLI / BI / JDBC)    │
└────────────────────────┘
            │
            ▼
┌────────────────────────┐
│      Coordinator       │
│   Parses & optimizes   │
└────────────────────────┘
            │
            ▼
┌────────────────────────┐
│        Workers         │
│  Execute split tasks   │
└────────────────────────┘
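
The coordinator and worker roles in the diagram can be sketched in a few lines. This is a toy model rather than Trino's scheduler: `plan_splits`, `run_split`, and `execute_avg` are hypothetical names, and a thread pool stands in for the worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def plan_splits(rows, split_size):
    """Coordinator step: divide the input into independent splits."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]

def run_split(split):
    """Worker step: compute a partial aggregate (sum and count) for one split."""
    return sum(split), len(split)

def execute_avg(rows, split_size=3, workers=4):
    """Coordinator step: schedule splits on workers, then merge the partial results."""
    splits = plan_splits(rows, split_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(run_split, splits))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count

order_totals = [10, 20, 30, 40, 50, 60, 70]
avg = execute_avg(order_totals)
print(avg)
```

The average is computed from per-split sums and counts rather than per-split averages, which is what lets the final merge stay correct when splits have different sizes.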

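Trino excels at pushing computation, such as filters and projections, down to source systems and parallelizing the remaining work across workers. It supports dozens of connectors out of the box, including Hive, Iceberg, and Delta Lake (for data in object stores such as S3), as well as Cassandra, Kafka, MySQL, PostgreSQL, and Elasticsearch.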
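
To make the pushdown idea concrete, here is a minimal sketch that compares how many rows leave the source with and without it. A throwaway in-memory SQLite database plays the role of the remote source, and the function names are hypothetical; a real engine decides what to push down per connector.

```python
import sqlite3

# A throwaway SQLite database stands in for a remote source such as MySQL.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, order_total REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-11-03", 40.0), (2, "2025-02-14", 75.5), (3, "2025-03-01", 12.0)],
)

def scan_without_pushdown(cutoff):
    """Fetch every row from the source, then filter inside the engine."""
    fetched = source.execute(
        "SELECT order_id, order_date, order_total FROM orders"
    ).fetchall()
    kept = [row for row in fetched if row[1] > cutoff]
    return len(fetched), kept

def scan_with_pushdown(cutoff):
    """Send the filter to the source so only matching rows are transferred."""
    fetched = source.execute(
        "SELECT order_id, order_date, order_total FROM orders WHERE order_date > ?",
        (cutoff,),
    ).fetchall()
    return len(fetched), fetched

rows_moved_plain, result_plain = scan_without_pushdown("2025-01-01")
rows_moved_pushed, result_pushed = scan_with_pushdown("2025-01-01")
print(rows_moved_plain, rows_moved_pushed)
```

Both paths return the same rows; the difference is only in how much data crosses the boundary between source and engine, which is where most federated query cost tends to sit.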

Example: Querying Across Sources

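-- mysql and s3 are catalog names configured on the Trino cluster;
-- s3 is assumed here to be a catalog over files in S3 (for example, via the Hive connector).
SELECT c.customer_id, o.order_total
FROM mysql.sales.customers c
JOIN s3.orders_data.orders o
  ON c.customer_id = o.customer_id
WHERE o.order_date > DATE '2025-01-01';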

This query demonstrates Trino's federated capability: joining customer data in MySQL with order data in S3 seamlessly, using a single SQL interface.
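
For readers without a Trino cluster at hand, the shape of that query can be tried locally. The sketch below is an emulation only: two separate in-memory SQLite databases, attached under the made-up names `mysql_sales` and `s3_orders_data`, stand in for the two catalogs, and the sample rows are invented for illustration.

```python
import sqlite3

# Emulation only: two independent in-memory databases stand in for the two catalogs.
engine = sqlite3.connect(":memory:")
engine.execute("ATTACH DATABASE ':memory:' AS mysql_sales")
engine.execute("ATTACH DATABASE ':memory:' AS s3_orders_data")

engine.execute("CREATE TABLE mysql_sales.customers (customer_id INTEGER, name TEXT)")
engine.execute(
    "CREATE TABLE s3_orders_data.orders "
    "(customer_id INTEGER, order_total REAL, order_date TEXT)"
)
engine.executemany(
    "INSERT INTO mysql_sales.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Lin")],
)
engine.executemany(
    "INSERT INTO s3_orders_data.orders VALUES (?, ?, ?)",
    [(1, 120.0, "2025-02-01"), (2, 35.0, "2024-12-15"), (2, 80.0, "2025-03-10")],
)

# One SQL statement joins tables that live in different databases.
rows = engine.execute(
    """
    SELECT c.customer_id, o.order_total
    FROM mysql_sales.customers c
    JOIN s3_orders_data.orders o
      ON c.customer_id = o.customer_id
    WHERE o.order_date > '2025-01-01'
    ORDER BY c.customer_id
    """
).fetchall()
print(rows)
```

The analogy is loose, since SQLite runs everything in one process, but it shows the key property: the query names each source explicitly and the join happens in the engine, not in either source.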

Adoption and Ecosystem

Trino is now one of the most popular open data query engines, widely adopted by Netflix, LinkedIn, Shopify, and DoorDash. Its performance and flexibility make it ideal for organizations pursuing lakehouse or data mesh architectures.

Popular integrations include:

  • dbt-trino: dbt adapter for data transformations.
  • Trino Gateway: Load-balancing multiple Trino clusters.
  • Starburst Galaxy: Managed cloud offering built on Trino.

4. Starburst: Enterprise Trino on Steroids

Starburst, whose team includes the original creators of Presto/Trino, offers a commercial, enterprise-grade distribution of Trino. It adds performance optimization, data governance, and enterprise security on top of the open-source base.

In 2025, Starburst's products include:

  • Starburst Galaxy: Fully managed Trino clusters in AWS, Azure, and GCP.
  • Starburst Enterprise: Self-hosted Trino with enhanced caching and cost governance.
  • Gravity: Built-in catalog for unified metadata management.

Enterprise Features

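Feature            | Starburst                                        | Trino OSS
-------------------+--------------------------------------------------+-------------------------------------------------
Cluster Management | Automatic scaling and provisioning               | Manual deployment
Data Caching       | Smart caching layer with local spill             | Limited (file system caching for object storage)
Security           | Fine-grained access control, SSO, and audit logs | Authentication plus file-based access control
Cost Governance    | Query monitoring and budgeting                   | Limited via external tools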

Starburst integrates natively with Apache Ranger, AWS Lake Formation, and Okta for enterprise-grade access management, making it a top choice for regulated industries like finance and healthcare.

Performance Enhancements

Starburst's smart query routing and data locality optimization are aimed at reducing latency when federating across heterogeneous sources. Its cost-based optimizer (CBO), which builds on the one in open-source Trino, uses table statistics to compare multiple execution plans, reducing scan time and improving join efficiency.
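
As a rough illustration of what comparing execution plans means, the sketch below enumerates join orders and keeps the cheapest under a toy cost model. It is not Starburst's or Trino's optimizer: the row counts in `TABLE_ROWS`, the fixed `selectivity`, and the cost formula are all assumptions chosen to keep the example small.

```python
from itertools import permutations

# Assumed row-count statistics; a real optimizer would read these from the catalogs.
TABLE_ROWS = {"customers": 1_000, "orders": 500_000, "regions": 20}

def estimate_cost(join_order, selectivity=0.001):
    """Toy cost: sum of estimated intermediate result sizes for a left-deep join."""
    intermediate = TABLE_ROWS[join_order[0]]
    cost = 0
    for table in join_order[1:]:
        intermediate = intermediate * TABLE_ROWS[table] * selectivity
        cost += intermediate
    return cost

def choose_plan(tables):
    """Enumerate candidate join orders and keep the cheapest one."""
    return min(permutations(tables), key=estimate_cost)

best = choose_plan(["orders", "customers", "regions"])
print(best, estimate_cost(best))
```

Under these assumed statistics, the cheapest orders join the two small tables first and bring in the large `orders` table last, which is the general intuition behind statistics-driven join reordering.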

5. Comparative Overview

Aspect                | AWS Athena Federation    | Trino                          | Starburst
----------------------+--------------------------+--------------------------------+------------------------------------------
Deployment            | Serverless (AWS-managed) | Self-managed (open source)     | Managed or enterprise-deployed
Connectors            | Limited, AWS-focused     | Extensive (50+)                | Extended (optimized enterprise connectors)
Performance           | Moderate (Lambda-based)  | High (MPP architecture)        | Very high (optimized CBO + caching)
Security & Governance | IAM integration          | Basic roles                    | Advanced (Ranger, SSO, audit)
Use Case Fit          | Quick AWS analytics      | Cross-platform, open analytics | Enterprise data federation

6. When to Use Each

  • Use Athena Federation when you want to query multiple AWS-native sources with minimal setup. It's perfect for ad-hoc analytics, cloud cost optimization, or quick joins across S3 and RDS.
  • Use Trino when you need flexibility, control, and high throughput across diverse data ecosystems. It is ideal for data lakehouse implementations.
  • Use Starburst when enterprise governance, compliance, and performance tuning are mission-critical. Large-scale organizations like Comcast and Goldman Sachs use Starburst to power federated BI and self-service analytics.

7. Future Trends: Beyond Federation

Federation is evolving into a broader vision of unified data access. In 2025, leading vendors are integrating AI-driven query planning and cost-based optimizers that dynamically adjust query paths based on source latency and cost metrics. Expect closer integration with data catalogs (like DataHub and Amundsen) and governance layers to provide end-to-end lineage.

Meanwhile, the open-source community continues pushing Trino forward. The Iceberg and Delta Lake connectors, which work with transactional table formats and support row-level updates, deletes, and merges, have blurred the lines between federated and transactional queries.

8. Example: End-to-End Federation Architecture

┌───────────────────────────────┐
│           BI Tools            │
│  (Tableau, Looker, Superset)  │
└──────────────┬────────────────┘
               │ JDBC/ODBC
               ▼
┌───────────────────────────────┐
│    Trino / Starburst Layer    │
│    Federated Query Engine     │
└──────────────┬────────────────┘
               │ Connectors
               ▼
┌───────────────────────────────┐
│ Data Sources:                 │
│   • S3 / Lakehouse            │
│   • Snowflake / BigQuery      │
│   • MySQL / PostgreSQL        │
│   • Kafka Streams             │
└───────────────────────────────┘

This architecture exemplifies how modern analytics teams can unify access to hybrid data environments while keeping governance centralized.

Conclusion

Federated query engines represent the future of cloud-scale analytics. AWS Athena Federation offers simplicity, Trino provides flexibility and speed, and Starburst delivers enterprise-grade control and optimization. Together, they define a mature ecosystem where engineers can query anything, anywhere, using the language of SQL, without sacrificing performance or compliance.

Whether you are designing a lakehouse, data mesh, or unified analytics layer, these tools provide the foundation for a federated future that prioritizes accessibility, governance, and cost efficiency.