What is a federated query and how to do it

A federated query is joining data from two or more disparate data sources. For example, joining data from Salesforce with data from an Oracle database.

Traditionally, querying a database in SQL has been just that - querying a database, as in one single database. The traditional query uses one or more tables from the same database.

While it would be great to have all the data ever needed wrapped up neatly in one database, the reality is that it's just not the case. It's just not possible, practical for plausible to centralize all data. As organizations grow and their data infrastructure becomes more distributed, data engineers, scientists and analysts are increasingly being required to join data and tables which reside in different data sources and might not even exist in the same datacenter, or be controlled by the same entity within the organization.

This kind of "cross-silo"query is referred to as a ‘federated query’. Being able to perform federated queries is critical for data and analytics to scale as data volumes and the demand for data analytics grows.

Jump to O'Reilly book offer - 97 Things Every Data Engineer Should Know

How to write a Federated Query

Consider, for example, a scenario where you have data in a table called ‘clients’ which resides in your local CRM system, and contains part of the information needed to answer a business question. However, the data doesn’t contain all the columns we need--some of this exists in an alternate database from a regional sales office with the id “midwest-clients” and has a table named “client_metrics” that has the missing information required to answer our business question. A federated query that combines the tables from different data sources might look something like this:

SELECT clients.client_ID,
       clients.region,
       clients2.industry,
       clients2.employee_size,
       clients.sales_revenue
 FROM AWS.`midwest-clients`.client_metrics AS clients2, clients
 WHERE clients.client_ID = clients2.client_ID

This is a very simple example of a federated query. In reality, writing and executing a federated query is much more difficult. Here are the top three reasons federated queries are difficult.

Top three reasons federated queries are difficult

Federated queries require knowledge of each of the data sources being queried, such as how to connect, how to join the tables and what columns are in each table and what they mean
It can take hours to write SQL for federated queries, and can result in hundreds of lines of SQL
Your regular SQL query tool or BI solution can't perform federated queries. They work only with one database or data warehouse at a time. That's it!

What's the solution

The traditional approach has been to extract, transform and load (ETL) the data from each source into a single table in a database or data warehouse. That's just not sustainable or fast enough with the way analytics demand and data volumes are growing.

What's needed is a modern data analytics stack that includes these three core capabilities:

A catalog with a searchable view of enterprise wide data
A no-code data preparer and query builder
A federated query engine

The capabilities work together so that one person can easily find, prepare and query data across disparate data silos without needing to move data or have any pre-requisite knowledge the the data before starting.

How does it work

The solution creates a fast and easy workflow for building and executing a federated query that looks like this:

Step1: Search, preview and validate the data is a good fit using the catalog

Data sources are connected to the catalog and are kept up-to-date automatically. Metadata from each data source is used to make each data source easy to understand without needing to be an expert on the data source.

For the ideal solution, data isn't moved to populate and update the catalog and once the best fit data is found in the data catalog it can be can be accessed and used for step two immediately without needing to switch to another tool.

Screenshot of Promethium Data Catalog — Step1: Search, preview and validate the data is a good fit using the catalog

Step 2: Prepare and join the data from each data source using the no-code query builder

Step 2 starts with the data found in Step1 and instead of writing SQL manually the solution allows the user to construct the query, including preparing and joining the tables with a UI driven data map builder, instead of a SQL statement writer.

To boost productivity and shorten the time take, the SQL statement is automatically generated and recommendations for joining tables are provided. At each step of data preparation and joining the tables the results can be previewed in real time.

Screenshot of Promethium no-code federated query builder — Step 2: Prepare and join the data from each data source using the no-code query builder

Step 3: Execute the federated query with the federated query engine

Regular SQL query tools or BI solutions can't perform federated queries, so a fast federated query engine is needed. And that federated query engine needs the ability to make the query available to the tools and people who need it.

Common use cases would be to publish a view for Tableau, Power BI, Qlik, Looker or any other BI or visualization tool, or to access the query for data science with a Jupiter notebook.

See it in action. Request a demonstration

Federated query approach vs the traditional approach

In contrast, the traditional approach is slow, complex and brittle when compared to the federated query approach.

	Federated Query Approach	Traditional Approach
Speed	Fast	Slow
Ad-Hoc Analysis	Yes	No
ETL Needed	No	Yes
Central Repository Needed	No	Yes
Change Management	Easy	Difficult