
A Guide to Choosing the Right Tools and Technologies for Data Architecture
Feb 27

In the previous post, we explored a framework for translating data requirements into a robust data architecture (If you haven’t had a chance to read it yet, you can check it out [here]). We emphasized the importance of gathering the right requirements, a crucial step not only in data engineering but also in product development and management.
In this post, I’ll focus on selecting the right tools and technologies to build and maintain a sustainable architecture.
Location (On-Premises / Cloud)

On-Premises
- The company owns and maintains the hardware and software for its data stack, including:
  - Provisioning
  - Maintenance
  - Updates
  - Scaling

Cloud
- The cloud provider is responsible for building and maintaining the hardware in its data centers
- You rent the compute and storage resources
- You can easily scale up to meet demand, or scale back down to save on costs when capacity isn't needed
The industry is shifting towards Cloud-based systems due to their flexibility and scalability. However, some companies choose or are required to keep certain data systems on-premises due to business needs, regulations, or security and privacy concerns.
Monolithic vs. Modular Systems

Monolithic
- Self-contained systems made up of tightly coupled components
- Simplicity: one technology and typically one principal programming language
- Easy to reason about and understand
- Hard to maintain: if you need to update one component, you may have to update others as well (oftentimes, a whole application has to be rewritten)

Modular
- Systems consisting of loosely coupled components, a key principle of effective data architecture, as discussed [here]
- Interoperability: data processing tools can integrate easily with others across the data engineering lifecycle. For example, data in Parquet format can be paired with any tool that supports it.
- Flexible & reversible decisions
- Continuous improvement
In software development, the rise of microservices has led to the emergence of truly modular systems. Instead of combining components from multiple services into a single deployable unit, each microservice is deployed independently, allowing for greater flexibility and scalability.
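To make the interoperability point concrete, here is a minimal Python sketch of two loosely coupled components that communicate only through a shared open format. CSV (from the standard library) stands in for Parquet to keep the example dependency-free, and the function names are illustrative:

```python
import csv
import io

# Component A: a producer that writes records to an open, tool-agnostic
# format (CSV here as a stdlib stand-in for Parquet).
def export_orders(orders, fileobj):
    writer = csv.DictWriter(fileobj, fieldnames=["order_id", "amount"])
    writer.writeheader()
    writer.writerows(orders)

# Component B: an independent consumer. It knows nothing about the
# producer's internals -- only the shared format couples them.
def total_revenue(fileobj):
    reader = csv.DictReader(fileobj)
    return sum(float(row["amount"]) for row in reader)

buf = io.StringIO()
export_orders([{"order_id": 1, "amount": 9.5},
               {"order_id": 2, "amount": 0.5}], buf)
buf.seek(0)
print(total_revenue(buf))  # -> 10.0
```

Either side can be replaced (a different producer, a different query engine) without touching the other, because the contract is the file format, not a shared codebase.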
Cost Optimization


Building on-premises data systems typically incurs high capital expenditure (CapEx), whereas the shift to cloud-based systems allows many companies to build with little to no CapEx.
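The CapEx/OpEx trade-off can be sketched with back-of-the-envelope arithmetic. All figures below are illustrative assumptions, not real vendor pricing:

```python
# Simple total-cost-of-ownership (TCO) sketch: up-front CapEx plus
# recurring monthly OpEx over a planning horizon. All dollar amounts
# are illustrative assumptions, not real quotes.
def tco(capex, monthly_opex, months):
    return capex + monthly_opex * months

# On-premises: large hardware purchase up front, lower running costs.
on_prem = tco(capex=120_000, monthly_opex=3_000, months=36)

# Cloud: no up-front hardware, everything is pay-as-you-go OpEx.
cloud = tco(capex=0, monthly_opex=6_500, months=36)

print(f"on-prem 3-yr TCO: ${on_prem:,}")  # -> on-prem 3-yr TCO: $228,000
print(f"cloud   3-yr TCO: ${cloud:,}")    # -> cloud   3-yr TCO: $234,000
```

With these (assumed) numbers the totals land close together; the real difference is that the cloud option requires no capital outlay on day one and can be scaled down, which is exactly why the low-CapEx model is attractive.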


Choosing data stack A means committing to its technologies and excluding others, resulting in the opportunity cost of missing out on alternative stacks. If stack A proves optimal, the opportunity cost is minimal. However, data technologies evolve rapidly, and components of stack A may become obsolete, incurring costs to switch. To minimize total cost of ownership, build flexible, loosely coupled systems that can adapt to changing needs. Separate immutable technologies (e.g., object storage, SQL) from transitory ones (e.g., stream processing, AI) to future-proof your data architecture.
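One way to separate immutable technologies from transitory ones is to write pipeline code against a small, stable contract and keep vendor-specific details behind it. Below is a minimal Python sketch; the `ObjectStore` protocol, `InMemoryStore` backend, and `archive_event` function are hypothetical names for illustration, not a real library API:

```python
from typing import Protocol

# "Immutable" layer: a minimal storage contract. Object storage semantics
# change slowly, so pipeline code depends only on this stable interface.
class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

# "Transitory" layer: concrete backends can be swapped as technology or
# pricing changes, without rewriting any pipeline code.
class InMemoryStore:
    def __init__(self):
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]

def archive_event(store: ObjectStore, event_id: str, payload: bytes) -> None:
    # Written against the contract, not a vendor SDK -- switching stacks
    # later only means providing a new ObjectStore implementation.
    store.put(f"events/{event_id}", payload)

store = InMemoryStore()
archive_event(store, "42", b'{"status": "ok"}')
print(store.get("events/42"))  # -> b'{"status": "ok"}'
```

Swapping stack A for stack B then touches only the backend class, which is the loose coupling that keeps the opportunity cost of an early choice low.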

FinOps (also a key principle of effective data architecture, as discussed [here]) focuses on minimizing data system costs (TCO and TOCO) while maximizing revenue potential. This can be achieved by selecting cloud services with a flexible, pay-as-you-go model and modular options that allow quick iteration and growth.
As data engineers, our job is to provide a positive return on the investment the organization makes in its data systems.
Build vs. Buy
Build Your Own Solution
- Get exactly the solution you need
- Avoid licensing fees
- Avoid being at the mercy of a vendor

Use an Existing Solution
- Open source (community-supported)
- Commercial open source (vendor-supported)
- Proprietary (non-open-source)

Building technologies from scratch when off-the-shelf solutions are available can be akin to reinventing the wheel, i.e., "undifferentiated heavy lifting": a labor-intensive task that likely doesn't provide significant value to the organization.
Server vs. Container vs. Serverless
Server
- You set up and manage the server yourself (e.g., an EC2 instance), which includes:
  - Updating the OS
  - Installing/updating packages
  - Patching software
  - Handling networking, scaling, and security
Container
- A modular unit that packages code and dependencies to run on a server, often managed by an orchestrator (e.g., Kubernetes)
- Lightweight & portable
- You manage only the application code and its dependencies
- Use a container if your application cannot operate within the limits a serverless platform imposes, such as execution frequency, concurrency, or duration
Serverless
- You don't need to set up or maintain the server (e.g., AWS Lambda)
- Automatic scaling
- Availability & fault tolerance
- Pay-as-you-go pricing
- Best for simple, discrete tasks; it can become expensive in a high-event-rate environment. For a more flexible approach, consider creating and deploying a Docker image for your Lambda function. You can learn more about this process [here].
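The "expensive at high event rates" caveat is easy to quantify with a rough break-even calculation. The prices below are illustrative assumptions modeled on typical per-request and per-GB-second serverless billing, not actual vendor quotes:

```python
# Back-of-the-envelope serverless vs. flat-rate server comparison.
# All prices are illustrative assumptions, not real pricing.
PRICE_PER_REQUEST = 0.0000002    # $ per invocation (assumed)
PRICE_PER_GB_SECOND = 0.0000167  # $ per GB-second of compute (assumed)

def serverless_monthly_cost(requests, avg_duration_s, memory_gb):
    compute = requests * avg_duration_s * memory_gb * PRICE_PER_GB_SECOND
    return requests * PRICE_PER_REQUEST + compute

flat_server_cost = 75.0  # $/month for an always-on instance (assumed)

# Same workload shape (200 ms, 512 MB), two very different event rates:
for monthly_requests in (1_000_000, 100_000_000):
    cost = serverless_monthly_cost(monthly_requests, 0.2, 0.5)
    cheaper = "serverless" if cost < flat_server_cost else "server"
    print(f"{monthly_requests:>11,} req/mo -> ${cost:,.2f} ({cheaper} wins)")
```

At a million requests a month the pay-as-you-go model is far cheaper; at a hundred million, the always-on server wins, which is exactly the trade-off behind the "simple & discrete tasks" guidance above.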
About

Benjamin ("Benj") Tabares Jr. is an experienced data practitioner with a strong track record of successfully delivering short- and long-term projects in data engineering, business intelligence, and machine learning. Passionate about solving complex customer challenges, Benj leverages data and technology to create impactful solutions. He collaborates closely with clients and stakeholders to deliver scalable data solutions that unlock business value and drive meaningful insights from data.