Fivetran

Fivetran, a cloud-based data integration platform, has emerged as a powerful tool that automates the collection and consolidation of data from diverse sources, streamlining analytics efforts while ensuring data accuracy and reliability.

Today, we find ourselves immersed in a data-driven world where information holds the key to success. However, using the full potential of data requires an efficient and streamlined approach to data integration. Data integration serves as the vital bridge that connects different data sources, ensuring data flows seamlessly and accurately into a central location. Solid data integration enables organizations to combine, analyze, and use data.

Nowadays, organizations need to be agile and adaptable. They need the ability to integrate new data sources seamlessly as their business landscape evolves. Delayed or siloed data can lead to missed opportunities. That's why automation and integration acceleration are key capabilities for improving data-to-insight velocity. By adopting efficient data integration processes, organizations can easily adapt to changing needs, incorporate emerging technologies, and take advantage of evolving data sources, all while ensuring data accuracy and reliability.

What is Fivetran

Fivetran is a cloud-based data integration platform that allows you to move data from various sources into your data platform. It offers 300+ pre-built connectors that let you easily integrate data from cloud applications like SAP, Salesforce, Google Analytics and many more.

Fivetran has 5 types of pre-built connectors to fetch the data:

SaaS (Software as a Service) Applications: it includes connectors for popular SaaS applications like SAP, Salesforce, Google Analytics, etc.
Databases: these include connectors for diverse databases like SQL Server, MySQL, PostgreSQL, etc.
Files: it includes connectors for:
- Flat file sources like SFTP, AWS (Amazon Web Services) S3 buckets, ADLS (Azure Data Lake Storage), etc.
- Magic Folder Connectors: Google Drive, Dropbox, OneDrive, etc.
Events: it includes connectors for loading data from event logs and streams like Webhooks, Azure Event Hubs, etc.
Functions: if you need a certain missing connector, you can use a cloud function to fetch the data and return it to Fivetran.

All this can be achieved without writing a single line of code, giving data engineers and analysts more time to focus on extracting valuable insights.
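The Functions connector type is the one place where code does come into play. The sketch below shows roughly what such a cloud function could look like in Python. The request and response field names (state, insert, schema, hasMore) reflect the function connector contract as we understand it and should be verified against the Fivetran documentation. The fetch_tickets helper and the tickets table are hypothetical stand-ins for a source that has no pre-built connector.

```python
import json


# Hypothetical stand-in for a call to a source without a pre-built connector.
def fetch_tickets(since_id):
    all_tickets = [
        {"id": 1, "subject": "Login issue"},
        {"id": 2, "subject": "Invoice question"},
        {"id": 3, "subject": "Feature request"},
    ]
    return [t for t in all_tickets if t["id"] > since_id]


def handler(request, context=None):
    """Entry point in the style of a cloud function.

    `request` is assumed to be a dict carrying the `state` returned by the
    previous invocation (empty on the first run).
    """
    state = request.get("state") or {}
    last_id = state.get("last_id", 0)

    rows = fetch_tickets(last_id)
    new_last_id = max((r["id"] for r in rows), default=last_id)

    # Assumed response shape: the new cursor, rows to upsert per table,
    # primary keys per table, and a flag that signals further pages.
    return {
        "state": {"last_id": new_last_id},
        "insert": {"tickets": rows},
        "schema": {"tickets": {"primary_key": ["id"]}},
        "hasMore": False,
    }


first = handler({"state": {}})
second = handler({"state": first["state"]})
print(json.dumps(first["state"]), len(first["insert"]["tickets"]), len(second["insert"]["tickets"]))
```

The second call passes back the state from the first, so it returns no rows: everything up to the stored cursor has already been delivered.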

Architecture

At its core, Fivetran revolves around five essential aspects:

1. Data movement

It has automated connectors for sampling, orchestrating and managing data movement from 300+ data sources into destinations like Snowflake, BigQuery, Redshift and many more. The full list can be found in the Fivetran documentation.

It takes care of incremental data replication to ensure only new or updated data is moved instead of full loads. It has built-in scheduling, automation, load balancing and retry logic for handling large volumes of data.

It automatically keeps track of the data that has already been moved. In case of a connection outage, the system knows exactly which data was last inserted successfully and resumes from that sync state.
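To make the idea of incremental replication with a sync state more concrete, here is a minimal sketch in Python. It is not Fivetran's implementation: the updated_at cursor, the Destination class and the simulated outage are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Destination:
    rows: dict = field(default_factory=dict)  # primary key -> row

    def upsert(self, row):
        self.rows[row["id"]] = row


def incremental_sync(source_rows, destination, state, fail_after=None):
    """Copy rows changed since state['cursor'], checkpointing after each row."""
    changed = sorted(
        (r for r in source_rows if r["updated_at"] > state["cursor"]),
        key=lambda r: r["updated_at"],
    )
    for n, row in enumerate(changed):
        if fail_after is not None and n == fail_after:
            raise ConnectionError("simulated outage")
        destination.upsert(row)
        state["cursor"] = row["updated_at"]  # last successfully written change
    return len(changed)


source = [
    {"id": 1, "updated_at": 10, "name": "a"},
    {"id": 2, "updated_at": 20, "name": "b"},
    {"id": 3, "updated_at": 30, "name": "c"},
]
dest, state = Destination(), {"cursor": 0}

try:
    incremental_sync(source, dest, state, fail_after=2)  # outage before the third row
except ConnectionError:
    pass

resumed = incremental_sync(source, dest, state)  # picks up only the missing row
print(state["cursor"], resumed, sorted(dest.rows))
```

Because the cursor only advances after a successful write, the second run moves just the one row that was missed, instead of reloading everything.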

Lastly, it has Change Data Capture (CDC) support for database sources, without the need to write a single line of code.

2. Transformations

It automatically applies pre-built transformations while extracting data to ensure a uniform structure.

It has data type handling, null value handling, join handling, date formatting and string manipulations.

It allows you to create aggregations from row level to summary data for reporting purposes.

If you want to use your own custom SQL-based transformations, these can also be defined for each source.
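As an illustration of what a SQL-based aggregation from row level to summary data can look like, the sketch below uses Python's built-in sqlite3 module as a stand-in for the destination. The orders and monthly_revenue tables are hypothetical, and how such a statement is registered in Fivetran itself is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL, ordered_on TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "acme", 100.0, "2024-01-05"),
        (2, "acme", 50.0, "2024-01-20"),
        (3, "globex", 75.0, "2024-01-11"),
        (4, "acme", 25.0, "2024-02-02"),
    ],
)

# Row-level orders rolled up to one summary row per customer and month.
conn.execute(
    """
    CREATE TABLE monthly_revenue AS
    SELECT customer,
           substr(ordered_on, 1, 7) AS month,
           COUNT(*)                 AS order_count,
           SUM(amount)              AS revenue
    FROM orders
    GROUP BY customer, substr(ordered_on, 1, 7)
    """
)

summary = conn.execute(
    "SELECT customer, month, order_count, revenue "
    "FROM monthly_revenue ORDER BY customer, month"
).fetchall()
for row in summary:
    print(row)
```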

3. Security

Data security is incredibly important when handling any kind of sensitive or confidential information. This is why Fivetran encrypts its data in transit as well as at rest.

For Data in transit:

SSL/TLS encryption: All data transmission is encrypted in transit using TLS (Transport Layer Security) 1.2 or higher.
Access controls: Access to source systems is restricted, based on user roles and permissions. Only authorized connectors can extract data.
Data tunnelling: A dedicated tunnel is set up between the source system and the destination for data transfer.
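To show what "TLS 1.2 or higher" means in practice, here is a small sketch of how a client can refuse older protocol versions using Python's standard ssl module. It illustrates the principle only and makes no claim about how Fivetran configures its own endpoints.

```python
import ssl


def make_tls_context():
    # Default context: certificate verification and hostname checks are on.
    ctx = ssl.create_default_context()
    # Refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


ctx = make_tls_context()
print(ctx.minimum_version.name, ctx.verify_mode.name, ctx.check_hostname)
```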

For Data at rest:

Encryption: Data is stored in an encrypted form at the destination. Fivetran supports encryption schemes like AES-256.
Access controls: Granular access controls, password policies and multi-factor authentication govern access to stored data. Least privilege principles are followed.
Isolation: Destination storage is logically isolated at the account or cluster level for more segmentation.
Auditing: Detailed audit logs monitor access to data and any modification or deletion.
Network security: Firewall rules, restricted inbound network access, VPC (Virtual Private Cloud) service controls supply perimeter security.

Lastly, Fivetran has a comprehensive privacy, security and compliance program, with the following certifications:

SOC 2: Auditing procedure that ensures secure management of data.
Regional DPAs: Contractual commitments to comply with EU and US data protection laws such as GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act).
HIPAA BAA: Fivetran meets HIPAA requirements.
ISO 27001: International standard to review security risks, design and implement controls, and ensure effective ongoing security processes.
PCI DSS Level 1: Business-critical feature for customers who use Fivetran to send cardholder data.

4. Governance

Data governance is crucial for data-driven organizations. Data governance means having roles, policies, processes and metrics to govern data ethically and compliantly throughout its lifecycle. This will maximize the value derived from data whilst minimizing the associated risks.

Fivetran offers the following capabilities that support data governance practices for organizations:

Schema and data drift detection: Changes to source data schemas are detected for proactive monitoring purposes.
Data quality notifications: Warnings are provided on missing data, stale sources and data errors to highlight issues.
Historical data retention: Historical data is kept for a configurable period and not overwritten, to cater to auditing needs.
Access transparency: Admins can see which user accessed what data through audit logs.
Granular access controls: Access to data can be restricted based on user roles.
Compliance support: Security certifications and compliance commitments such as SOC 2, HIPAA and GDPR support regulatory compliance.
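Schema drift detection boils down to comparing the schema seen during the previous sync with the one seen now. A minimal sketch in Python, assuming a schema is represented as a simple column-to-type mapping (a simplification chosen here, not Fivetran's internal format):

```python
def diff_schema(previous, current):
    """Compare two {column: type} mappings and report what changed."""
    added = {c: t for c, t in current.items() if c not in previous}
    removed = {c: t for c, t in previous.items() if c not in current}
    retyped = {
        c: (previous[c], current[c])
        for c in previous.keys() & current.keys()
        if previous[c] != current[c]
    }
    return {"added": added, "removed": removed, "retyped": retyped}


yesterday = {"id": "INTEGER", "email": "TEXT", "age": "INTEGER"}
today = {"id": "INTEGER", "email": "TEXT", "age": "TEXT", "country": "TEXT"}

drift = diff_schema(yesterday, today)
print(drift)
```

A non-empty result is what a monitoring job would turn into a notification.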

5. Platform & Extensibility

You’re able to use and extend the Fivetran platform to save development time and build better products. Fivetran offers you complete API (Application Programming Interface) coverage so that any data integration, configuration or monitoring task that is achievable through the UI (User Interface) dashboard can also be done programmatically via the API for more flexible and scalable automation.

Fivetran has some Out-Of-The-Box integrations:

Airflow: Programmatically author, schedule and monitor Fivetran workflows and customize them to your specific needs.
Terraform: Create, update and improve your deployment using infrastructure as code (IaC) software.
Postman: Explore and test the Fivetran API with a collection of requests. Fivetran provides a template for every endpoint.
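For a feel of what working with the API looks like, the sketch below builds (but deliberately does not send) an authenticated request using only the Python standard library. The base URL, the connector path and the sync_frequency and paused fields are assumptions based on our reading of the Fivetran REST API and should be checked against the official API reference. The key, secret and connector id are placeholders.

```python
import base64
import json
import urllib.request

API_BASE = "https://api.fivetran.com/v1"  # assumed base URL; check the API docs


def build_request(api_key, api_secret, method, path, payload=None):
    """Build (but do not send) a request with HTTP Basic authentication."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(f"{API_BASE}{path}", data=data, method=method)
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Content-Type", "application/json")
    return req


# Placeholder credentials and connector id, for illustration only.
req = build_request(
    "my_key",
    "my_secret",
    "PATCH",
    "/connectors/my_connector_id",
    payload={"sync_frequency": 60, "paused": False},  # assumed field names
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes the call easy to inspect and test before it touches a real account.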

How does it work

First, you authenticate the source. After that, Fivetran takes over. It pulls in the data using micro-batches to make sure the source is not overloaded. Next, it internally normalizes this data using a predefined schema per source; an ERD (Entity Relationship Diagram) is supplied for most SaaS applications. It then creates all the required schemas and tables in the destination. Finally, it writes your data to the destination.
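Pulling data in micro-batches simply means reading a bounded number of rows at a time rather than everything at once. A generic sketch in Python; the batch size of 4 is arbitrary and says nothing about the sizes Fivetran actually uses:

```python
from itertools import islice


def micro_batches(rows, batch_size):
    """Yield lists of at most `batch_size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


source_rows = range(1, 11)  # stand-in for rows streamed from a source
batches = list(micro_batches(source_rows, batch_size=4))
print(batches)
```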

While writing the data to the destination it will keep an internal sync state log which has the last successful sync to ensure data integrity in case of a sync interruption.

It performs an incremental sync according to a predefined schedule, which you can configure from as often as every 5 minutes to as rarely as every 24 hours.

During the sync, it performs the following:

Automatic DML Updates when it detects row changes.
Automatic DDL updates when new columns or tables are added in the source to add these to the destination.
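The two kinds of update can be sketched together: first add any column the destination does not know yet (DDL), then insert or update the rows (DML). The example below uses sqlite3 as a stand-in destination and assumes SQLite 3.24 or newer for the upsert syntax. The customers table and the apply_changes helper are hypothetical and not part of Fivetran.

```python
import sqlite3


def apply_changes(conn, table, key, rows):
    """Add any missing columns (DDL), then upsert the rows (DML).

    Table and column names are interpolated into the SQL, so they must come
    from a trusted source. Each row needs at least one non-key column.
    """
    existing = {info[1] for info in conn.execute(f"PRAGMA table_info({table})")}
    for row in rows:
        for column in row:
            if column not in existing:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {column}")
                existing.add(column)
        columns = ", ".join(row)
        placeholders = ", ".join("?" for _ in row)
        updates = ", ".join(f"{c} = excluded.{c}" for c in row if c != key)
        conn.execute(
            f"INSERT INTO {table} ({columns}) VALUES ({placeholders}) "
            f"ON CONFLICT({key}) DO UPDATE SET {updates}",
            list(row.values()),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
apply_changes(conn, "customers", "id", [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}])

# Next sync: row 1 changed and the source gained a `country` column.
apply_changes(conn, "customers", "id", [{"id": 1, "name": "Ada L.", "country": "UK"}])

result = conn.execute("SELECT id, name, country FROM customers ORDER BY id").fetchall()
print(result)
```

Rows that were not touched in the second sync keep their values, and the new column is simply empty for them.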

This all is configurable through the Fivetran UI without the need to write a single line of code!

All that can be done through the Fivetran UI can be done through the Fivetran API as well for automation purposes!

What if a company does not want to share their data with Fivetran

For certain organizations that deal with highly confidential data like healthcare providers, banks, and insurance companies, there is often hesitation to allow external services to access their sensitive data.

Specifically, they may have concerns that sending personal health records, financial transactions or personally identifiable information (PII), through a third-party vendor, could risk exposure or violation of data protection laws.

For this, Fivetran has a local data processing (LDP) solution. A local data processing hub is installed on a Linux environment at the client, where the data can be processed whilst bypassing the Fivetran UI. It uses a high-volume agent to process the data closer to the source, inside the customer's environment. This agent connects directly to the source database/datastore instead of over the internet. The agent can parallelize the data extraction by using multiple threads and connections, which enables higher throughput.

By using a local data processing hub, the customer has full control over the agent’s access rights and resources like CPU and memory.
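The idea of parallelizing extraction over multiple threads can be sketched as follows. This is a generic illustration, not the agent's actual implementation: the in-memory SOURCE table, the key ranges and the worker count are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source table, split into key ranges that can be read independently.
SOURCE = {i: f"row-{i}" for i in range(1, 101)}


def extract_range(bounds):
    low, high = bounds
    # Stand-in for a query such as: SELECT ... WHERE id BETWEEN low AND high
    return [(i, SOURCE[i]) for i in range(low, high + 1)]


def parallel_extract(ranges, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(extract_range, ranges)  # results keep the input order
        return [row for chunk in chunks for row in chunk]


ranges = [(1, 25), (26, 50), (51, 75), (76, 100)]
rows = parallel_extract(ranges)
print(len(rows), rows[0], rows[-1])
```

The worker count is the knob that ties back to resource control: more workers mean more connections and more CPU and memory on the machine running the agent.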

What’s the added value of Fivetran

It saves engineering time by removing the need to build and maintain complex ETL/ELT pipelines. The connectors and sync jobs are pre-built and supported by Fivetran.
It accelerates the time to insight. Automated, pre-built connectors replicate data faster than manual coding. Optimizing the data pipeline frees up time for actual analysis. This ultimately shortens the time from raw data to meaningful analytics.
Once the connectors are configured, Fivetran continuously checks the source systems for any updates or changes. When detected, it then automatically captures, transforms, and loads the data into the destination of choice. This automation removes the burden of manual data extraction, transformation, and loading (ETL) processes, allowing data engineers to focus on other important tasks such as data modelling and analysis.
Fivetran's platform is built to be scalable and robust, capable of handling large volumes of data and adapting to evolving data sources. It seamlessly integrates with leading data warehouses such as Databricks, Snowflake, Redshift, etc. With its error handling and data validation mechanisms, Fivetran guarantees data accuracy and consistency, even in the face of complex schema changes or data type conversions.
It keeps data integrity. Fivetran uses schema evolution to automatically adapt as source data schemas change over time. Features like backfill, transformations, and enterprise-level SLAs (Service Level Agreements) help ensure analytics are based on high-quality, reliable data.
Real-time and near real-time data integration is another area where Fivetran shines. Leveraging change data capture (CDC) technology, Fivetran captures and delivers data updates as soon as they occur, allowing organizations to make prompt decisions based on the most up-to-date data. Fivetran enables organizations to stay agile and responsive.

Pricing

Consumption-based
First historical load is free
Afterwards, it is charged on MAR (Monthly Active Rows): rows that are new, updated or deleted within the month
- If the same row is updated multiple times, it still counts as 1 (within the same month)
Logarithmic scaling: 10x volume does not mean 10x price
- The higher the volume the smaller the unit price
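The "counts as 1" rule is easiest to see in code. The sketch below counts distinct rows touched per month from a hypothetical change log. It only illustrates the counting idea; the exact rules Fivetran applies, and the prices attached to them, are not modelled here.

```python
from collections import defaultdict


def monthly_active_rows(changes):
    """Count distinct (table, primary key) pairs touched per month."""
    active = defaultdict(set)
    for change in changes:
        month = change["changed_on"][:7]  # "YYYY-MM"
        active[month].add((change["table"], change["pk"]))
    return {month: len(keys) for month, keys in sorted(active.items())}


changes = [
    {"table": "orders", "pk": 1, "op": "insert", "changed_on": "2024-01-03"},
    {"table": "orders", "pk": 1, "op": "update", "changed_on": "2024-01-09"},
    {"table": "orders", "pk": 1, "op": "update", "changed_on": "2024-01-21"},
    {"table": "orders", "pk": 2, "op": "insert", "changed_on": "2024-01-15"},
    {"table": "orders", "pk": 1, "op": "update", "changed_on": "2024-02-02"},
]
mar = monthly_active_rows(changes)
print(mar)
```

Row 1 changes three times in January but is counted once for that month, and once more in February.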

For more pricing information, you can use the following link: Fivetran pricing.

Summary

In conclusion, Fivetran offers a data integration solution that focuses on automating the ELT/ETL process through the 300+ pre-built connectors, centralized data pipelines, and continuous data replication, whilst not needing to write a single line of code.

Fivetran handles many aspects of data integration like scheduling, CDC, load balancing, retry logic, data quality checks and security. This can accelerate the onboarding of new data sources whilst reducing engineering overhead.

The platform covers the core integration needs, but it also supplies APIs, auditing and monitoring functionality which are needed in our modern data stacks.