Real Time Data Integration

Abhishek Bhagat
Better Data Platforms
8 min read · Nov 16, 2020


Photo from Google Images

There are around 500 million tweets sent each day; that's roughly 6,000 tweets every second. Twitter has 330 million monthly active users and 145 million daily users. A total of 1.3 billion accounts have been created, and 44% of them made an account and left before ever sending a tweet. Based on US accounts, 10% of users write 80% of tweets, and 22% of Americans are on Twitter. Twitter analytics provides a wealth of information that can help you create meaningful Tweets that resonate with your target audience. A few useful insights you can learn from Twitter analytics are:

Tweet engagements and engagement rate show the number of interactions your Tweet has received, as well as the engagement rate, which is engagements divided by impressions. For example, a Tweet with 50 engagements and 1,000 impressions has an engagement rate of 5%.

Brand monitoring is the best way to observe what people are saying about your business on the internet.

Competitor tracking has become an essential business intelligence task, as it helps companies know what their competitors are up to and come up with counter strategies.

In this blog, I explain various open source real-time data integration tools and then use one such tool, Talend, to build a pipeline for real-time data integration. I cover the various Talend products, the architecture of the pipeline, best practices and error handling in Talend, and an overview of the components involved in the pipeline.

There are a number of free and open source ETL tools such as:

Apache Airflow is a platform that allows you to programmatically author, schedule and monitor workflows. The tool enables users to author workflows as directed acyclic graphs (DAGs).
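
To give a flavor of authoring a workflow as a DAG, here is a minimal sketch, assuming Airflow 2.x; the DAG id, schedule, and task are illustrative placeholders, not part of the pipeline built later in this post:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder task body; a real task would pull data from a source system.
    print("extracting data")

# A one-task DAG that runs hourly; all names and the schedule are illustrative.
with DAG(
    dag_id="example_etl",
    start_date=datetime(2020, 11, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract", python_callable=extract)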

Apache Kafka is a distributed streaming platform that enables users to publish and subscribe to streams of records, store streams of records, and process them as they occur. Kafka is most notably used for building real-time streaming data pipelines and applications, and is run as a cluster on one or more servers.

Apache NiFi is a system used to process and distribute data, and offers directed graphs of data routing, transformation, and system mediation logic. It features a web-based user interface that enables users to toggle between design, control, feedback, and monitoring.

Talend Open Studio for Data Integration is a free and open source ETL tool. It provides users with a graphical design environment, ETL and ELT support, versioning, and enables the exporting and execution of standalone jobs in runtime environments.

CloverETL (now CloverDX) was one of the first open source ETL tools. The Java-based data integration framework was designed to transform, map, and manipulate data in various formats.

The different Talend products available for real-time integration are:

Data Fabric: Talend Data Fabric combines the platform editions of Talend products into a common set of powerful, easy-to-use tools for real-time or batch processing, data or application integration, big data or master data management, on premises or in the cloud.

Big Data Integration: The Big Data Integration platform delivers high-scale, in-memory, fast data processing as part of the Talend Data Fabric solution, so your enterprise can turn more and more data into real-time decisions.

Cloud Integration: Integration Cloud puts powerful graphical tools, pre-built integration templates, and a rich library of components at your fingertips, letting you quickly integrate and cleanse data generated by your organization, your customers, and beyond.

Application Integration: Application Integration provides a high-speed service backbone and enables you to build a service-oriented architecture to connect, mediate, and manage services in real time.

In this example, the pipeline extracts a real-time feed from the Twitter API using Python and pushes it to a Kafka server. It then reads the feed from Kafka and pushes it to Azure Blob Storage. Below is the architecture diagram:

Architecture Diagram

Below are the steps to build the pipeline:

  1. Install Talend Open Studio for Big Data. It is open source and available on the Talend website. Download Open Studio from the official website, extract the zip file, and open it directly by running the exe file in the folder. https://www.talend.com/products/big-data/big-data-open-studio/
  2. Install Kafka on your system. You can use the link below to install Kafka on Windows: https://medium.com/@shaaslam/installing-apache-kafka-on-windows-495f6f2fd3c8 . Once Kafka is installed, create a topic, which we will use in the Python script and in Talend to send and extract data. The link above contains the procedure to create a topic; a small Python sketch of topic creation also follows this list.
  3. You need an Azure account. Create a folder in ADLS Gen2 to store the data. If you do not have an Azure account, you can create a free one and then create a storage account in Azure. Use the link below for reference: https://docs.microsoft.com/en-us/azure/storage/common/storage-account-create?tabs=azure-portal
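
As referenced in step 2, the topic can also be created from Python. This is a minimal sketch assuming the kafka-python package, a broker running at localhost:9092, and a placeholder topic name twitter_stream:

from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the local broker started in step 2 (the address is an assumption).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create a single-partition topic; "twitter_stream" is a placeholder name.
admin.create_topics([NewTopic(name="twitter_stream", num_partitions=1, replication_factor=1)])
admin.close()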

Steps to Create Twitter API credentials:

First, we need to create a Twitter developer account to get API credentials.

Log in to your Twitter account.

Go to the URL below and create your account access.

https://developer.twitter.com/en/account/environments

You can refer to the details below for reference:

Details to be filled while creating API account

Get the keys and tokens and store them in Notepad. You will need them in your Python script.

Now, open Talend Studio. First, create a new Job. To create a new Job, right-click on Job Design, choose Create Job, and give the Job a name and description.

Below is the Python script that needs to be run using tSystem_1 in Talend, after starting the Kafka server.

Python code to ingest data from the Twitter API and send it to the Kafka server
Input: feeds from the Twitter API
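
The original script is embedded as an image in the post; as a stand-in, here is a minimal sketch assuming tweepy 3.x and kafka-python, with placeholder credentials, the placeholder topic name twitter_stream, and an example keyword filter:

import tweepy
from kafka import KafkaProducer

# Keys and tokens from the Twitter developer portal (placeholders).
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

KAFKA_TOPIC = "twitter_stream"  # placeholder topic name
producer = KafkaProducer(bootstrap_servers="localhost:9092")

class KafkaForwarder(tweepy.StreamListener):
    """Forwards every incoming tweet (raw JSON) to the Kafka topic."""
    def on_data(self, raw_data):
        producer.send(KAFKA_TOPIC, raw_data.encode("utf-8"))
        return True

    def on_error(self, status_code):
        # Stop streaming if Twitter starts rate limiting (HTTP 420).
        return status_code != 420

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
stream = tweepy.Stream(auth=auth, listener=KafkaForwarder())
stream.filter(track=["data"], languages=["en"])  # example keyword filter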

Build the Kafka connection. Connect tKafkaInput_1, tFileOutJSON_1, and tLogRow_1. This subjob takes data from the Kafka topic and creates a JSON file on the local disk.

Talend Job 1 - to extract data from Kafka topic

Details on each component used in the Job:

tSystem_1: Executes one or more system commands. It can call other processing commands that are already up and running in a larger Job.

tSystem_1 Detail

tKafkaConnection_1: Opens a reusable Kafka connection. The tKafkaConnection component opens a connection to a given Kafka cluster so that the other Kafka components in subjobs can reuse this connection.

tKafkaConnection_1 Detail

tKafkaInput_1: Transmits the messages you need to process to the components that follow in the Job you are designing. tKafkaInput consumes messages from a Kafka topic and passes them to the Job that runs transformations over these messages.

tKafkaInput_1 Detail

tFileOutJSON_1: It writes data to a JSON structured output file. It receives data and rewrites it in a JSON structured data block in an output file.

tFileOutJSON_1 Detail

tLogRow_1: Displays data or results in the Run console. It is used to monitor data processed.

tLogRow_1 Detail
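
For readers without Open Studio at hand, here is a rough Python equivalent of what this first Job does: a sketch assuming kafka-python, the same placeholder topic name, and a local output file tweets.json:

from kafka import KafkaConsumer

# Read from the same placeholder topic the producer script writes to.
consumer = KafkaConsumer(
    "twitter_stream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)

with open("tweets.json", "a", encoding="utf-8") as out:
    for message in consumer:
        record = message.value.decode("utf-8")
        print(record)               # roughly what tLogRow_1 shows in the Run console
        out.write(record + "\n")    # roughly what tFileOutJSON_1 writes to local disk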

Next, build the Azure connection and connect tAzureStoragePut_1. It loads the data from local storage to the Azure blob.

Talend Job 2 — to push data to Azure Blob
Output: Twitter feeds stored in Azure Blob

tAzureStorageConnection_1: It opens a connection to a given storage account in Microsoft Azure Storage. It uses the authentication and protocol information you provide to create a connection to the Microsoft Azure Storage system and enables the reuse of this connection by the other Azure Storage components.

tAzureStorageConnection_1 Detail

tAzureStoragePut_1: It connects to a given Azure storage account and uploads local files into a given container of that account. It allows you to upload a whole folder, or a number of selected files from that folder, from a local machine into a given Azure container.

tAzureStoragePut_1 Detail
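
And a rough Python equivalent of this second Job: a sketch assuming the azure-storage-blob (v12) SDK, a placeholder connection string, and a placeholder container name:

from azure.storage.blob import BlobServiceClient

# Connection string from the storage account's Access keys blade (placeholder).
service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(container="twitter-feeds", blob="tweets.json")

# Upload the local file produced by the first Job.
with open("tweets.json", "rb") as data:
    blob.upload_blob(data, overwrite=True)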

Use the link below to get details on all the components of Talend:

Best Practices

Below is a list of best practices for building pipelines in Talend:

Canvas Workflow & Layout: There are many ways to place components on the job canvas, and just as many ways to link them together. My preference is to fundamentally start ‘top to bottom’, then work ‘left and right’ where a left bound flow is generally an error path, and a right and/or downward bound flow is the desired, or normal path. Avoiding link lines that cross over themselves wherever possible is good practice.

Atomic Job Modules (Parent/Child Jobs): Big jobs with lots of components are, simply put, hard to understand and maintain. Avoid this by breaking them down into smaller jobs, or units of work, wherever possible. Then execute them as child jobs from a parent job (using the tRunJob component) whose purpose includes controlling their execution. This also creates the opportunity to handle errors better and to control what happens next.

tRunJob vs Joblets: The simple difference between a child job and a joblet is that a child job is 'called' from your job, while a joblet is 'included' in your job. Both offer the opportunity to create reusable and/or generic code modules. A highly effective strategy in any Job Design Pattern is to incorporate their use properly.

Entry and Exit Points: All Talend Jobs need to start and end somewhere. Talend provides two basic components: tPreJob and tPostJob whose purpose is to help control what happens before and after the content of a job executes.

Error Handling and Logging

This is very important, perhaps critical, and if you create a common job design pattern properly, a highly reusable mechanism can be established across almost all your projects.

OnSubJobOK/ERROR vs OnComponentOK/ERROR (& Run If) Component Links: ‘Trigger Connections’ between components define the processing sequence and data flow where dependencies between components exist within a subjob.

An ‘On Subjob OK/ERROR’ trigger will continue the process to the next ‘linked’ subjob after all components within the subjob have completed processing. It should be used only from the starting component in the subjob. An ‘On Component OK/ERROR’ trigger will continue the process to the next ‘linked’ component after that particular component has completed processing. A ‘Run If’ trigger can be quite useful when continuation of the process to the next ‘linked’ component depends on a programmable Java expression.

You can refer to the link below to get more information on data integration using Talend:

Summary

Talend is one of my favorite and easiest-to-use ETL tools. I hope this post met its objective of providing you with additional clarity. Kindly reach out to me if you have any questions.

Abhishek Bhagat
Better Data Platforms

Big Data, Data Science, Data Lakes, and Cloud Computing enthusiast.