Pentaho Data Integration (PDI)

Pentaho Data Integration (PDI) is one of the most powerful open-source ETL tools on the market and one of the key tools in the Pentaho BI Suite. Also known as ‘Kettle’, Pentaho Data Integration (PDI) is a powerful extract, transform, and load (ETL) solution that uses an innovative metadata-driven approach. Although it is classified as an ETL tool, the classic ETL process (extract, transform, load) has been slightly modified in Kettle: it is composed of four elements, ETTL, which stands for Extraction, Transformation, Transportation, and Loading.

Features Of Pentaho Data Integration :

Some of the most important features of Pentaho Data Integration that make it such a powerful tool are as follows :

  • It is 100% Java, with cross-platform support for Windows, Linux, and macOS
  • It is 100% metadata driven
  • Its easy-to-use, graphical design environment for building ETL jobs and transformations results in faster development
  • Since it has no license cost, its total cost of ownership and maintenance cost are lower
  • It has over 100 out-of-the-box mapping objects, including inputs, transforms, and outputs
  • Extended functionality is provided by its pluggable architecture: a simple plug-in architecture for adding your own custom extensions
  • The Data Integration server provides security integration, scheduling, and robust content management, including full revision history for jobs and transformations
  • Lower complexity, as no extra code is generated

Components Of Pentaho Data Integration (PDI) :

Pentaho Data Integration is composed of the following primary components:

  • SPOON : Spoon is a desktop application with a graphical interface and editor for transformations and jobs. Spoon provides a way for you to create complex ETL jobs, and it performs the typical data-flow functions: reading, validating, refining, transforming, and writing data to a variety of different data sources and destinations.
  • PAN : A standalone command-line tool that executes transformations created in Spoon, whether stored as XML files or in a database repository
  • KITCHEN : A standalone command-line tool that executes jobs designed in the Spoon graphical interface, either from XML files or from a database repository. Kitchen runs jobs in batch mode and is usually invoked by a scheduler to start and control the ETL processing
  • CARTE : It is a lightweight Web container that allows you to set up a dedicated, remote ETL server. This provides similar remote execution capabilities as the Data Integration Server. Using Carte we can remotely monitor running Pentaho Data Integration ETL processes through a web browser.
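As a quick sketch of how these command-line components are typically invoked on Linux (assuming a standard PDI install whose scripts live in the data-integration/ directory; the file paths and parameter name below are illustrative):

```shell
# Run a transformation with Pan, at basic logging level
./pan.sh -file=/path/to/my_transformation.ktr -level=Basic

# Run a job with Kitchen, passing a named parameter to the job
./kitchen.sh -file=/path/to/my_job.kjb -param:INPUT_DIR=/data/in

# Start a Carte server on localhost, listening on port 8081
./carte.sh localhost 8081
```

On Windows the equivalent batch scripts are Pan.bat, Kitchen.bat, and Carte.bat, with the same options.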

Repositories in Pentaho Data Integration (PDI) :

Pentaho Data Integration provides two ways of storing transformations, jobs, and database connections, as described below:

  • Pentaho Enterprise Repository : You can save your jobs, transformations, and database connections in the Pentaho Enterprise Repository, which provides content management, collaborative development, and enhanced security.
  • File-Based : If you are not part of a collaborative team and do not want the overhead associated with the Pentaho Enterprise Repository, you can save your jobs and transformations as files on your local device. Your database connection information is saved with your job or transformation. If you select this option, your jobs (.kjb) and transformations (.ktr) are saved in XML format.

Transformations, Jobs, Steps and Hops :

  • Transformation : A transformation is a network of logical tasks called steps. Transformation file names have a .ktr extension.
  • Jobs : A job is a workflow-like model for coordinating resources, execution, and dependencies of ETL activities. Jobs are composed of job hops, job entries, and job settings. Job files have a .kjb extension.
  • Steps : Steps are the building blocks of a transformation, configured to perform the tasks you require.
  • Hops : Hops are the pathways that connect steps together and allow schema metadata to pass from one step to another. They control the condition-based flow of data from one step to the next.
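To make these pieces concrete, here is a trimmed-down sketch of what a saved transformation (.ktr) file looks like: each <step> element is a step, and the <hop> entries under <order> are the hops wiring the steps together. Step names and most of the configuration are simplified for illustration:

```xml
<transformation>
  <info>
    <name>example_transformation</name>
  </info>
  <step>
    <name>Read input</name>
    <type>TextFileInput</type>
    <!-- step configuration omitted -->
  </step>
  <step>
    <name>Write output</name>
    <type>TextFileOutput</type>
    <!-- step configuration omitted -->
  </step>
  <order>
    <hop>
      <from>Read input</from>
      <to>Write output</to>
      <enabled>Y</enabled>
    </hop>
  </order>
</transformation>
```

A real .ktr saved by Spoon contains many more elements (field layouts, logging settings, step placement coordinates), but this is the essential skeleton.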

Due to the features mentioned above, Pentaho Data Integration ranks among the top ETL tools available today. Soon I will be writing more about how to create jobs and transformations in Pentaho Data Integration (PDI).
Thanks for reading.
