You are here

IBM InfoSphere DataStage Performance Tuning

Introduction

Data integration processes are very time and resource consuming. The amount of data and the size of the datasets are constantly growing but data and information are still expected to be delivered on-time. Performance is therefore a key element in the success of a Business Intelligence & Data Warehousing project and in order to guarantee the agreed level of service, management of data warehouse performance and performance tuning have to take a full role during the data warehouse and ETL development process.

Tuning, however, is not always straightforward. A chain is only as strong as its weakest link. In this context, there are five crucial domains that require attention when tuning an IBM Infosphere DataStage environment :

  • System Infrastructure  
  • Network  
  • Database  
  • IBM DataStage Installation & Configuration 
  • IBM DataStage Jobs 

It goes without saying that without a well performing infrastructure the tuning of IBM DataStage Jobs will not make much of a difference. As the first three domains are usually outside the control of the ETL development team, this article will only briefly touch upon these subjects and will mainly focus on the topics related to the developments done within the IBM InfoSphere DataStage layer. There are also major differences between the underlying architecture of the DataStage Server Edition and the DataStage Parallel Edition. This article will only cover performance tuning for the IBM InfoSphere DataStage Enterprise Edition v 8.x.

One of the first steps of performance tuning, is monitoring the current performance of the DataStage jobs. It is very important to understand what step in the job is consuming the most time and resources. To do this analysis several tools and functionalities of IBM Infosphere DataStage can be used.
 

Performance Monitoring Best Practices

Usage of Job Monitor

The IBM InfoSphere DataStage job monitor can be accessed through the IBM InfoSphere DataStage Director. The job monitor provides a useful snapshot of a job's performance at a certain moment of its execution, but does not provide thorough performance metrics. Due to buffering and to some job semantics, a snapshot image of the flow might not be a representative sample of the performance over the course of the entire job. The CPU summary information provided by the job monitor is useful as a first approximation of where time is being spent in the flow. That is why a job monitor snapshot should not be used in place of a full run of the job, or a run with a sample set of data as it does not include information on sorts or similar components that might be inserted automatically by the engine in a parallel job. For these components, the score dump can be of assistance.

Usage of Score Dump

In order to resolve any performance issues it is essential to have an understanding of the data flow within the jobs. To help understand a job flow, a score dump should be taken. This can be done by setting the APT_DUMP_SCORE environment variable to "true” prior to running the job.

When enabled, the score dump produces a report which shows the operators, processes and data sets in the job and contains information about :

  • Where and how data was repartitioned.
  • Whether IBM InfoSphere DataStage has inserted extra operators in the flow.
  • The degree of parallelism each operator has run with, and on which nodes.
  • Where data was buffered.

The score dump information is included in the job log when a job is run and is particularly useful in showing where IBM InfoSphere DataStage is inserting additional components/actions in the job flow, in particular extra data partitioning and sorting operators as they can both be detrimental to performance. A score dump will help to detect superfluous operators and amend the job design to remove them.

Usage of Resource Estimation

Predicting hardware resources needed to run DataStage jobs in order to meet processing time requirements can sometimes be more of an art than a science.

With new sophisticated analytical information and deep understanding of the parallel framework, IBM has added Resource Estimation to DataStage (and QualityStage) 8.x. This can be used to determine the needed system requirements or to analyze if the current infrastructure can support the jobs that have been created.

Within a job design, a new toolbar option is available called Resource Estimation.

This option opens a dialog called Resource Estimation. The Resource Estimation is based on a modelization of the job. There are two types of models that can be created:

  • Static. The static model does not actually run the job to create the model. CPU utilization cannot be estimated, but disk space can. The record size is always fixed. The "best case” scenario is considered when the input data is propagated. The "worst case” scenario is considered when computing record size.
  • Dynamic. The Resource Estimation tool actually runs the job with a sample of the data. Both CPU and disk space are estimated. This is a more predictable way to produce estimates.

Resource Estimation is used to project the resources required to execute the job based on varying data volumes for each input data source.

A projection is then executed using the model selected. The results show the total CPU needed, disk space requirements, scratch space requirements, and other relevant information.

Different projections can be run with different data volumes and each can be saved. Graphical charts are also available for analysis, which allow the user to drill into each stage and each partition. A report can be generated or printed with the estimations.

This feature will greatly assist developers in estimating the time and machine resources needed for job execution. This kind of analysis can help when analyzing the performance of a job, but IBM DataStage also offers another possibility to analyze job performance.

Usage of Performance Analysis

Isolating job performance bottlenecks during a job execution or even seeing what else was being performed on the machine during the job run can be extremely difficult. IBM Infosphere DataStage 8.x adds a new capability called Performance Analysis.

It is enabled through a job property on the execution tab which collects data at job execution time. ( Note: by default, this option is disabled ) . Once enabled and with a job open, a new toolbar option, called Performance Analysis, is made available .

This option opens a new dialog called Performance Analysis. The first screen asks the user which job instance to perform the analysis on.

Detailed charts are then available for that specific job run including:

  • Job timeline
  • Record Throughput
  • CPU Utilization
  • Job Timing
  • Job Memory Utilization
  • Physical Machine Utilization (shows what else is happening overall on the machine, not just the DataStage activity).
Each partition's information is available in different tabs.
 

A report can be generated for each chart.

Using the information in these charts, a developer can for instance pinpoint performance bottlenecks and re-design the job to improve performance.

In addition to instance performance, overall machine statistics are available. When a job is running, information about the machine is also collected and is available in the Performance Analysis tool including:

  • Overall CPU Utilization
  • Memory Utilization
  • Disk Utilization

Developers can also correlate statistics between the machine information and the job performance. Filtering capabilities exist to only display specific stages.

The information collected and shown in the Performance Analysis tool can easily be analyzed to identify possible bottlenecks. These bottlenecks are usually situated in the general job design, which will be described in the following chapter.

General Job Design Best Practices

The ability to process large volumes of data in a short period of time depends on all aspects of the flow and the environment being optimized for maximum throughput and performance. Performance tuning and optimization are iterative processes that begin with job design and unit tests, proceed through integration and volume testing, and continue throughout the production life cycle of the application. Here are some performance pointers:

Columns and type conversions

Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read only the columns required, rather than the entire table. Avoid propagation of unnecessary metadata between the stages. Use the Modify stage and drop the metadata. The Modify stage will drop the metadata only when explicitly specified using the DROP clause.

So only columns that are really needed in the job should be used and the columns should be dropped from the moment they are not needed anymore. The OSH_PRINT_SCHEMAS environment variable can be set to verify that runtime schemas match the job design column definitions. When using stage variables on a Transformer stage, ensure that their data types match the expected result types. Avoid that DataStage needs to perform unnecessary type conversions as it will use time and resources for these conversions.

Transformer stages

It is best practice to avoid having multiple stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations. Try to balance load on Transformers by sharing the transformations across existing Transformers. This will ensure a smooth flow of data.

When type casting, renaming of columns or addition of new columns is required, use Copy or Modify Stages to achieve this. The Copy stage, for example, should be used instead of a Transformer for simple operations including :

  • Job Design placeholder between stages
  • Renaming Columns
  • Dropping Columns
  • Implicit (default) Type Conversions

A developer should try to minimize the stage variables in a Transformer stage because the performance of a job decreases as stage variables are added in a Transformer stage. The number of stage variables should be limited as much as possible.

Also if a particular stage has been identified as one that takes a lot of time in a job, like a Transformer stage having complex functionality with a lot of stage variables and transformations, then the design of jobs could be done in such a way that this stage is put in a separate job all together (more resources for the Transformer Stage).

While designing IBM DataStage Jobs, care should be taken that a single job is not overloaded with stages. Each extra stage put into a job corresponds to less resources being available for every stage, which directly affects the job performance. If possible, complex jobs having a large number of stages should be logically split into smaller units.

Sorting

A sort done on a database is usually a lot faster than a sort done in DataStage. So – if possible – try to already do the sorting when reading data from the database instead of using a Sort stage or sorting on the input link. This could also mean a big performance gain in the job, although it is not always possible to avoid needing a Sort stage in jobs.

Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in other stage types, when not being able to make use of the database sorting power.

If data has already been partitioned and sorted on a set of key columns, specify the ″don't sort, previously sorted″ option for the key columns in the Sort stage. This reduces the cost of sorting and takes more advantage of pipeline parallelism. When writing to parallel data sets, sort order and partitioning are preserved. When reading from these data sets, try to maintain this sorting if possible by using the Same partitioning method.

The stable sort option is much more expensive than non-stable sorts, and should only be used if there is a need to maintain row order other than as needed to perform the sort.

The performance of individual sorts can be improved by increasing the memory usage per partition using the Restrict Memory Usage (MB) option of the Sort stage. The default setting is 20 MB per partition. Note that sort memory usage can only be specified for standalone Sort stages, it cannot be changed for inline (on a link) sorts.

Sequential files

While handling huge volumes of data, the Sequential File stage can itself become one of the major bottlenecks as reading and writing from this stage is slow. Certainly do not use sequential files for intermediate storage between jobs. It causes performance overhead, as it needs to do data conversion before writing and reading from a file. Rather Dataset stages should be used for intermediate storage between different jobs.

Datasets are key to good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order. No repartitioning or import/export conversions are needed.

In order to have faster reading from the Sequential File stage the number of readers per node can be increased (default value is one). This means, for example, that a single file can be partitioned as it is read (even though the stage is constrained to running sequentially on the conductor mode).

This is an optional property and only applies to files containing fixed-length records. But this provides a way of partitioning data contained in a single file. Each node reads a single file, but the file can be divided according to the number of readers per node, and written to separate partitions. This method can result in better I/O performance on an SMP (Symmetric Multi Processing) system.

It can also be specified that single files can be read by multiple nodes. This is also an optional property and only applies to files containing fixed-length records. Set this option to "Yes” to allow individual files to be read by several nodes. This can improve performance on cluster systems.
 
IBM DataStage knows the number of nodes available, and using the fixed length record size, and the actual size of the file to be read, allocates to the reader on each node a separate region within the file to process. The regions will be of roughly equal size.

The options "Read From Multiple Nodes” and "Number of Readers Per Node” are mutually exclusive.

Runtime Column Propagation

Also while designing jobs, care must be taken that unnecessary column propagation is not done. Columns, which are not needed in the job flow, should not be propagated from one stage to another and from one job to the next. As much as possible, RCP (Runtime Column Propagation) should be disabled in the jobs.

Join, Lookup or Merge

One of the most important mistakes that developers make is to not have volumetric analyses done before deciding to use Join, Lookup or Merge stages.

IBM DataStage does not know how large the data set is, so it cannot make an informed choice whether to combine data using a Join stage or a Lookup stage. Here is how to decide which one to use …

There are two data sets being combined. One is the primary or driving data set, sometimes called the left of the join. The other dataset are the reference data set or the right of the join.

In all cases, the size of the reference data sets is a concern. If these take up a large amount of memory relative to the physical RAM memory size of the computer DataStage is running on, then a Lookup stage might crash because the reference datasets might not fit in RAM along with everything else that has to be in RAM. This results in a very slow performance since each lookup operation can, and typically will, cause a page fault and an I/O operation.

So, if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. After the sort is over, the join processing is very fast and never involves paging or other I/O.

Databases

The best choice is to use Connector stages if available for the database. The next best choice is the Enterprise database stages as these give maximum parallel performance and features when compared to 'plug-in' stages. The Enterprise stages are:

  • DB2/UDB Enterprise
  • Informix® Enterprise
  • Oracle Enterprise
  • Teradata Enterprise
  • SQLServer Enterprise
  • Sybase Enterprise
  • ODBC Enterprise
  • iWay Enterprise
  • Netezza Enterprise

Avoid generating target tables in the database from the IBM DataStage job (that is, using the Create write mode on the database stage) unless they are intended for temporary storage only. This is because this method does not allow, for example, specifying target table space, and inadvertently data-management policies on the database can be violated.

When there is a need to create a table on a target database from within a job, use the Open command property on the database stage to explicitly create the table and allocate table space, or any other options required. The Open command property allows to specify a command (for example some SQL) that will be executed by the database before it processes any data from the stage. There is also a Close property that allows specifying a command to execute after the data from the stage has been processed. (Note that, when using user-defined Open and Close commands, locks should be specified where appropriate).

Tune the database stages for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects. Experiment in changing these values to see what the best performance is for the DataStage job. The default value used is low and not optimal in terms of performance.

Finally, try to work closely with the database administrators so they can examine the SQL-statements used in DataStage jobs. Appropriate indexes on tables can deliver a better performance of DataStage queries.

Also try to examine if the job is faster when the indexes are dropped before data loading and recreated after loading data into the tables. Recreation of the indexes also takes some time, so test if this has a performance gain or a performance loss on the total process chain.
 

Conclusion

Performance tuning can be a labor intensive and quite costly process. That is exactly the reason why care for optimization and performance should be taken into account from the beginning of the development process. With the combination of best practices, performance guidelines and past experience, the majority of performance problems can be avoided during the development process.

If performance issues still occur even when performance guidelines have been taken into account during development, then these issues can be tackled and analyzed using the available, discussed tools such as Resource Estimation and Performance Analysis functionalities.