
Saturday, March 26, 2011

ETL Process, Continued

Market Definition

The extract / transform / load (ETL) middleware market comprises vendors that offer tools used in designing and populating analytic databases such as data warehouses, data marts, and operational data stores. Specifically, these tools provide for the periodic extraction, transformation, and integration of data from any number of heterogeneous data sources (frequently transaction databases) into time-variant databases used predominantly for query and reporting (i.e., business intelligence). ETL tools provide developers with an interface for designing source-to-target mappings, transformations, and job control parameters.

Market Forecast

The ETL adoption rate will expand 10%-20% annually during the next three to five years, with the services component (not addressed in this evaluation) growing at a commensurate pace. However, due to price pressure from DBMS vendors offering increasingly competitive products at bundled prices, the financial growth of this market will be in the 5%-15% range annually. ETL market growth will resume, albeit modestly, in 2004, fueled by an aggressive uptake in customer data integration (CDI) efforts and the consolidation of departmental data warehouse efforts into enterprise initiatives. Consolidation within the ETL market itself will not occur during the next few years. Instead, second-tier ETL purveyors will continue to be forced to consider joining forces with business intelligence (BI) vendors. Top-tier ETL vendors, on the other hand, will either absorb or build out enterprise application integration technology along with other emerging forms of data integration (e.g., demand-based enterprise information integration). New entrants (even those from the past two years) will find the market too crowded, noisy, and mature to compete.

Key Findings

ETL buyers in 2004 continue to demonstrate a preference for market leaders. Fearing market consolidation/atrophy, shaky services/support, and DBMS/OS version lag, enterprises are willing to make architectural concessions to embrace a market leader. Indeed, price has entered the selection equation more than in 2002/03, but significant elasticity remains for most Global 2000 enterprises. Nonetheless, the predominant gating factors remain support for the direct extraction of an enterprise's range of data sources (increasingly including commercial business applications) and target environment support.
Therefore, market leaders will continue to be those with the strongest presence-oriented strategies (e.g., boisterous marketing, exclusive partnering) combined with the broadest array of source/target platform support. That said, there is still room for new market leaders to emerge, especially those that take serious innovative risks, offer the requisite technology stack, and generate forceful messaging. Essential criteria for evaluating and selecting ETL vendors include the following:

  • Platform support: Support for an enterprise's existing sources, targets, and execution environments should be paramount. Increasingly, support for non-DBMS sources (e.g., Web services, message queues, log files) and for commercial business application schemas is a critical concern.

  • Transformations: Developers require both a broad palette of selectable data transformations and flexibility in developing and incorporating custom business logic. Built-in data quality functions are the latest differentiator.

  • Data management utilities: Having to go outside the tool for high-performance sorting, job scheduling, and data transport can be a nuisance and maintenance headache.

  • Performance characteristics: As data volumes grow and batch windows shrink, ETL tools must often be able to execute job streams/steps in parallel.

  • Developer environment features: GUI features and flexibility, multideveloper capabilities, code debugging, application versioning, and star-schema helpers are some of the developer niceties to consider.

  • Metadata: Support for a range of metadata sources and interchange standards (e.g., CWM/XMI), metadata browsing/reporting, and repository extensibility can be significant differentiators. 

  • Viability: Even near-term metadata standards do not provide for the porting of ETL applications; therefore, enterprises building out data plumbing should be concerned with long-term support.

Friday, March 18, 2011

ETL process

Implementing an ETL process in DataStage to load the data warehouse

ETL process 

By definition, the ETL process involves three tasks (a minimal sketch follows the list):
  • extract data from operational source or archive systems, which are the primary sources of data for the data warehouse;
  • transform the data, which may involve cleaning, filtering, and applying various business rules;
  • load the data into the data warehouse or into any other database or application that houses data.
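
As a concrete illustration, here is a minimal, tool-agnostic sketch of those three steps in Python. The file name, column names, and target table are hypothetical examples used only for illustration; they are not anything from DataStage itself.

```python
# A minimal, tool-agnostic ETL sketch (illustration only; file and column
# names are hypothetical).
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from an operational export (a CSV file here).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean, filter, and apply a simple business rule.
    out = []
    for r in rows:
        if not r.get("customer_id"):          # filter out incomplete rows
            continue
        out.append({
            "customer_id": r["customer_id"].strip(),
            "country": r.get("country", "").strip().upper(),   # standardize
            "amount": round(float(r.get("amount", 0) or 0), 2),
        })
    return out

def load(rows, db="warehouse.db"):
    # Load: write the cleansed rows into the target table.
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS sales "
                "(customer_id TEXT, country TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:customer_id, :country, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("daily_sales_extract.csv")))
```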

ETL process from a DataStage standpoint

In DataStage, the ETL execution flow is managed by controlling jobs called job sequences. A master controlling job provides a single interface for passing parameter values down to the controlled jobs and can launch hundreds of jobs with the desired parameters. Changing runtime options (for example, when moving a project from the test to the production environment) is done in the job sequence and does not require changing the 'child' jobs.
Controlled jobs can be run in parallel or serially (when a second job depends on the first). In the case of serial execution, it is very important to check whether the preceding set of jobs finished successfully before continuing; a conceptual sketch of this control logic follows.
A typical DataStage ETL process can be broken into several segments, each of which can be realized by a set of DataStage jobs.
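
The sketch below mimics that control logic in plain Python rather than in DataStage: a controller launches 'child' jobs with a single set of runtime parameters and aborts the sequence if a preceding job fails. The job names and the run_etl_job.sh helper are hypothetical stand-ins, not real DataStage artifacts.

```python
# Conceptual sketch of a job-sequence controller (not DataStage code).
# The job names and the run_etl_job.sh helper are hypothetical.
import subprocess

def run_job(job_name, params):
    # Stand-in for launching one ETL job with runtime parameters.
    args = [f"{k}={v}" for k, v in params.items()]
    return subprocess.run(["./run_etl_job.sh", job_name, *args]).returncode

def run_sequence(jobs, params):
    # Serial execution: each job runs only if the preceding one succeeded.
    for job in jobs:
        status = run_job(job, params)
        if status != 0:
            raise RuntimeError(f"Job {job} failed (status {status}); aborting sequence")

if __name__ == "__main__":
    # A single place to change runtime options (e.g., test vs. production).
    runtime_params = {"ENVIRONMENT": "test", "LOAD_DATE": "2011-03-18"}
    run_sequence(["extract_orders", "transform_orders", "load_orders"], runtime_params)
```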

Top Ten Features In DataStage Hawk


The IILive2005 conference was the first public presentation of the functionality in the WebSphere Information Integration Hawk release. Though it was a few years back, I am sharing the top ten things I am looking forward to in DataStage Hawk:

1) The metadata server

Using MetaStage is kind of like bathing in the ocean on a cold morning. You know it's good for you, but that doesn't stop it from freezing the crown jewels. MetaStage is good for ETL projects, but none of the projects we've been on has actually used it: too much effort is required to install the software, set up the MetaBrokers, migrate the metadata, learn how the product works, and write reports. Hawk brings the common repository and improved metadata reporting, so we can get the positive effects of bathing in sea water without the shrinkage that comes with it.

2) QualityStage overhaul

Data quality reporting can be another forgotten aspect of data integration projects. Like MetaStage, the QualityStage server and client carried an additional install, training, and implementation overhead, so many DataStage projects did not use it. I am looking forward to more integration projects using standardisation, matching, and survivorship to improve quality once these features are more accessible and easier to use; a rough sketch of those three steps follows.
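
To make the terms concrete, here is a toy illustration of standardisation, matching, and survivorship in Python. It is only a conceptual sketch: the records, the similarity threshold, and the helper functions are made up, and QualityStage's actual probabilistic matching is far more sophisticated.

```python
# Conceptual illustration of standardisation, matching, and survivorship
# (records and threshold are invented; not how QualityStage works internally).
from difflib import SequenceMatcher

def standardise(rec):
    # Standardisation: normalise case and whitespace so records are comparable.
    return {k: " ".join(str(v).upper().split()) for k, v in rec.items()}

def is_match(a, b, threshold=0.85):
    # Matching: treat two customer records as duplicates if their names are
    # similar enough (a crude stand-in for probabilistic matching).
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold

def survive(a, b):
    # Survivorship: keep the most complete value from each duplicate pair.
    return {k: a[k] if len(a[k]) >= len(b[k]) else b[k] for k in a}

r1 = standardise({"name": "john  smith", "phone": "555-0100"})
r2 = standardise({"name": "Jon Smith",   "phone": ""})
if is_match(r1, r2):
    print(survive(r1, r2))   # one surviving "golden" record
```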

3) Frictionless Connectivity and Connection Objects

We've called DB2 every rude name under the sun. Not because it's a bad database, but because setting up remote access takes anywhere from five minutes to five weeks, depending on how obscure the error message is and how hard it is to find the setup step that was missed during installation.

4) Parallel job range lookup

We're looking forward to this one, if only because it will stop people asking for it on the forums. It looks good: it's been merged into the existing lookup form and seems easy to use. I will be interested to see the performance; a sketch of what a range lookup does follows.
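
For anyone who has not met the feature, the sketch below shows what a range lookup does in principle: an input value is matched against the reference row whose low/high bounds contain it, rather than against an exact key. The bands and values are invented purely for illustration; this is not DataStage code.

```python
# Sketch of a range lookup (illustrative data only):
# each input value is matched to the reference row whose low/high range
# contains it, instead of requiring an exact key match.
rate_bands = [
    {"low": 0,    "high": 999,   "discount": 0.00},
    {"low": 1000, "high": 4999,  "discount": 0.05},
    {"low": 5000, "high": 99999, "discount": 0.10},
]

def range_lookup(amount, bands):
    for band in bands:
        if band["low"] <= amount <= band["high"]:
            return band["discount"]
    return None   # no band matched; handle as a lookup failure / reject

for order_amount in (250, 1200, 7600):
    print(order_amount, range_lookup(order_amount, rate_bands))
```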

5) Slowly Changing Dimension Stage

This is one of those things that Informatica was able to trumpet in product comparisons: that it has more out-of-the-box data warehouse support. There are a few enhancements that make updates to dimension tables easier: the improved surrogate key generator, the Slowly Changing Dimension stage, and updates passed to in-memory lookups. That's goodbye to DBMS-generated keys; I'll be doing the keys in the ETL job from now on! DataStage server jobs have the hashed file lookup, which you can read from and write to at the same time; parallel jobs will get the updateable lookup. A bare-bones sketch of Type 2 slowly changing dimension logic follows.
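
For context, here is a bare-bones sketch of Type 2 slowly changing dimension handling with an ETL-side surrogate key. It is a conceptual illustration only; the column names and the in-memory 'dimension' list are made up and say nothing about how the new stage is implemented.

```python
# Bare-bones sketch of Type 2 slowly changing dimension logic with an
# ETL-side surrogate key (illustrative only; column names are invented).
from datetime import date

dimension = []            # existing dimension rows
next_surrogate_key = 1    # surrogate keys generated in the ETL job, not the DBMS

def apply_scd2(business_key, attributes, load_date):
    global next_surrogate_key
    current = next((r for r in dimension
                    if r["business_key"] == business_key and r["current"]), None)
    if current and current["attributes"] == attributes:
        return current["surrogate_key"]          # no change: reuse existing row
    if current:
        current["current"] = False               # expire the old version
        current["end_date"] = load_date
    row = {"surrogate_key": next_surrogate_key, "business_key": business_key,
           "attributes": attributes, "start_date": load_date,
           "end_date": None, "current": True}
    dimension.append(row)
    next_surrogate_key += 1
    return row["surrogate_key"]

apply_scd2("CUST-42", {"city": "Austin"}, date(2011, 1, 1))
apply_scd2("CUST-42", {"city": "Dallas"}, date(2011, 3, 1))   # adds a new version row
print(dimension)
```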

6) Collaboration: better developer collaboration

Everyone hates opening a job and being told it is locked. Under Hawk you can open a read-only copy of a locked job, plus you get told who has locked the job, so you know whom to curse.

7) Session Disconnection

Accompanied by the metallic cry of "Exterminate! Exterminate!", an administrator can disconnect sessions and unlock jobs.

8) Improved SQL Builder

Getting the SQL builder to build complex SQL is a bit like teaching a monkey how to play chess. What I do like about the current SQL builder is that it synchronises your SQL select list with your ETL column list to avoid column mismatches (the kind of check sketched below). I am hoping the next version is more flexible and can build more complex SQL.
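
The kind of mismatch the builder guards against is easy to picture. The sketch below compares a hypothetical ETL column list with the columns named in a simple SELECT statement; the naive parsing is purely illustrative and has nothing to do with how the SQL builder actually works.

```python
# Sketch of the column-mismatch check the SQL builder automates for you
# (hypothetical helper; the real check happens inside the designer GUI).
import re

def select_columns(sql):
    # Very naive parse of the select list of a simple "SELECT ... FROM" query.
    select_list = re.search(r"select\s+(.*?)\s+from", sql, re.I | re.S).group(1)
    return [c.split()[-1].strip().upper() for c in select_list.split(",")]

etl_columns = ["CUSTOMER_ID", "ORDER_DATE", "AMOUNT"]
sql = "SELECT customer_id, order_date FROM orders"

missing = [c for c in etl_columns if c not in select_columns(sql)]
if missing:
    print("Column mismatch, missing from SQL select list:", missing)  # AMOUNT
```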

9) Improved job startup times

Small parallel jobs will start up and run faster. I call it the death of a thousand cuts: your very large parallel batch takes too long to run because a thousand smaller jobs are starting and stopping at the same time and cutting into CPU and memory. Hawk makes these cuts less painful.

10) Common logging

 Log views that work across jobs, log searches, log date constraints, wildcard message filters, saved queries. It's all good. You no longer need to send out a search party to find an error message.

That's my top ten. We are also hoping the software comes in a box shaped like a hawk and makes a hawk scream when you open it, a bit like those annoying greeting cards. Is there any functionality you think Hawk is missing that you really want to see?

Data Warehouse

A data warehouse is a storage space for data tables; it is not a software program. It is accessible through desktop tools and gives users access to in-depth information on a variety of organizational data.

A data warehouse is one large data repository of all business-related information, including all historical data of the organization implementing it.

Key applications built on data warehousing include decision support systems (DSS) and data mining (DM).

In all cases, organizations use data warehousing to gain a competitive advantage and to support decision-making processes through comprehensive data analysis.

A data warehouse is a place where data is stored for archival, analysis, and security purposes.

Usually a data warehouse runs on either a single server or many servers tied together to act as one large system.

The data can be raw or formatted and can cover a variety of topics, including an organization's sales, salaries, operational data, summaries and reports, copies of data, human resources data, inventory data, and external data used for simulations and analysis.

Besides being a storehouse for large amounts of data, a data warehouse must have systems in place that make it easy to access the data and use it in day-to-day operations.

A data warehouse is sometimes said to be a major part of a decision support system: a way for an organization to use its data to uncover facts, trends, or relationships that help it make effective decisions or create effective strategies to accomplish its goals.

Data warehouse environments are often contrasted with Online Transaction Processing (OLTP) systems, which are operational databases built for speed and ease of use in day-to-day transactions. The warehouse itself supports Online Analytical Processing (OLAP), which is harder to use but adds an extra layer of analysis on top of the data; the sketch below contrasts the two kinds of query.
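
To make the contrast concrete, the sketch below runs an OLTP-style single-row lookup and an OLAP-style aggregation against a tiny in-memory SQLite table. The table and data are invented purely for illustration.

```python
# Sketch contrasting an OLTP-style query with an OLAP-style query
# (in-memory SQLite with made-up data, purely for illustration).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, year INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "EAST", 2010, 100.0), (2, "EAST", 2011, 150.0),
    (3, "WEST", 2010, 200.0), (4, "WEST", 2011, 250.0),
])

# OLTP: look up a single transaction, fast and narrow.
print(con.execute("SELECT amount FROM sales WHERE order_id = 3").fetchone())

# OLAP: scan and aggregate history to spot trends across dimensions.
print(con.execute(
    "SELECT region, year, SUM(amount) FROM sales GROUP BY region, year"
).fetchall())
```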