This seems to be a lot harder and taking a lot longer that I initially thought.

How hard can it be to pull data from several applications into a single EDW, right?

Well, here is a monkey wrench for you. Just about every application has some amount of information about employees, vendors, customers, etc. For right now let us narrow the scope and just look at employees.

We have Dynamics GP that contains the employee first name, last name, etc. However, there is also Active Directory that has the employee first name, last name, etc. In addition, our health care application, and, and, and, well you get the picture. The question becomes if the last name of a specific employee does not agree between these systems for whatever reason. Which one is correct? (Side note: keep an eye out for a future blog series on how to notify people when this occurs.) When pulling the data from the various application databases into a single EDW which application do I really want to use as the "Master" for a specific field like Last Name. Therefore, before I begin pulling information into the EDW let us define some specific types of data. No, I do not mean SQL data types I mean a conceptualized view of what data is, and where and what it means to a specific application. Now, as I am not the greatest person in the world when it comes to making up with names please bear with me, and if after you have read this and you think there is a better name or something I have missed by all means do not be shy and let me know. (Side note: I am also writing this for people at my company so I would appreciate any help what so ever.)

 Anyway, let us invent some terms, definitions, and a few rules for the data that will end up in our EDW.

Concentric Data

Concentric Data is information that is within the sphere of an application and is necessary for that application to operate properly. If this data is incorrect or absent, the application will fail in some manor or will not let you proceed! Example: You cannot enter an employee into Dynamics GP without a unique employee ID. It can be auto generated or manually entered, depending on your Dynamics GP setup. But it must exist.

Superfluous Data

Superfluous Data is information that is not necessary for the application to operate properly yet added to the application for reference or completeness. Example: You do not have to enter an employee's home phone number in Dynamics GP. It would be nice to have if you wanted to call them for some reason but Dynamics GP will operate just fine without it.

Solitary Data

Solitary Data is information that is only contained in one application. Example: Across all of our systems, Dynamics GP is the only system that has an employee's rate of pay. No other system has that specific piece of data.

Relational Data

Relational Data is information that is necessary to connect information from one application to information in another application or applications. Example:  In our healthcare application, each employee has a unique Worker ID, that same Worker ID value in that application is the Employee ID in Dynamics GP. Therefore, I can pull information from our health care application and match it to information from Dynamics GP for any specific employee. In addition, while I would like at least one field-value across all systems. I accept the fact that may not be possible, but I do want as much Relation Data as I can get. The reason is that I can crosscheck the information via multiple links just in case something is messed up somewhere. If Worker ID 12345 in our healthcare application is a woman named Rose, and Employee ID 12345 in Dynamics GP is a man named Fred, then something is wrong. Both applications also have the employee's social security number so I can "fall back" on that to determine which application has the incorrect Worker ID / Employee ID, and do something to get it fixed before it goes into the EDW.

Data Jurisdiction

Data Jurisdiction is the application that controls the "Master" copy of the data. If data from one of more systems do not coincide, the application that has jurisdiction takes precedence over the offending information. Only one application can have Jurisdiction on a specific piece of information. Note, that just because an application needs a specific piece of data to work (Concentric Data above) does not mean that the application has Jurisdiction over that data. Example: Our healthcare application must have something in the "Nurses Licenses" field in order for that employee to make visits to patients. While we try to enter the actual number from their license there are times when it is incorrect for whatever reason. However, our Human Resources (HR) department wants a copy of their license to put into their personnel folder at Home Office. Thus, when HR enters the employee information into Dynamics GP, they are actually looking at a copy of the license. This is also, where we pull the information from when Auditors walk in the door. (Have to love the Health Care Industry). Thus while this piece of data is Concentric to our healthcare application, Dynamics GP actually has Jurisdiction over this piece of data.

Data Rules: 

(as implied from the above terms and definitions)

  • 1. Only one application can have jurisdiction on a specific piece of information.
  • 2. While data may be concentric to one application, another application may have jurisdiction.
  • 3. All systems must have as much Relational Data as possible.
  • 4. All systems should have as much Superfluous Data as possible because not all employees have access to all systems.
  • 5. Solitary Data must be at a bare minimum.



Headed to Convergence tomorrow (Friday 4/8/2011)
Hope to see you there,