On the process of data assessment

Proper decision making is based on accurate facts. The step of assessing the situation is especially important when we have to deal with large amounts of intricate data. In this case we need the help of a computer to crunch the data and to provide an overview that can answer our questions. But how do we know what to instruct the computer to do?

I recently had the opportunity to work on a prototype to analyze a large system. The system was composed of several hundred subsystems, each written in Java. However, the overall system was based on a custom engine that was put together via a multitude of custom configurations written in XML and in another custom language. Given the particularities of the project, a dedicated tool was needed to analyze it. The goal of this pilot project was to craft such a tool, one able to provide an overview of these configurations and expose their various relationships.

After 10 days of work, a prototype emerged that offered several surprising new perspectives on the system. I created the tool using the infrastructure provided by Moose. It was an interesting experience compressed in a short amount of time, and I would like to summarize the process from the point of view of this particular episode.


The first step in any assessment is to figure out what the important questions are that should be answered from the data. This is never easy, but it is crucial. Without it, your answers might be correct, but they will not necessarily provide any value. However, in many cases people expect that you already know the important questions. This is especially true when you call yourself an assessment expert. But, just like an expert in software development masters the technological space without necessarily knowing a priori the exact requirements of a client, an assessment expert only knows how to recognize a relevant question when he sees it, and he can help find a way to provide answers. You will never get the perfect overview, but it is critical that you try to learn the context as much as possible.

The second step of a data assessment is to get your hands on the data. This might look straightforward, but it does not come without pitfalls. A set of pitfalls that is not to be underestimated is human-related. The first is the difficulty of actually getting to see the data. Important data is typically guarded. The more important the data and the larger the organization, the more layers of guardians there are. The second difficulty is that even if you have the approval, it might still be hard to get to the actual data. This might happen because the approval still needs to reach the people that know how to provide you with the data, and these people must be willing to give it. But even if they do want to send you the data, they might not send you the right or the complete data. As the success of the project depends on it, it is in your interest to make sure that all the parties involved understand what is needed, and to double-check the correctness of the data.

But, I digress. Suppose you do get to see the data. Still, what you will get is raw data. This kind of data is usually not available in the format best suited for analysis, and we typically want to build a model out of it that better accommodates our needs. For this, we need a meta-model that defines the structure of the model.

In theory, designing the meta-model looks straightforward: you look at the data and you recover the structure that best fits your analysis needs. In practice, however, it’s anything but straightforward. To begin with, data is typically full of noise and you have to figure out what part of it is actually interesting for the problem at hand. To learn about the interesting parts, you have to know what the data is good for, and for this you need to know the reasons behind it. This typically implies reading documentation, interviewing people that use or create the data, and triangulating these sources.

[Figure: meta-model example (UML diagram)]

Once you understand the raw data, you have to map the questions you want to answer onto it. You arm yourself with questions like “What pieces of data do I need to answer my questions?” and “How do I best correlate these pieces to get my answers?”, and you move on to modeling them together with the data. The resulting meta-model can look like the UML diagram to the right, and it serves as a map that helps us navigate through our model. For example, it can document the path to take if we want to identify the dependencies between entities A and B.
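To make this concrete, here is a minimal sketch, in Python, of what such a meta-model could look like in code. The entity and relation names (Subsystem, Configuration, Dependency) and the navigation path are assumptions made up for the example, not the ones from the actual project.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative meta-model: the entity and relation names below are
# assumptions for the example, not the ones from the actual project.

@dataclass
class Configuration:
    name: str

@dataclass
class Subsystem:
    name: str
    configurations: List[Configuration] = field(default_factory=list)

@dataclass
class Dependency:
    source: Subsystem
    target: Subsystem
    via: Configuration  # the configuration that induces the dependency

@dataclass
class SystemModel:
    subsystems: List[Subsystem] = field(default_factory=list)
    dependencies: List[Dependency] = field(default_factory=list)

    def dependencies_between(self, a: Subsystem, b: Subsystem) -> List[Dependency]:
        # Follow the path documented by the meta-model:
        # Subsystem -> Dependency -> Subsystem.
        return [d for d in self.dependencies if d.source is a and d.target is b]
```

The point of the meta-model is precisely this: it makes the navigation path explicit, so that a question like “how do A and B depend on each other?” translates into a well-defined traversal of the model.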

Now that we have the meta-model, we need to populate the model with the actual data. For this, we need to process the data in some way. This might involve some form of parsing, mining, queries over a database, or accessing data through some other communication channels.

[Figure: dependencies map example]

In my case, I needed several parsers to accommodate the various formats, each of them being responsible for populating a part of the model. In particular, I built several XML parsers and one based on PetitParser, a very nice parsing infrastructure developed by Lukas Renggli.
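The actual parsers were part of the Moose-based tool, with one of them built on PetitParser. Purely as an illustration of the idea, and continuing the Python sketch from above, an XML importer could look roughly like the following; the element and attribute names (subsystem, configuration, name) are invented for the example.

```python
import xml.etree.ElementTree as ET

def populate_from_xml(model: SystemModel, path: str) -> None:
    # Parse one configuration file and add its entities to the model.
    # The element and attribute names are invented for the example;
    # a real importer follows the actual configuration schema.
    root = ET.parse(path).getroot()
    for node in root.iter("subsystem"):
        subsystem = Subsystem(name=node.get("name"))
        for conf in node.iter("configuration"):
            subsystem.configurations.append(Configuration(name=conf.get("name")))
        model.subsystems.append(subsystem)
```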

Once the model is shaped in the way we want and is populated with the relevant data, we can start toying with questions. There are countless possibilities, each having strengths and weaknesses depending on the problem. We will get into these on another occasion.

In my project, I needed to first provide an overview, and then to offer the possibility to learn more details about the interesting parts. Thus, I built a browser that combined queries for detecting high-level dependencies, visualization for displaying them, and interactive navigation for allowing the user to obtain more details. An example of what a dependencies map can look like is displayed to the right. For crafting the visualizations I used Mondrian, and for building the overall browser I used Glamour.
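The real browser was interactive and built with Mondrian and Glamour. As a non-interactive stand-in that conveys the idea, a sketch like the following (continuing the Python example) could turn the high-level dependencies recorded in the model into a Graphviz DOT description that can be rendered as a map.

```python
def dependency_map_dot(model: SystemModel) -> str:
    # Render the high-level dependency map as Graphviz DOT text;
    # only a stand-in for the interactive Mondrian/Glamour browser.
    lines = ["digraph dependencies {"]
    for d in model.dependencies:
        lines.append(f'  "{d.source.name}" -> "{d.target.name}" [label="{d.via.name}"];')
    lines.append("}")
    return "\n".join(lines)
```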

Of course, like any tool, this one too needed testing. This was done both via regression tests and by running it together with the user. Letting the user experience the tool and the results of the analysis has at least two benefits. First, it provides intuitive feedback regarding the accuracy and coverage of the analysis. Second, it provides an opportunity to educate him in how to interpret the results of the analysis.
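For the regression part, the idea is to pin the model built from a known snapshot of the data and check that later changes to the importers do not silently alter it. A minimal sketch, continuing the Python example and with a made-up fixture file and expected counts, could look like this:

```python
import unittest

class ImporterRegressionTest(unittest.TestCase):
    def test_known_snapshot(self):
        # The fixture file and the expected numbers are made up; a real
        # regression test pins the model built from a known snapshot of
        # the configurations.
        model = SystemModel()
        populate_from_xml(model, "fixtures/sample-configuration.xml")
        self.assertEqual(len(model.subsystems), 3)
        self.assertTrue(all(s.name for s in model.subsystems))

if __name__ == "__main__":
    unittest.main()
```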


This was a short overview. Now, imagine all these steps carried out in iterations, add several missteps due to incomplete information, and you get an approximate picture of the overall process.

In conclusion, I would like to stress a couple of points:

  • Data assessment requires a driving set of questions and access to data.
  • Specific questions and custom data require specific answers provided by custom tools.
  • Building a data assessment tool is similar to developing any software project and thus appropriate attention and investment should be allocated to it.
  • Using a proper infrastructure can significantly decrease the overall cost.
  • Figuring out what data is important and how to expose it is not straightforward and requires dedicated training.
  • A good tool does not guarantee magic decisions, but it can provide accurate input for them.
Posted by Tudor Girba at 16 February 2010, 5:08 am with tags assessment, presentation, moose