You land in a foreign city. You have no map and no compass, but you have access to a huge set of detailed pictures of all the streets and buildings from around you. Now picture yourself needing to find the way to the hotel on the other side of the city. How would that be for an experience?
You land in the middle of a software system. You have no map, but you have access to a huge set of text files with all the details of the system. Now picture yourself trying to figure out where the center is. How would that be for an experience?
In the first case, we figured out a long time ago that we could do better. We constructed compasses. We came up with the sextant. We built maps. Lots of maps, not just for finding our way, but for figuring out where everything is. And we did not stop there. We went to great trouble to send satellites into space just to help us get an up-to-date picture. We developed smarter systems that figure out the best paths for us, and built GPS systems to make sure we know at all times where we are and where we are heading. We then figured out that personalized maps are even better than generic ones. So we built huge systems that allow each of us to craft maps that highlight selected things. And more is happening every day in this area. All in all, a great expense. But it was all worth it.
In the second case, we are still mostly stuck with a sea of low-level details. We could and should do better.
Digital data has no physical shape. While this allows us to easily manipulate large amounts of data, it poses a problem when it comes to understanding this data and assessing its state. The lack of physical shape renders useless our built-in skill of perceiving the world around us through visual stimuli.
Visualization aims to solve this problem by offering a visual skin to data. “A picture tells a thousand words” goes the old adage. And so it does, but only if the picture is the right one.
What makes a picture appropriate? Well, it has to focus on one or more relevant questions, and it has to take the particularities of data into account.
There are many tools out there providing nice and useful visualizations for interesting questions. However, many of them offer only limited customization possibilities, and this makes them less useful in particular circumstances.
To address this issue, in 2006, Michael Meyer and I (more recently joined by Alex, who took over the main maintenance tasks) built Mondrian, a scripting engine based on Smalltalk that allows us to craft custom visualizations for custom data. Mondrian is an important part of the Moose platform and can be tried by downloading the latest Moose release.
The name Mondrian comes from the famous Dutch painter who used to see the world in rectangles and lines. In a similar manner, our Mondrian views data through a graph lens, with nodes connected to each other via edges.
But enough talk, let’s draw a simple painting. For the purpose of this exercise, let’s take as a case study the problem of exposing the dependencies between the namespaces (or packages) of a software system. One idea is to visualize the situation as a graph in which namespaces are nodes and the dependencies between them are edges.
We start with a fresh canvas:
| view |
view := MOViewRenderer new.
view open.
A fresh canvas is just too simple even for a beginner. So, let’s fill it with a node for each namespace from our model (in this case the model is provided by Moose):
It starts to get interesting. The next thing we need are the edges. We know that from each namespace we can obtain the providerNamespaces. Thus, we can create an edge for each such relationship:
Now, to make the drawing more relevant, let us lay out the graph so that the namespaces that are not used by any other are at the top and those that use no other are at the bottom. We simply use a dominanceTreeLayout:
To provide more context to the picture, we might also want to add more information on the nodes by mapping the number of classes, methods and lines of code on the width, height and color respectively:
Executing this code in a Smalltalk workspace reveals a visualization as shown below (provided you have a Moose model already loaded).
There are more visualization and interaction options offered by Mondrian, but we wanted to start with a small painting. Ok, maybe this painting was not that small, but it was rather simple. Once the data was around and we knew what we wanted, creating the view was a few keystrokes away. In fact, these keystrokes are so few that it’s hard to imagine any excuse to continue to leave graph data shapeless.
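To make the idea behind the painting concrete outside of a Moose image, here is a minimal sketch in Python. It is not Mondrian's API and not the script described above: the `Namespace` class, the sample data, and the function names are all hypothetical, and the layout is a plain longest-path layering used as a stand-in for dominanceTreeLayout. It only shows the two mappings involved: dependencies decide the vertical position, and metrics decide width, height, and color.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    # Hypothetical stand-in for a namespace entity from a model.
    name: str
    classes: int
    methods: int
    lines_of_code: int
    providers: list = field(default_factory=list)  # names this namespace depends on


def layer_of(namespaces):
    """Longest-path layering: layer 0 holds namespaces nobody depends on."""
    depth = {ns.name: 0 for ns in namespaces}
    # A graph without cycles settles in at most len(namespaces) passes.
    for _ in range(len(namespaces) + 1):
        changed = False
        for ns in namespaces:
            for provider in ns.providers:
                if depth[provider] < depth[ns.name] + 1:
                    depth[provider] = depth[ns.name] + 1
                    changed = True
        if not changed:
            return depth
    raise ValueError("dependency cycle: the graph cannot be layered")


def shape_of(ns, max_loc):
    """Map classes -> width, methods -> height, lines of code -> gray level."""
    gray = 255 - (255 * ns.lines_of_code) // max_loc if max_loc else 255
    return {"width": ns.classes, "height": ns.methods, "gray": gray}


# Made-up sample data, only for illustration.
system = [
    Namespace("ui", classes=10, methods=80, lines_of_code=1000, providers=["core", "util"]),
    Namespace("core", classes=25, methods=200, lines_of_code=2000, providers=["util"]),
    Namespace("util", classes=5, methods=40, lines_of_code=500),
]
depth = layer_of(system)
max_loc = max(ns.lines_of_code for ns in system)
for ns in sorted(system, key=lambda ns: depth[ns.name]):
    print(depth[ns.name], ns.name, shape_of(ns, max_loc))
```

The sketch stops at computing positions and shapes. Turning them into pixels is exactly the part that a visualization engine takes care of.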
Proper decision making is based on accurate facts. The step of assessing the situation is especially important when we have to deal with large amounts of intricate data. In this case we need the help of a computer to crunch the data and to provide an overview that can answer our questions. But, how do we get to know what to instruct the computer to do?
I recently had the opportunity to work on a prototype to analyze a large system. The system was composed of several hundred subsystems, each being written in Java. However, the overall system was based on a custom engine that was put together via a multitude of custom configurations written in XML and in another custom language. Given the particularities of the project, a dedicated tool was needed to analyze it. The goal of this pilot project was to craft such a dedicated tool that would be able to provide an overview of these configurations and expose their various relationships.
After 10 days of work, a prototype emerged that offered several surprising new perspectives on the system. I created the tool using the infrastructure provided by Moose. It was an interesting experience compressed into a short amount of time, and I would like to summarize the process from the point of view of this particular episode.
The first step in any assessment is to figure out which important questions should be answered from the data. This is never easy, but it is crucial. Without it, your answers might be correct, but they will not necessarily provide any value. However, in many cases people expect that you already know the important questions. This is especially true when you call yourself an assessment expert. But just as an expert in software development masters the technological space but does not necessarily know a priori the exact requirements of a client, an assessment expert only knows how to recognize a relevant question when he sees it, and how to help find a way to provide answers. You will never get the perfect overview, but it is critical that you try to learn the context as much as possible.
The second step of a data assessment is to get your hands on the data. This might look straightforward, but it does not come without pitfalls. A set of pitfalls that is not to be underestimated is human related. The first is the difficulty of actually getting to see the data. Important data is typically guarded. The more important the data and the larger the organization, the more layers of guardians there are. The second difficulty is that even if you have the approval, it might still be hard to get to the actual data. This might happen because the approval still needs to reach the people who know how to provide you with the data, and these people have to be willing to give it. But even if they do want to send you the data, they might not give you the right or the complete data. As the success of the project depends on it, it is in your interest to make sure that all the parties involved understand what is needed, and to double check the correctness of the data.
But I digress. Suppose you do get to see the data. Still, what you will get is raw data. This kind of data is usually not available in the format best suited for analysis, and we typically want to build a model out of it that better accommodates our needs. For this, we need a meta-model that defines the structure of the model.
In theory, designing the meta-model looks straightforward: you look at the data and you recover the structure that best fits your analysis needs. In practice, however, it’s anything but straightforward. To begin with, data is typically full of noise and you have to figure out what part of it is actually interesting for the problem at hand. To learn about the interesting parts, you have to know what the data is good for, and for this you need to know the reasons behind it. This typically implies reading documentation, interviewing people that use or create the data, and triangulating these sources.
Once you understand the raw data, you have to map onto it the questions you want to answer. You arm yourself with questions like “What pieces of data do I need to answer my questions?” and “How do I best correlate these pieces to get my answers?” and you move on to modeling them together with the data. The resulting meta-model can look like the UML diagram to the right, and it serves as a map that helps us navigate through our model. For example, it can document the path to take if we want to identify the dependencies between entities A and B.
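As a rough illustration of a meta-model serving as a map, the following Python sketch stores entity types and their relations in a dictionary and searches for the chain of relations that leads from one type to another. The entity and relation names are hypothetical and do not come from the actual project. The point is only that "the path to take" becomes something you can look up.

```python
from collections import deque

# Hypothetical meta-model: entity type -> {relation name: target entity type}
META_MODEL = {
    "Subsystem": {"configurations": "Configuration", "classes": "JavaClass"},
    "Configuration": {"rules": "Rule"},
    "Rule": {"handler": "JavaClass"},
    "JavaClass": {},
}


def navigation_path(meta_model, source, target):
    """Breadth-first search for the shortest chain of relations from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        entity, path = queue.popleft()
        if entity == target:
            return path
        for relation, neighbour in meta_model[entity].items():
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [relation]))
    return None  # the meta-model offers no route between the two types


print(navigation_path(META_MODEL, "Configuration", "JavaClass"))
```

In this made-up meta-model, a configuration reaches Java classes by following its rules and then their handlers. When no route exists, the function returns `None`.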
Now that we have the meta-model, we need to populate the model with the actual data. For this, we need to process the data in some way. This might involve some form of parsing, mining, queries over a database, or accessing data through some other communication channels.
In my case I needed several parsers to accommodate the various formats, each of these being responsible for populating a part of the model. In particular, I built several XML parsers and one based on PetitParser, a very nice parsing infrastructure developed by Lukas.
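To give a feeling for what such an importer does, here is a minimal Python sketch that reads one XML configuration and records its references in a model. The XML format, the element and attribute names, and the dictionary-based model are assumptions made up for this example. The real importers were written in Smalltalk against the project's own formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration file: one configuration that uses two others.
SAMPLE = """
<configuration name="billing">
  <uses ref="customers"/>
  <uses ref="ledger"/>
</configuration>
"""


def import_configuration(xml_text, model):
    """Add one configuration and the names it references to the model."""
    root = ET.fromstring(xml_text.strip())
    name = root.get("name")
    references = model.setdefault(name, set())
    for uses in root.iter("uses"):
        references.add(uses.get("ref"))
    return model


model = import_configuration(SAMPLE, {})
print(model)
```

Each importer is responsible for one slice of the model, so several of them can fill the same model from different sources.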
Once the model is shaped in the way we want and is populated with the relevant data, we can start toying with questions. There are countless possibilities, each having strengths and weaknesses depending on the problem. We will get into these on another occasion.
In my project I needed to provide an overview first, and then to offer the possibility to learn more details about the interesting parts. Thus, I built a browser that combined queries for detecting high-level dependencies, visualization for displaying them, and interactive navigation for allowing the user to obtain more details. An example of what a dependency map could look like is displayed to the right. For crafting the visualizations I used Mondrian, and for building the overall browser I used Glamour.
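A query for high-level dependencies usually boils down to lifting low-level references to the level of their containers. The Python sketch below shows one way to do that, assuming a hypothetical mapping from configurations to subsystems and a list of configuration-level references. It is an illustration of the idea, not the query used in the project.

```python
from collections import Counter


def subsystem_dependencies(owner_of, references):
    """Aggregate (from_config, to_config) references into subsystem-level edges.

    owner_of maps a configuration name to the subsystem that contains it.
    References inside a single subsystem are ignored.
    """
    counts = Counter()
    for source, target in references:
        from_subsystem, to_subsystem = owner_of[source], owner_of[target]
        if from_subsystem != to_subsystem:
            counts[(from_subsystem, to_subsystem)] += 1
    return counts


# Made-up sample data, only for illustration.
owner_of = {"billing": "Finance", "ledger": "Finance", "customers": "CRM"}
references = [("billing", "customers"), ("billing", "ledger"), ("ledger", "customers")]
overview = subsystem_dependencies(owner_of, references)
print(overview)
```

The counts can serve as edge weights in the overview, while the original references remain available for drilling down into the details.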
Of course, like any tool, this one too needed testing. This was done both via regression tests and via running it together with the user. Letting the user experience the tool and the results of the analysis has at least two benefits. First, it provides intuitive feedback regarding the accuracy and coverage of the analysis. Second, it provides an opportunity to educate him in how to interpret the results of the analysis.
This was a short overview. Now, imagine all these steps carried out in iterations, add several missteps due to incomplete available information, and you get an approximate picture of the overall process.
In conclusion, I would like to stress a couple of points:
Data assessment requires a driving set of questions and access to data.
Specific questions and custom data require specific answers provided by custom tools.
Building a data assessment tool is similar to developing any software project and thus appropriate attention and investment should be allocated to it.
Using a proper infrastructure can significantly decrease the overall cost.
Figuring out what data is important and how to expose it is not straightforward and requires dedicated training.
A good tool does not guarantee magic decisions, but it can provide accurate input for them.
Most of the animations were realized simply by using the "Magic Move" transition effect. While at the beginning I thought the name of the effect was a little boastful, I now find it quite appropriate.
Some nice examples of what can be done with Keynote and its magic moves can be seen on mondaydots.com, where Jeff Monday exploits a simple core idea: use a dot to present any piece of information and animate it to create meaning by offering a context. This simple idea leads to countless possibilities. The movie below illustrates the spirit:
Yet, even if a tool advertises "Magic", the magic never happens by itself. A tool is but an enabler. It is the human that needs to make the magic happen.
Posted by Tudor Girba at 1 February 2010, 11:07 pm
Another year of many TED Talks has passed by. Just like last year, I list here my favorite talks split into several categories. For each category I limited myself to one or two talks. This was a rather hard task, as the talks visibly (and audibly, for that matter) improved over the past years.
Several of this year’s talks came as a pleasant surprise. For this reason I added another category to my list: the surprising one. This year it features two talks:
Note: Both talks surprised me in a very pleasant way, as I did not expect the speakers to deliver in the way they did. I discovered on this occasion that, in a way, I fell into the trap of the single story described so vividly by Chimamanda Adichie (even if the talk of Chimamanda Adichie did not quite make it into my list above, I still felt compelled to mention it somehow).
Posted by Tudor Girba at 29 January 2010, 11:48 pm