The story of yesterday's weather

This story has to do with my view on what innovation is and how it happens. It also has to do with getting old and having memories to share.

Back in 2002, when I embarked on the PhD journey, I was exposed to a couple of topics. The one that captured my imagination was software evolution analysis. I did not know anything about it, but analyzing time sounded highly philosophical. And it was.

As expected when jumping into a new field, I started by reading papers. After a short while I set out to implement one of the analyses I encountered. It was about determining the trend of a metric over time. I started typing, but I quickly realized that I was missing a concept to attach time-related behavior to. I then went back to searching for papers that would describe what to do, but I could not really find any. So, I built it myself and I called it the history. It turned out that by introducing this simple concept, many analyses became easier to write, and the idea of modeling history as a first-class entity became the very center of my thesis.

During the same time, I stumbled across a paper that claimed that one heuristic to follow when understanding a software system is to start from the places that were modified last, the assumption being that these parts will be among those that will need changing next. This bugged me for a while. Somehow I felt like not quite agreeing (there is a story behind how I came to realize that I did not agree with the statement, but we will leave it for another time). I was a novice and I could afford to be ignorant.

Thus, I proceeded with implementing an algorithm that would compute how often this heuristic actually holds. I named it Yesterday’s Weather. I managed to dig up one of the cleaned-up versions of my original implementation. It is listed below.

Please, do not read it in detail. Just look at it, and continue reading what comes. I promise there is a point.

yesterdayWeatherProbabilityWithTopPreviousWENM: topPreviousWENM
        andTopCurrentENM: topCurrentENM
    | currentVersions previousClassHistoriesSortedByWENM
      yesterdayWeatherHits last2VersionsTopHistories last2Versions
      last2HistoriesSortedByENM x valuesCount
      previousVersionsTopHistories previousVersionsTopHistoriesNames
      over |
    currentVersions := OrderedCollection new.
    currentVersions addLast: (self allVersionNames at: 1).

    yesterdayWeatherHits := 0.

    (2 to: self allVersionNames size) do: [ :i |
        self smelly: 'this algorithm is too big and complex'.

        previousClassHistoriesSortedByWENM := (self classHistories
            selectFromReferenceVersionCollection: currentVersions)
            sortBy: [:a :b | a value getWENM >= b value getWENM].
        currentVersions addLast: (self allVersionNames at: i).

        previousVersionsTopHistories := OrderedCollection new.

        x := previousClassHistoriesSortedByWENM first value getWENM.
        valuesCount := 0.

        previousClassHistoriesSortedByWENM do: [ :each |
            (each value getWENM ~= x) ifTrue: [
                valuesCount := valuesCount + 1. x := each value getWENM].
            (valuesCount < topPreviousWENM) ifTrue: [
                previousVersionsTopHistories addLast: each]
        ].

        last2VersionsTopHistories := OrderedCollection new.

        last2Versions := OrderedCollection new.
        last2Versions addLast: (self allVersionNames at: (i-1)).
        last2Versions addLast: (self allVersionNames at: i).
        last2HistoriesSortedByENM := (self classHistories
            selectFromReferenceVersionCollection: last2Versions)
            sortBy: [:a :b | a value getWENM >= b value getWENM].

        x := last2HistoriesSortedByENM first value getENM.
        valuesCount := 0.

        last2HistoriesSortedByENM do: [ :each |
            (each value getENM ~= x) ifTrue: [
                valuesCount := valuesCount + 1. x := each value getENM].
            (valuesCount < topCurrentENM) ifTrue: [
                last2VersionsTopHistories addLast: each]
        ].

        previousVersionsTopHistoriesNames :=
            previousVersionsTopHistories collect: [ :each |
                each value name].
        over := false.

        last2VersionsTopHistories do: [:each |
            ((previousVersionsTopHistoriesNames includes:
                (each value name)) and: [over not]) ifTrue: [
                    yesterdayWeatherHits := yesterdayWeatherHits + 1.
                    over := true]
        ]
    ].

    ^yesterdayWeatherHits/(self size - 1) asFloat.

I applied this algorithm to several systems, and it turned out I was right to doubt the heuristic. It only made sense in some situations. I was thrilled.

Then multiple things happened in a short amount of time.

First, I was highly annoyed by the look of the code. Look at it. Even if it is written in Smalltalk, and Smalltalk is a beautiful language, the code looks terrible. I tried to refactor it, but somehow it did not get better.

Second, I set out to write a paper about it, when my professors came to me and said: "You have to formalize it." "What do you mean?" I asked. "This is an algorithm, so you should be able to explain it to us mathematically," they said. I thought this was unnecessary, in particular given that I was always afraid of strange notations. After all, the code is a formalization in itself, and it does what it is supposed to do.

Finally, I had to give an internal presentation about it. I tried to explain what the algorithm does, but nobody seemed to quite get it. This is what bugged me the most. If they do not get it, why should my other peers?

So, I went back to the drawing board. Not to the code editor. To the drawing board. My target was to try to figure out a way to explain this piece to an outsider by using metaphors and visuals.

And so I did. Yesterday’s Weather boils down to picking a point in the past, and comparing what happened before that point with what happened after that point. More precisely, since the original heuristic says that the parts that changed recently are relevant for the near future, we need to check whether the parts that were changed just before the chosen point in time also got changed soon after it.

Graphically, it looks like this.


If we step back, what I needed was a simple intersection of two sets: the changed parts from the past and those from the future. So, in the end, the pseudocode looked like this:

 past := all.topChanged(beginning, present)
 future := all.topChanged(present, end)
 past.intersectWith(future)

That’s it. If you apply this algorithm for each version and average the results, you get a percentage that reflects how relevant Yesterday’s Weather is as a heuristic for predicting future changes. You can find more details about the heuristic in the paper. And, yes, it turned out that I was right to doubt the heuristic because it does not necessarily apply to all systems.
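
The same idea can be sketched in a few lines of Python (not the Smalltalk I used, and not my actual implementation). Everything here is an assumption made for illustration: the names top_changed and yesterdays_weather are hypothetical, each version is modeled as a dictionary from part name to the amount it changed, and "top changed" is a plain total, while the real metrics weight recent and early changes differently.

```python
from typing import Dict, List, Set


def top_changed(changes: List[Dict[str, int]], start: int, end: int, top: int) -> Set[str]:
    """Names of the `top` most-changed parts over changes[start:end].

    Hypothetical helper: ranks parts by their plain total amount of change.
    """
    totals: Dict[str, int] = {}
    for version in changes[start:end]:
        for name, amount in version.items():
            totals[name] = totals.get(name, 0) + amount
    # Keep only parts that changed at all; break ties by name for determinism.
    ranked = sorted((n for n, a in totals.items() if a > 0),
                    key=lambda n: (-totals[n], n))
    return set(ranked[:top])


def yesterdays_weather(changes: List[Dict[str, int]], top: int = 1) -> float:
    """Fraction of 'present' points where past and future top parts overlap."""
    if len(changes) < 2:
        raise ValueError("need at least two versions to compare past and future")
    hits = 0
    checks = 0
    for present in range(1, len(changes)):
        past = top_changed(changes, 0, present, top)
        future = top_changed(changes, present, len(changes), top)
        checks += 1
        if past & future:  # non-empty intersection counts as a hit
            hits += 1
    return hits / checks


# Hypothetical change data: part A is active early, part B takes over later.
changes = [{"A": 3}, {"A": 4, "B": 1}, {"A": 1, "B": 1}, {"B": 2}]
print(yesterdays_weather(changes, top=1))  # 1 hit out of 3 checks
```

With top=1 the heuristic holds for only one of the three points in this made-up history. Widening the selection to top=2 makes it hold everywhere, which is one reason the thresholds matter when interpreting the result.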

Getting back to our story, armed with my new idea, I rewrote my code. It looked like this:

 detailedYWFor: yesterdayCheck for: tomorrowCheck
     ^ (3 to: self versions size) collect: [ :i |
         | yesterday tomorrow |
         yesterday := self
             selectByExpression: yesterdayCheck
             appliedFromVersionIndex: 1
             toVersionIndexAndPresentInIt: i - 1.
         tomorrow := self
             selectByExpression: tomorrowCheck
             appliedFromVersionIndexAndPresentInIt: i - 1
             toVersionIndex: self versions size.
         yesterday intersectWith: tomorrow ]

 yWFor: yesterdayCheck for: tomorrowCheck
     | dailyYW hits |
     dailyYW := self detailedYWFor: yesterdayCheck for: tomorrowCheck.
     hits := dailyYW sum: [ :each |
         each isEmpty
             ifTrue: [0]
             ifFalse: [1]].
     ^ hits / (self versions size - 2)

It also turns out that expressing this using a mathematical notation is straightforward, too.
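
One way to write it down, as a sketch that mirrors the code above rather than the exact notation of the paper: assume $n$ versions, let $Y_i$ be the set selected by yesterdayCheck over versions $1$ to $i-1$, and let $T_i$ be the set selected by tomorrowCheck over versions $i-1$ to $n$. The symbols $Y_i$, $T_i$, and $YW_i$ are introduced here only for illustration.

```latex
YW_i =
\begin{cases}
1 & \text{if } Y_i \cap T_i \neq \emptyset \\
0 & \text{otherwise}
\end{cases}
\qquad
YW = \frac{1}{n - 2} \sum_{i=3}^{n} YW_i
```

The denominator $n - 2$ matches the number of points checked, since $i$ runs from $3$ to $n$.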


In retrospect, it all seems straightforward. However, it was not as easy as it appears. To get this simple code to work, I had to extend the model with a couple of basic concepts. In the end, this proved to have a conceptual impact as it made the model highly generic. If you are interested in more details, you can read my PhD work.

As exciting as the technical details might be, the essence of the story is not of technical nature. The content was there all along. I could get the computer to compute what I wanted, although in a convoluted way. The main problem was that I could not explain it to other people. It was because of this that I could find what I did not understand. That is right. My understanding was incomplete and I could not explain my intuition. I had to go and rethink what I was doing. Once my ideas got clearer, the explanation got cleaner. In the end, I could explain Yesterday’s Weather in less than 5 minutes. 5 minutes.

My lesson:

Form without content is worthless. Content without form is pointless.

Why pointless? Because without form nobody understands the content, and thus, the package as a whole has no value.

Form is important, but it is often addressed at the end of the process. That’s too late because you will not get it right the first time. You need to test and iterate. And, here is the magic: To test your idea you need the form, and when the form gets clean, the content becomes simple. The act of creation must address both content and form at the same time. This, to me, is at the core of the innovation process.

Posted by Tudor Girba at 9 June 2012, 10:11 pm with tags representation, presentation, research