Data is only available about the past

“Data is only available about the past.”

– Clayton Christensen (link)

This is an obvious but fundamental limitation that we should not lose sight of! Although we typically want to predict the future, all of our hard data comes from the past.

One way to deal with this is to assume that what was true in the past will still be true in the future. For example: “If the weather is hot today, it will likely be hot tomorrow.” Of course, the farther into the future we go, the more likely it is that something will change. But the continuity assumption allows us to pretend that the past data also reflects the future. And in many cases, this turns out to be accurate.
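That continuity assumption is sometimes called a persistence forecast: predict that the next value will simply equal the last observed one. A minimal sketch in Python (the temperature series here is invented for illustration):

```python
# A "persistence" forecast: assume the most recent observation
# will repeat tomorrow. The sample daily highs below are made up.
temps = [92, 95, 84, 86, 85, 88]  # oldest first

def persistence_forecast(history):
    """Predict the next value as a copy of the last observed value."""
    return history[-1]

# Evaluate: "predict" each day from the data available the day before,
# then measure how far off the prediction was.
errors = [abs(temps[i] - persistence_forecast(temps[:i]))
          for i in range(1, len(temps))]
mean_error = sum(errors) / len(errors)
```

Simple as it is, this kind of forecast is a common baseline: a fancier model is only interesting if it beats persistence.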

The other way to deal with this fundamental limitation is to use the data to form theories of correlation and causality. This is what the scientific method is all about. It allows us to generalize from the specific data and say “any time this happens, this other thing will happen.” For example: “this configuration of high pressure systems will cause the temperature tomorrow to fall.”
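To make the "theories of correlation" idea concrete, here is a small Python sketch that checks whether one past measurement moves with another. The numbers are invented for illustration, and the Pearson helper is my own, not real meteorology:

```python
# Test a correlational theory against past data: does higher pressure
# today go with a temperature drop tomorrow? All numbers are invented.
pressure = [1022, 1008, 1015, 1030, 1005]   # daily pressure readings
temp_change = [-3, 2, -1, -4, 3]            # next-day temperature change

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(pressure, temp_change)  # strongly negative in this toy data
```

A strong correlation in past data is what lets us generalize, though it still takes a causal theory (and more evidence) to say why the two move together.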

In the former approach, the data analyst is very interested in specifics, such as outliers and numeric values.

  • “What is the temperature?”
  • “Are there any problems we need to fix?”
  • “Where are our best successes?”

In the latter approach, the analyst is more interested in correlations, patterns, differences and trends.

  • “What time of year is it usually hot?”
  • “What causes our successes?”
  • “What leads to problems?”

It seems likely that different data analysis tools are optimal for these different types of questions.

Racism is just a theory

Why is it so easy to jump to the conclusion that racists are bad people?

Isn’t that conclusion almost as narrow-minded as racism itself?

I just listened to an episode of This American Life about a white supremacist speech writer who later changed his whole persona and published a best-selling book about tolerance for others and respect for nature. The radio show asserts that this writer “pulled a 180” – completely changed course.

I don’t see it that way at all. As truly terrible as racism is, the evidence is that racists tend to have just as good intentions as everyone else. The white supremacist speech writer believed that blacks and Jews were the cause of many social and political problems. It followed that the way to improve society and work for good was to promote segregation and white supremacy. The theory turned out to be wrong, but the underlying goal was simply to fix society’s problems, not to cause harm.

An old friend of this writer, who is also a white supremacist and southern conservative, was quoted on the show saying the book was not a change of course at all. To him, the book is all about the problems with big government and the importance of honoring the natural order of things.

The way I see it, the speech writer did eventually realize that the white supremacy theory was wrong. But this wasn’t a change to his underlying values. It was merely a change to one of the multitude of theories he held – such as “things fall when you drop them” and “people enjoy receiving gifts”. However, there was so much tied up in this supremacy theory, socially and politically, that he felt the need to pretend to be a new person entirely.

Why is it so hard to believe that people can update their theories? If you have any doubt, just listen to the Silver Dollar episode of Love+Radio, where a black man befriends dozens of Ku Klux Klan members and gently, lovingly disproves their theory that black people are the problem. Through this process, many Klan leaders updated their theories, and as a result, entire branches of the Klan were quietly dismantled.

No one wants to be wrong! And very few want to be a bad person. If you treat people with the assumption that they are good, they will tend to prove you right. You just need to provide a graceful way to be wrong, so that everyone has the chance to reconsider and update their theories.

Metadata Visualization

There are at least two ways of interpreting a table of data.

Date       Temperature   Humidity
June 18    92            57
June 19    95            NULL
June 20    84            51

The first interpretation treats the table as a collection of facts about the world. For example, on June 18 the temperature was 92 degrees and the humidity was 57%. On June 19, the temperature was 95 degrees and humidity was unknown.

The second interpretation treats the table as a literal list of data points. For example, on June 18, someone recorded the temperature at 92 degrees and the humidity at 57%. On June 19, the humidity sensor was broken. The data is stored in a table with three columns. Before June 18, the data was being recorded in a different table.

In other words, we can focus on what the data says about the world, or we can focus on the data itself.

We can think of the data ephemerally as information, or we can think of it as a physical thing that exists in and of itself.

This is analogous to written language: a sentence or paragraph generally means something, but it also exists as physical letters and punctuation on the page.

The second interpretation is often called metadata: data about the data. How was it collected, by whom, for what purpose, and where and how is it stored? How accurate is it likely to be?

If we are very confident about the accuracy and relevance of the data, we can summarize and visualize it cleanly. We could show a line chart of temperature over time and start to draw conclusions about what the temperature trend means.

But if the accuracy and relevance are unknown, we need to take steps to better understand the metadata. How much data is there? Which parts are missing, or appear to be duplicated? Where did it come from? What metrics are most relevant?
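As a sketch of what those metadata checks might look like in a tool, here is a minimal Python profile of the table above (plain dictionaries, with None standing in for NULL; the field names and the shape of the report are my own invention):

```python
# Profile a small dataset before visualizing it: count rows,
# missing values per column, and duplicate dates. None stands in for NULL.
rows = [
    {"date": "June 18", "temperature": 92, "humidity": 57},
    {"date": "June 19", "temperature": 95, "humidity": None},
    {"date": "June 20", "temperature": 84, "humidity": 51},
]

def profile(rows):
    """Return basic metadata: row count, missing values, duplicate dates."""
    columns = rows[0].keys()
    missing = {c: sum(1 for r in rows if r[c] is None) for c in columns}
    dates = [r["date"] for r in rows]
    duplicates = len(dates) - len(set(dates))
    return {"rows": len(rows), "missing": missing, "duplicate_dates": duplicates}

report = profile(rows)
```

A report like this answers the metadata questions first, so that any chart drawn afterward comes with a sense of how much to trust it.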

Suppose the default behavior of a data analysis tool is to ingest your data and take you directly to a clean line chart. Is that convenient or misleading? Does that clean line chart imply that you are looking at truth, when in fact you may just be looking at data?

Can we assume that the line chart is about temperature, or should we emphasize that it shows data about temperature? What is the best way to communicate that distinction?


Swift

Apple announced a new programming language called Swift earlier this week at WWDC 2014. The focus during the keynote was ease of use, and indeed the language is incredibly exciting as a learning tool. But this is not a simplistic language. It is extremely powerful, extremely well crafted, and designed to replace Objective-C for professional software development. In many ways it feels like the next evolution in the line of C, C++, Obj-C, C#… but they ran out of “C” modifiers and instead called it Swift.

Swift can be easily adopted by software companies because it is interoperable with most existing code written in C, C++, and Objective-C. You don’t have to rewrite your app from scratch just to get started.

The developer tools team is also shipping a live coding environment inspired by Bret Victor. This is truly exciting to see, and I suspect they are only just getting started. This environment is not only useful for beginners; it will also change the way professional programming is done: instead of building and debugging entire apps, developers can prototype, explore, and debug individual modules interactively in the “playground”. The documentation also lives in this environment, so you can play with example code and see the results in real time.

I have a lot more to learn about Swift, but my initial impressions are that it has achieved the high praise of “obvious only in retrospect.” I suspect it will significantly influence the software community.