Vincent Gable’s Blog

November 11, 2009

Just Look at it, Man!

Filed under: Bug Bite, Programming
― Vincent Gable on November 11, 2009

You’re looking at Anscombe’s quartet: 4 datasets with identical simple statistical properties (mean, variance, correlation, linear regression); but obvious differences when graphed.

[Image: Anscombe’s quartet, the four datasets graphed as scatter plots]

(via Best of Wikipedia)
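The "identical statistics" claim is easy to check numerically. The following is a minimal Python sketch, not part of the original post, that computes the mean, variance, correlation, and least-squares line for each dataset. The data values are the published ones for Anscombe's quartet; the names `QUARTET` and `summarize` are made up for this illustration.

```python
from statistics import mean, variance

# Anscombe's quartet. Datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the simple summary statistics for one dataset."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # least-squares slope
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(xs),             # sample variance
        "var_y": variance(ys),
        "corr": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four rows should come out nearly the same: a mean of 9 for x and about 7.5 for y, a correlation of roughly 0.82, and a fitted line close to y = 3 + 0.5x. Nothing in those numbers hints at how different the four plots look.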

Graphs aren’t a substitute for numerical analysis. Graphs are not a panacea. But they’re excellent for discovering patterns, spotting outliers, and getting intuition about a dataset. If you never graph your data, then you’ve never really looked at it.

War Story

I was working on optimizing color correction, using SSE (high performance x86 instructions). One operation required division — an expensive operation for a computer. The hardware had a divide instruction, but sometimes using the Newton-Raphson method to do the division in software is faster. You never know until you measure.
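For readers who haven't seen it, division by Newton-Raphson usually means refining an estimate of the reciprocal 1/d and then multiplying. Here is a minimal Python sketch of that idea. It is only an illustration, not the SSE code from this story; the function names and the starting guess are assumptions. On SSE the starting estimate would typically come from an approximate-reciprocal instruction such as RCPPS, whereas here it is simply passed in.

```python
def newton_reciprocal(d, x0, steps=2):
    """Refine an estimate x0 of 1/d using Newton-Raphson.

    Each step applies x <- x * (2 - d * x), which roughly squares
    the relative error, so a decent guess converges very quickly.
    """
    x = x0
    for _ in range(steps):
        x = x * (2.0 - d * x)
    return x


def newton_divide(a, d, x0, steps=2):
    """Approximate a / d as a * (1/d), with 1/d found by Newton-Raphson."""
    return a * newton_reciprocal(d, x0, steps)


# Example: divide 1 by 3, starting from the rough guess 0.3 for 1/3.
for steps in range(5):
    print(steps, newton_divide(1.0, 3.0, 0.3, steps))
```

The appeal is that each step uses only multiplies and subtracts, which are cheap. The catch is that the result is an approximation whose error depends on the starting guess and the number of steps, so it need not round the same way a hardware divide does.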

While doing the measurement, I somehow got the crazy idea to try both: I’d already unrolled the inner loop so instead of repeating the divide or Newton’s Method twice, I’d do a divide and then use Newton’s Method for the next value. Strangely enough, this was faster on the hardware I was benchmarking than either method individually. Modern hardware is a complex and scary beast.

I was fortunate enough to have a suite of very good unit tests to run against my optimized code. But there was a caveat to testing correctness. Because computers don’t have infinitely precise arithmetic, two correct algorithms might give different answers — but if the numbers they gave were close enough to the infinitely precise answer (say a couple ulps apart) it was good enough. (We can only be exact within some Tolerance!) The tests cleared my hybrid divide/Newton-Raphson function: but we couldn’t use it, because it was fundamentally broken.
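A tolerance check of that kind can be sketched in a few lines of Python. This is an assumed illustration of an "within a couple of ulps" comparison for single-precision floats, not the actual test suite from the story; `ulp_distance` and `within_ulps` are hypothetical names.

```python
import struct


def _ordered_bits(x):
    """Map a float32 value to an integer that preserves numeric ordering."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    if bits & 0x80000000:             # negative: mirror below zero
        return -(bits & 0x7FFFFFFF)
    return bits


def ulp_distance(a, b):
    """Number of representable float32 values between a and b."""
    return abs(_ordered_bits(a) - _ordered_bits(b))


def within_ulps(a, b, max_ulps=2):
    """True if a and b are at most max_ulps float32 steps apart."""
    return ulp_distance(a, b) <= max_ulps


one = 1.0
next_up = 1.0 + 2.0 ** -23            # next float32 above 1.0
print(ulp_distance(one, next_up))
print(within_ulps(one, 1.0 + 2.0 ** -20))
```

Notice what such a test looks at: one value at a time. It says nothing about how the errors are arranged across neighboring values.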

Even though the error was acceptably small, it had a nasty distribution. The hardware divide gave color values that were a bit too light. The Newton-Raphson divide in software gave values that were a bit too dark. Individually these errors were fine. Randomly spread over the image they would have been fine. But processing every other pixel differently had the effect of adding alternating light/dark stripes! We see contrast, not absolute color, so the numerically insignificant error was quite visible. Worse still, bands of 1-pixel stripes combined to form a shimmering Moiré pattern. It was totally busted. Unusable.
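The effect can be reproduced with a toy model. The Python sketch below is an assumption-laden illustration, not the real color-correction code: the error size `EPS`, the tolerance, and the flat gray test row are all invented. It shows that every pixel passes a per-pixel tolerance check while the errors of neighboring pixels are almost perfectly anti-correlated, which is exactly what a stripe is.

```python
TOLERANCE = 0.002   # hypothetical per-pixel error tolerance
EPS = 0.001         # hypothetical bias of each division method


def hybrid_row(true_value, width):
    """Even pixels come out slightly too light, odd pixels slightly too dark."""
    return [true_value + EPS if i % 2 == 0 else true_value - EPS
            for i in range(width)]


def passes_tolerance(row, true_value):
    """The kind of check a per-value unit test performs."""
    return all(abs(pixel - true_value) <= TOLERANCE for pixel in row)


def lag1_autocorrelation(errors):
    """Correlation between each error and its right-hand neighbor."""
    center = sum(errors) / len(errors)
    centered = [e - center for e in errors]
    numerator = sum(a * b for a, b in zip(centered, centered[1:]))
    denominator = sum(c * c for c in centered)
    return numerator / denominator


gray = 0.5
row = hybrid_row(gray, 1000)
errors = [pixel - gray for pixel in row]

print("passes per-pixel tolerance:", passes_tolerance(row, gray))
print("lag-1 autocorrelation of error:", lag1_autocorrelation(errors))
```

An autocorrelation near -1 means the error flips sign at every pixel. Randomly scattered errors of the same size would give a value near 0 and pass the same tolerance check without producing visible stripes.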

This was all immediately obvious when the results of the color correction were “graphed”. Actually looking at the answer caught a subtle error that our suite of unit tests missed.

To be clear, more subjective graphical analysis is not a substitute for numerical analysis and data mining. But I believe in actually looking at your data at least once. A graph is a kind of end-to-end visualization of everything, and that has value. Graphs are a cheap sanity check — does everything look right? And sometimes, they can give you real insight into a problem.

May 9, 2009

Words Lie More Than Statistics

Filed under: Quotes
― Vincent Gable on May 9, 2009

Increasingly, it seems, people throw up their hands, say “graphs and statistics are all lies anyway!” and never deeply examine quantitative information. And that’s part of the reason why I can’t recommend The Visual Display of Quantitative Information enough.

For many people the first word that comes to mind when they think about statistical charts is “lie.” No doubt some graphics do distort the underlying data, making it hard for the viewer to learn the truth. But data graphics are no different from words in this regard, for any means of communication can be used to deceive. There is no reason to believe that graphics are especially vulnerable to exploitation by liars; in fact, most of us have pretty good graphical lie detectors that help us see right through frauds.

Much of twentieth-century thinking about statistical graphics has been preoccupied with the question of how some amateurish chart might fool a naive viewer. Other important issues, such as the use of graphics for serious data analysis, were largely ignored. At the core of the preoccupation with deceptive graphics was the assumption that data graphics were mainly devices for showing the obvious to the ignorant. It is hard to imagine any doctrine more likely to stifle intellectual progress in a field…

–Edward Tufte, The Visual Display of Quantitative Information, page 53.

You will be lied to more often, and more subtly, with words than with figures. But unlike empty words, the data behind a chart is verifiable, and can be objectively redrawn. As I see it, quantitative analysis is our best chance to reason truthfully and without ego. Sometimes infographics are a better tool than words, especially for summarizing large datasets objectively. If anything, we should be scared when there aren’t graphs and statistics.

April 1, 2009

Microsoft Excel Does Not Excel at Graphing

Filed under: Design, Quotes, Usability
― Vincent Gable on April 1, 2009

I gripe about Excel a lot, as we’re more or less forced to use it for data analysis in the intro labs (students who have taken the intro engineering course supposedly are taught how to work with Excel, and it’s kind of difficult to buy a computer without it these days, so it eliminates the “I couldn’t do anything with the data” excuse for not doing lab reports). This is a constant source of irritation, as the default settings are carefully chosen so as to make it difficult for students to do a good job of data presentation.

Now, you might be saying “Well, of course Excel isn’t appropriate for scientific data analysis. It’s not really for scientists, though.” Which is true, but here’s the thing: the things I’ve complained about here aren’t good for anything. The color schemes and axis settings lead to illegible plots no matter what sort of data you’re working with. And I’m completely at a loss as to the purpose of the “Line” plot, or making it difficult to find uncertainties in fitted quantities.

Professor Chad Orzel, Why Does Excel Suck So Much?

There’s no question in my mind that a lot of serious analysis is done in (spite of) Excel. I’ve worked with some very smart programmers, with PhDs in experimental science, who have “numerics” in their job description, and who used Excel to make quick graphs.

The best solution I can recommend is reading The Visual Display of Quantitative Information. It’s probably the best guide to honestly presenting data graphically.

Unfortunately, I don’t have a good recommendation for a better software program. The excellent redrawings at the chartjunk blog were done in Adobe Illustrator (more info in this comment). But Illustrator costs $599, and is a complex drawing program. Honestly, the sticker price and the complexity have kept me from trying it.

What do you use to draw graphs?
