Playing around with MoMA data (Part 2: Getting less fancy)
Preamble
So the map was all well and good, but it wasn’t actually that easy to look at carefully. In some ways it was information overload – there are ~195 countries in the world (though somewhat less are represented in the dataset), which is really too much to look at at once. So for this post, I thought we could go backwards and simplify – look at the broadest, easiest ways to look at these data, and then slowly add more complicated views.
So I’m going to go ahead and load a few libraries, and then I’m going to load the data that we already manipulated in the last post (so we don’t have to go through all that meshugganeh again). This is before I merged it with the map shapefiles, so it’s just a dataframe with every piece of art for which we have nationality information (and pieces of art with multiple artist nationalities listed twice), as well as some information about when it was acquired.
Going simple
So the very easiest way to solve the “there are too many countries in the world” problem is to group them by region. We lose a lot of information this way, but it might be easier to see what is going on (and then we can always add more complexity later). I pulled together a crosswalk using UN regions and subregions, which is saved here, and then merged it with my dataset.
After I merge, I want to check that there aren’t any orphan countries, without regions assigned to them:
So it seems like the only problem is Holland.
To be completely honest, though… I’m cheating a bit, because I had a huge number of issues when I first did this merge. Countries are named extremely inconsistently between the datasets, and some countries in the MoMA dataset don’t exist anymore (like Yugoslavia). I fixed all of these issues in the crosswalk itself, but I wanted to show the issue with Holland because it was so silly, and so typical. Holland isn’t merging correctly because in my main dataset the word “Holland” has a space after it. That’s enough to drop more than 3,000 cases that I would have never noticed if I just plotted the data without doing these types of checks. THE MORAL: Keep track of your missing values like you would your firstborn, because they will absolutely wander off in the supermarket while you aren’t paying attention.
Luckily, this is an easy (albeit somewhat manual) fix:
Going simple (TAKE 2)
OK, with that little bit of data manipulation out of the way, let’s make some simple plots.
First, we can just look at the breakdown by region in simple bar chart.
So, no surprise, most of the artworks in the permanent collection are from artists born in Europe and North America, and hardly any are from Africa or Oceania. But our research question was how acquisitions have shifted over time, so we really want to break these up into five year groups (like we did before, in the map). But how are we going to show change over time in such a simple graph? In other words, how can we add another dimension, in addition to region and frequency?
Interdimensional travel
Actually, nothing so complicated (yet). The easiest way to add time to a bar chart is actually just to make multiple bar charts, one for each five year period, and then put them next to each other (this is also often called “small multiples”). There are some ways to do this in ggplot fairly simply, but I actually don’t love the way they look, so I ended up doing this in base R instead.
One of the problems with doing these kinds of multiple bar charts is that they end up having a lot of repeated information in the axes and labels, so I tweaked the code a bit so that the y-axes only show up in the graphs on the left side, and dropped the x-axis labels completely, replacing them with a color legend at the bottom. Just for fun, I grabbed the color palette from this site, which creates palettes based on famous pieces of art. (Of course, there were hardly any artists on that site from countries outside of the US or Europe (and who also used conveniently contrasting colors), so instead I chose Bridget Riley, an English artist I had never heard of before, but whose colors I thought might work nicely.)
So this graph is…OK. But it really only gets across two points. The first is that the vast majority of acquisitions in any five year period are from North Ameria and Europe (which we knew already, but this does make the point fairly starkly, given that pretty much the only colors that even show up are blue and orange). The second is a bit subtler. If we look at how the height of the bars vary, we can also see that the absolute level of acquisitions has some interesting peaks and valleys as well. In the early 1960s, for example, MoMA seems to have gone on an acquisition spree (or been donated a huge collection) - it gets over 12,000 pieces from Europe alone. And looking at the whole chart, it seems like MoMA has been steadily increasing the numbers of pieces it acquires over time - though this is a little bit harder to see.
The ups and downs of the absolute numbers make the relative changes between regions almost impossible to pick out, however. It also forced us to increase the length of the y-axis scale to accomodate the enormous amount of pieces acquired in the early 60s, which ends up putting a ton of white space at the top of the other small graphs (in effect, making them even smaller).
To deal with this, I constructed a similar set of small multiples, but in place of graphing the number of pieces, graphed the proportion from each region instead:
This is slightly more interesting, though not wildly different. We can see a bit more of the shift back towards American artists in the 70s, 80s and 90s, and can also see that there were some acquisitions from artists of Latin American origin in the 30s and 40s, but we are really running into the same problem that we always have with these data – that they are so heavily skewed towards North America and Europe that it is hard to see any kind of relative change in the other regions.
Luckily, we have a solution for this, if we’re willing to sacrifice a bit of interpretability - which is to plot everything on a log scale. Logging the values smushes everything together, which makes changes in the smaller values easier to see; we just have to remember that the distances between regions are really much larger than what the graphs seem to show. Again, the code below is pretty much identical (except for some positioning of the legend), so I’ve stripped out all of the comments:
Now this is much better. Now we can see a lot of the trends we picked out before in our map, and in bar chart form, can actually see them much more clearly. We see that there were some acquisitions from Latin American artists pretty much from the beginning, and that there started to be acquisitions of pieces from artists from Asia starting in the 50s. Meanwhile, acquisitions of pieces by artists from any country in Africa (the yellow bars) really don’t start appearing in any large numbers until the late 1990s.
It’s important to keep in mind looking at this graph that these are exponential differences - in the last chart (for 2010-2015), for example, it looks like there were twice as many pieces acquired from artists of African nationality than artists from countries in Oceania - but in reality, there were 10 times as many (102 vs. 12). And as you get higher up the log scale, the differences become even larger – while the bar for Africa looks like it is just under half of the size of the bar for North America, it represents a more than 70-fold difference (102 vs. 7444). I still think logging is the right solution when looking at trends over time for this kind of skewed data, since you really can see changes with a great deal of detail, but plotting things this way does really present a communication problem that I’m not sure I’ve ever seen a good solution for. If you presented this chart by itself, it would be easy for someone to walk away and think that MoMA now acquires pieces done by artists from all of the world, with only a slight bias towards artists of American or European origin. It’s only when looking at it in concert with the other charts that the full story - both the magnitude of the differences as well as the relative change - gets clearly communicated.
Making things a bit more complicated
Probably we should stop at the small multiples we graphed above - they’re relatively clear, and there’s nothing wrong with bar charts. But because I can’t stop won’t stop, I also wanted to try out some stacked area graphs to see if I could get something with a bit more visual oomph, and which fit into a smaller space. To do so, I used the stackpoly command from the plotrix package, which is awesome and easy to use, and kept everything at a log scale:
When I first saw this graph, I liked it a lot – it seems like you mantain the pattern of relative shifts, but can also see much more clearly that the absolute number of acquisitions has increased over time. It’s harder to see the precise values, but as an overall story, it’s not bad.
The problem is that the story of overall increase is also almost completely wrong. This is because of something mathy - the sum of logged values (which is what we are seeing here), isn’t the same thing as the log of the sums. For example, the total number of pieces acquired in 2015 was 16,401: the log of that number is ~10. But what we see on the graph instead is a value on the log scale closer to 40 (the sum of the logged value for each region). Ugh. Math. If we were to actually plot it correctly, we’d see that there has been a bit of an increase, but nothing drastic (the peak is actually in the 1960s). Ah, the magic of the log scale.
So we may need to forget about trying to look at absolute levels, and instead just focus in on relative shifts - in other words, just look at distributions over time. I’m actually not sure what I did next works out mathematically (I’m actually fairly sure it doesn’t), but the result was so pretty that I thought it was worth including here. What I did was take the logged values for each 5-year period, and standardized them so that the sum of the logs always add up to the same number (10 in this case); what I ended up with was this piece of art:
Which coincidentally, ends up looking like some of Bridgette Riley’s pieces