Playing around with MoMA data (Part 2: Getting less fancy)

Preamble
So the map was all well and good, but it wasn’t actually that easy to look at carefully. In some ways it was information overload – there are ~195 countries in the world (though somewhat less are represented in the dataset), which is really too much to look at at once. So for this post, I thought we could go backwards and simplify – look at the broadest, easiest ways to look at these data, and then slowly add more complicated views.

So I’m going to go ahead and load a few libraries, and then I’m going to load the data that we already manipulated in the last post (so we don’t have to go through all that meshugganeh again). This is before I merged it with the map shapefiles, so it’s just a dataframe with every piece of art for which we have nationality information (and pieces of art with multiple artist nationalities listed twice), as well as some information about when it was acquired.

########################
# SET-UP
#######################
library(plyr)
library(ggplot2)
library(scales)
library(tidyr)
library(gridExtra)
library(plotrix)
load("/Users/isaacmaddowzimet/imaddowzimet.github.io/CleanedMoMA.Rda")

Going simple
So the very easiest way to solve the “there are too many countries in the world” problem is to group them by region. We lose a lot of information this way, but it might be easier to see what is going on (and then we can always add more complexity later). I pulled together a crosswalk using UN regions and subregions, which is saved here, and then merged it with my dataset.

regions       <- read.csv("/Users/isaacmaddowzimet/imaddowzimet.github.io/datamisc/ContinentCrosswalk.csv")
dfwithregions <- join(dfFinal, regions, by='Country', type='left')

After I merge, I want to check that there aren’t any orphan countries, without regions assigned to them:

orphans <- table(dfwithregions[is.na(dfwithregions$Region)          # Tabulate if region is missing
                               & !is.na(dfwithregions$Country),     # and country is not missing 
                               c("Country")])
orphans[orphans>0]                                                  # show table (and suppress the empty cells)
## Holland  
##     3236

So it seems like the only problem is Holland.

To be completely honest, though… I’m cheating a bit, because I had a huge number of issues when I first did this merge. Countries are named extremely inconsistently between the datasets, and some countries in the MoMA dataset don’t exist anymore (like Yugoslavia). I fixed all of these issues in the crosswalk itself, but I wanted to show the issue with Holland because it was so silly, and so typical. Holland isn’t merging correctly because in my main dataset the word “Holland” has a space after it. That’s enough to drop more than 3,000 cases that I would have never noticed if I just plotted the data without doing these types of checks. THE MORAL: Keep track of your missing values like you would your firstborn, because they will absolutely wander off in the supermarket while you aren’t paying attention.

Luckily, this is an easy (albeit somewhat manual) fix:

dfwithregions$Region[dfwithregions$Country=="Holland "] <- "EUROPE"
dfwithregions$Subregion[dfwithregions$Country=="Holland "] <- "Western Europe"

Going simple (TAKE 2)

OK, with that little bit of data manipulation out of the way, let’s make some simple plots.

First, we can just look at the breakdown by region in simple bar chart.

regiontable             <- table(dfwithregions$Region) # Make table of frequencies
regiontabledf           <- as.data.frame(sort(regiontable, decreasing = T))          # convert to data frame
colnames(regiontabledf) <- c("Region", "Freq")
ggplot(data=regiontabledf, 
       aes(x=Region, y=Freq)) +
       geom_bar(stat='identity', fill='indianred1') +
       theme_bw()+
       scale_y_continuous(name="Number of artworks", labels = comma)+
       ggtitle("Number of artworks in MoMA permanent collection by nationality of artist, grouped by region")+
       theme(plot.title = element_text(hjust = 0.5, size=10))

plot of chunk unnamed-chunk-5

So, no surprise, most of the artworks in the permanent collection are from artists born in Europe and North America, and hardly any are from Africa or Oceania. But our research question was how acquisitions have shifted over time, so we really want to break these up into five year groups (like we did before, in the map). But how are we going to show change over time in such a simple graph? In other words, how can we add another dimension, in addition to region and frequency?

Interdimensional travel
Actually, nothing so complicated (yet). The easiest way to add time to a bar chart is actually just to make multiple bar charts, one for each five year period, and then put them next to each other (this is also often called “small multiples”). There are some ways to do this in ggplot fairly simply, but I actually don’t love the way they look, so I ended up doing this in base R instead.

One of the problems with doing these kinds of multiple bar charts is that they end up having a lot of repeated information in the axes and labels, so I tweaked the code a bit so that the y-axes only show up in the graphs on the left side, and dropped the x-axis labels completely, replacing them with a color legend at the bottom. Just for fun, I grabbed the color palette from this site, which creates palettes based on famous pieces of art. (Of course, there were hardly any artists on that site from countries outside of the US or Europe (and who also used conveniently contrasting colors), so instead I chose Bridget Riley, an English artist I had never heard of before, but whose colors I thought might work nicely.)

par(mfrow=c(5,4), mar=c(.2, .2, 1, .2), oma=c(1,2,2,0), family='serif') # I have to futz with the parameters to get the graphs in columns and rows nicely spaced, and to put some white space around the plot area 
 
# Define "Bridgette Riley"" color palette
riley <- c("#67C1EC", "#DE6736", "#7BBC53", "#FAB9AC", "#E6B90D",  "lightgrey" ) # These are just hexadecimal codes for the colors
 
# I set the order of countries manually this time (instead of sorting by frequency), because we want them to stay consistent across the graphs 
dfwithregions$Region<- factor(dfwithregions$Region, 
                              levels(dfwithregions$Region)[c(5, 3, 4, 2, 1, 6)])  
 
# And again split into datasets for each five year period
subsets<-split(dfwithregions, dfwithregions$fiveyear, drop=TRUE) 
 
# And then graph! 17 times...
for (i in 1:17) {
  regiontable             <- table(subsets[[i]]$Region)                                  # Make table of frequencies
  barplot(regiontable, xaxt='n', ylim=c(0,13000), col=riley                              
        , yaxt= if ((i+3)%%4!=0) 'n', yaxp=c(0, 12000, 3))                             # this line a bit tricky, but it's a way to make R only draw y axes for   the graphs in column 1. It does this with a bit of math - if the iteration we are on +3 is not evenly divisible by 4, we hide the y-axis)
  title(paste(1925+(i*5), "-", 1930+(i*5), sep=""), line=-1.5)                           # Add title to each graph (we use a bit more math to get a range, and iterate through the titles)
  axis(side=1, at=c(0, 7.5), labels=FALSE)                                              # the x-axis tick at 7.5 won't appear since it is out of range, but I learned the hard way that if there's no upper tick the axis just dissapears
}
 
# All this below is just to get the legend positioned correctly, particularly the horizontal spacing
  # First line of legend
legendtexttop=c("North Am.", "Eur.", "LAC")
xcoords <- c(0, 4, 6.5)                                       # This determines the spacing between the legend items
secondvectortop <- (1:length(legendtexttop))-1                # But you need to scale them by the length of the text 
textwidthstop <- xcoords/secondvectortop                      # (not 100% sure why this works, but found the code online)
textwidthstop[1] <- 0                                         # We set the first text width to 0 (see parenthetical above)
legend(10,10000, legendtexttop, fill=riley[1:3], xpd=NA,      # (xpd=NA lets us put the legend outside of the plot area, in the outer margins)
       horiz=TRUE, cex=2.2, text.width = textwidthstop, bty='n')
 
  # Second line of legend
legendtextbottom=c("Asia", "Africa", "Oceania")
legend(10,6000, legendtextbottom, fill=riley[4:6], xpd=NA,   # xpd=NA lets us put the legend outside of the plot area, in the outer margins
       horiz=TRUE, cex=2.2, text.width = textwidthstop, bty='n') # Luckily we can just use the same text width here, since we want them to align
 
# Add title, so people know what the graph is all about.
title("Number of pieces acquired by region, 1930-2015", outer=TRUE, cex.main=2)

plot of chunk unnamed-chunk-6-2

So this graph is…OK. But it really only gets across two points. The first is that the vast majority of acquisitions in any five year period are from North Ameria and Europe (which we knew already, but this does make the point fairly starkly, given that pretty much the only colors that even show up are blue and orange). The second is a bit subtler. If we look at how the height of the bars vary, we can also see that the absolute level of acquisitions has some interesting peaks and valleys as well. In the early 1960s, for example, MoMA seems to have gone on an acquisition spree (or been donated a huge collection) - it gets over 12,000 pieces from Europe alone. And looking at the whole chart, it seems like MoMA has been steadily increasing the numbers of pieces it acquires over time - though this is a little bit harder to see.

The ups and downs of the absolute numbers make the relative changes between regions almost impossible to pick out, however. It also forced us to increase the length of the y-axis scale to accomodate the enormous amount of pieces acquired in the early 60s, which ends up putting a ton of white space at the top of the other small graphs (in effect, making them even smaller).

To deal with this, I constructed a similar set of small multiples, but in place of graphing the number of pieces, graphed the proportion from each region instead:

# This code is essentially identical to what we did above, so I've taken out the comments
par(mfrow=c(5,4), mar=c(.2, .2, 1, .2), oma=c(1,2,2,0), family='serif')
for (i in 1:17) {
  regiontable             <- prop.table(table(subsets[[i]]$Region))*100              # Make table of proportions instead of counts (I multiply by 100 because I hate decimals)
  log(regiontable)
  barplot(regiontable, xaxt='n', ylim=c(0,100), col=riley
        , yaxt= if ((i+3)%%4!=0) 'n', yaxp=c(0, 100, 2))                           
  title(paste(1925+(i*5), "-", 1930+(i*5), sep=""), line=-1.5)
  axis(side=1, at=c(0, 7.5), labels=FALSE) 
}
 
# Add legend (we've already specified the values needed in the last code chunk)
legend(10,60, legendtexttop, fill=riley[1:3], xpd=NA,     
       horiz=TRUE, cex=2.2, text.width = textwidthstop, bty='n')
legend(10,40, legendtextbottom, fill=riley[4:6], xpd=NA,     
       horiz=TRUE, cex=2.2, text.width = textwidthstop, bty='n')
 
# Title graph
title("Percent of pieces acquired by region, 1930-2015", outer=TRUE, cex.main=2)

plot of chunk unnamed-chunk-7

This is slightly more interesting, though not wildly different. We can see a bit more of the shift back towards American artists in the 70s, 80s and 90s, and can also see that there were some acquisitions from artists of Latin American origin in the 30s and 40s, but we are really running into the same problem that we always have with these data – that they are so heavily skewed towards North America and Europe that it is hard to see any kind of relative change in the other regions.

Luckily, we have a solution for this, if we’re willing to sacrifice a bit of interpretability - which is to plot everything on a log scale. Logging the values smushes everything together, which makes changes in the smaller values easier to see; we just have to remember that the distances between regions are really much larger than what the graphs seem to show. Again, the code below is pretty much identical (except for some positioning of the legend), so I’ve stripped out all of the comments:

par(mfrow=c(5,4), mar=c(.2, .2, 1, .2), oma=c(1,2,2,0), family='serif')
 
for (i in 1:17) {
  regiontable             <- table(subsets[[i]]$Region)                                
  barplot(log(regiontable), xaxt='n', ylim=c(0,12), col=riley    # This is the only line that changed
        , yaxt= if ((i+3)%%4!=0) 'n', yaxp=c(0, 10, 2))                                               
  title(paste(1925+(i*5), "-", 1930+(i*5), sep=""), line=-1.5)
  axis(side=1, at=c(0, 7.5), labels=FALSE) 
}
# Add legend 
legend(10,8, legendtexttop, fill=riley[1:3], xpd=NA,     
       horiz=TRUE, cex=2.2, text.width = textwidthstop, bty='n')
legend(10,6, legendtextbottom, fill=riley[4:6], xpd=NA,     
       horiz=TRUE, cex=2.2, text.width = textwidthstop, bty='n')
 
# Title graph
title("Number of pieces acquired by region (natural log), 1930-2015", outer=TRUE, cex.main=2)

plot of chunk unnamed-chunk-8

Now this is much better. Now we can see a lot of the trends we picked out before in our map, and in bar chart form, can actually see them much more clearly. We see that there were some acquisitions from Latin American artists pretty much from the beginning, and that there started to be acquisitions of pieces from artists from Asia starting in the 50s. Meanwhile, acquisitions of pieces by artists from any country in Africa (the yellow bars) really don’t start appearing in any large numbers until the late 1990s.

It’s important to keep in mind looking at this graph that these are exponential differences - in the last chart (for 2010-2015), for example, it looks like there were twice as many pieces acquired from artists of African nationality than artists from countries in Oceania - but in reality, there were 10 times as many (102 vs. 12). And as you get higher up the log scale, the differences become even larger – while the bar for Africa looks like it is just under half of the size of the bar for North America, it represents a more than 70-fold difference (102 vs. 7444). I still think logging is the right solution when looking at trends over time for this kind of skewed data, since you really can see changes with a great deal of detail, but plotting things this way does really present a communication problem that I’m not sure I’ve ever seen a good solution for. If you presented this chart by itself, it would be easy for someone to walk away and think that MoMA now acquires pieces done by artists from all of the world, with only a slight bias towards artists of American or European origin. It’s only when looking at it in concert with the other charts that the full story - both the magnitude of the differences as well as the relative change - gets clearly communicated.

Making things a bit more complicated

Probably we should stop at the small multiples we graphed above - they’re relatively clear, and there’s nothing wrong with bar charts. But because I can’t stop won’t stop, I also wanted to try out some stacked area graphs to see if I could get something with a bit more visual oomph, and which fit into a smaller space. To do so, I used the stackpoly command from the plotrix package, which is awesome and easy to use, and kept everything at a log scale:

par(family='serif', oma=c(0,0,0,2))
 
# In order to make a stacked area chart, we need to put the logged frequencies into a matrix
stackedmatrix <- matrix(NA, 17,6)            # With 17 rows (for each five year period), and 6 columns (for each region)
for (i in 1:17) {
  stackedmatrix[i,]             <- log(table(subsets[[i]]$Region))            
  
}
# One little fix:
stackedmatrix[stackedmatrix=="-Inf"] <- 0   # When we log 0s, they become undefined, so we need to set them back to 0.  
 
# create x axis labels
xaxislabels <- seq(from=1935, to=2015, by=5)
 
# Chart!
stackpoly(stackedmatrix, stack=TRUE, xaxlab=xaxislabels, ylim=c(0,40), col=riley, axis4=FALSE)  
 
# Title!
title("Number of pieces acquired by region (natural log), 1930-2015")
 
# Legend!
legend(3.5,-2, legendtexttop, fill=riley[1:3], xpd=NA,     
       horiz=TRUE, cex=1, text.width = textwidthstop, bty='n')
legend(3.5,-3, legendtextbottom, fill=riley[4:6], xpd=NA,     
       horiz=TRUE, cex=1, text.width = textwidthstop, bty='n')

plot of chunk unnamed-chunk-9 When I first saw this graph, I liked it a lot – it seems like you mantain the pattern of relative shifts, but can also see much more clearly that the absolute number of acquisitions has increased over time. It’s harder to see the precise values, but as an overall story, it’s not bad.

The problem is that the story of overall increase is also almost completely wrong. This is because of something mathy - the sum of logged values (which is what we are seeing here), isn’t the same thing as the log of the sums. For example, the total number of pieces acquired in 2015 was 16,401: the log of that number is ~10. But what we see on the graph instead is a value on the log scale closer to 40 (the sum of the logged value for each region). Ugh. Math. If we were to actually plot it correctly, we’d see that there has been a bit of an increase, but nothing drastic (the peak is actually in the 1960s). Ah, the magic of the log scale.

So we may need to forget about trying to look at absolute levels, and instead just focus in on relative shifts - in other words, just look at distributions over time. I’m actually not sure what I did next works out mathematically (I’m actually fairly sure it doesn’t), but the result was so pretty that I thought it was worth including here. What I did was take the logged values for each 5-year period, and standardized them so that the sum of the logs always add up to the same number (10 in this case); what I ended up with was this piece of art:

par(family='serif', oma=c(0,0,0,2))
 
# Scale matrix 
stackedmatrixscaled  <- t(apply(stackedmatrix, 1, function(x)((10/sum(x))*x)))
 
# Chart!
stackpoly(stackedmatrixscaled, stack=TRUE, xaxlab=xaxislabels, ylim=c(0,10), col=riley, axis4=FALSE)  
 
# Title!
title("Distribution of pieces acquired by region (natural log), 1930-2015")
 
# Legend!
legend(3.5,-0.6, legendtexttop, fill=riley[1:3], xpd=NA,     
       horiz=TRUE, cex=1, text.width = textwidthstop, bty='n')
legend(3.5,-1, legendtextbottom, fill=riley[4:6], xpd=NA,     
       horiz=TRUE, cex=1, text.width = textwidthstop, bty='n')

plot of chunk unnamed-chunk-10

Which coincidentally, ends up looking like some of Bridgette Riley’s pieces

Written on February 3, 2017