Playing around with MoMA data (Part I: Mapping)
One of the datasets that I first played around with when I was sloooowly learning to use R was a really beautiful dataset that MoMA compiled and made publicly available on GitHub, which has selected metadata on all 131,266 pieces in their permanent collection. It appealed to me because it looked like a slightly tricky dataset to work with – with variables in a lot of formats, and with kinds of data I wasn’t used to working with – but not so tricky that I wouldn’t get anywhere.
I’m mainly interested in R as a data exploration and visualization tool, since I still do much of my “serious” work in Stata. Because of this, I’ve been using the dataset as a way to experiment with different ways of visualizing data – both for myself, to understand the data better (“exploratory” visualization), and for telling a specific story to others (“explanatory” visualization). At first I thought I would just settle on one visualization, and share that here, but I actually haven’t been particularly happy with any of them. Instead, I thought what might be nice would be to do a series of posts on different ways of visualizing the same data, all motivated by the same central research question. Since this is the first one, I’m also going to share some of my struggles with manipulating the data in the first place - which might make this post a bit long, but feel free to skip to the parts that interest you the most.
I also want to acknowledge that a lot of other people have used this data to do all sorts of interesting things (and maybe the very things I’m doing here!). I’m going to try to link to them when I can, but I’m sure I’ll have missed some things; feel free to let me know if I have. So, without further ado:
The Research Question
It is, I think, pretty well known that many museums’ permanent collections, including MoMA’s, tend to be skewed towards ‘western’ artists, because of historical inequities in what was (or is) considered notable or worthy art (and, in MoMA’s case, probably in what was/is considered “modern” art; I’m by no means an art historian, but these seem like safe bets). What I’m interested in, however, is how this has shifted over time, as art trends and social contexts have changed. There’s been at least one very nice visualization that has looked at exhibitions by artists’ nation of origin over time, but I thought it would be interesting to instead look at the museum as a collector, whose permanent collection acquisitions reflect institutional priorities and larger sociopolitical trends. Luckily, the MoMA data has information for almost every piece in their permanent collection on who the artist(s) were, their countries of origin (though there were complications with this), and the date the piece was first acquired.
The Visualization
For those less interested in the code/process, I thought I would put the “final” visualization up front (for those interested in code/process, welcome! You will find it in excruciating detail in the next section). For this first crack at visualizing the data, it seemed like it made sense to show countries of origin on a world map, with shading related to how many pieces acquired in a given year were made by an artist with that nationality. Since I wanted to show how this changed over time, I broke the data into 5 year ‘chunks’ - so the chunk “1931-1935” represents the acquisitions made by the museum in the early 30s, the chunk 1936-1940 represents the same but for the late 30s, and so on. And because I am a child of the internet age, I wanted to make it a GIF.
Now this may go a bit fast, but generally, what you can see (besides the fact that the vast majority of acquisitions in every period are from American artists) is that the MoMA began acquiring more art from Latin American artists in the late 1930s, and that in the 1960s, it started acquiring much more art from countries in Asia. It’s not really until the 2000s that MoMA started acquiring meaningful amounts of art from any countries in Africa, and even then, the numbers are fairly limited.
There are quite a few more notes about the process and the data below, but generally, I think this visualization approach, while it looks cool, is actually a bit hard to follow. I don’t know that we need this level of detail, for one, and the point that it’s trying to make gets a bit lost. In future posts, I’m going to explore some less fancy ways to show some of the same trends; for now, all of you code lovers out there, see below:
Struggling With the Data
As you’ll see, this data was a pain to work with, which led me to load/install so many packages that I’m not actually sure which ones were even necessary in the end. I’m putting them all here, for completeness’ sake, but it’s possible you don’t need all of them - at some point I’ll try to go through and remove the unnecessary ones.
########################
# SET-UP
#######################
library(readr)
library(stringr)
library(gdata)
library(plyr)
library(maps)
library(maptools)
library(rgdal)
library(ggplot2)
library(broom)
library(animation)
library(RCurl)
library(anytime)
gpclibPermit() # This is a command that needs to be run to get maptools' polygon functions (used by tidy() later on) to work properly (at least for me)
The data is available as a csv, so I just load it directly from GitHub. The benefit of this is that as MoMA improves the data, the results might improve; the disadvantage is that if they change any of their variable names all of my code might break. C’est la vie, and all that.
# Load data. I have to use the Rcurl package here because R doesn't like HTTPS links
my.url<- getURL("https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv")
df <- read.csv(text=my.url)
The MoMA data has nationality as a field already (!), which I got very excited about at first. But then I looked at it.
head(df[,4:5])
## ArtistBio Nationality
## 1 (Austrian, 1841â\u0080\u00931918) (Austrian)
## 2 (French, born 1944) (French)
## 3 (Austrian, 1876â\u0080\u00931957) (Austrian)
## 4 (French and Swiss, born Switzerland 1944) ()
## 5 (Austrian, 1876â\u0080\u00931957) (Austrian)
## 6 (French and Swiss, born Switzerland 1944) ()
Just from these top rows, it’s clear it has issues. In particular, it seemed to have problems with cases where the artists had dual nationalities. In some other cases (not shown here), it just chose one, seemingly arbitrarily, which I didn’t want either.
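Before giving up on the field entirely, a rough way to get a sense of the scale of the problem is to count how many rows have an empty “()” in the Nationality field, and how many bios mention more than one nationality. (This is just a sketch – the grepl() pattern below is a crude heuristic, not a careful parse.)
# Rough check of the built-in Nationality field (crude heuristics, not a careful parse)
sum(df$Nationality == "()")        # rows where the field is simply empty
sum(grepl(" and ", df$ArtistBio))  # bios that mention more than one nationality (or artist)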
The artist bio field seemed like a better bet, though a bit harder to work with. The dates were a bit garbled, but I wasn’t interested in those, and it seemed like it would be possible to extract the nationalities. What’s more, there were only 4,656 pieces with missing data, which isn’t bad out of 131,266 pieces overall (about 4%).
To actually get the nationalities out of the artist bio field, however, I had to use the grep function, which I had never used before (as will be crystal clear from the code below). The rough idea of the code is to cycle through a list of nationalities, and for each, extract the pieces which had that nationality mentioned in the artist bio section. I would then end up with a set of dataframes (one for each nationality), with all of the pieces that were created by an artist of that nationality. A key thing to note here is that this means that pieces are double counted if the artists have dual nationalities or if there are multiple artists from different countries. I actually decided this is what I wanted – if a piece is by an artist who is both French and Algerian, I don’t want to choose which country the piece will be attributed to – but it is something to keep in mind when interpreting the results. It’s also a good example of the kinds of substantive data choices that you have to make even for a simple analysis, which aren’t often as well documented as they should be in the academic literature (which is why code sharing is great!).
The code below does what I just described; it isn’t the most elegant, so if anyone has better ideas for how to do this, I would love to hear.
# read in table of nationalities
# (this is just an excel file with country names in one column, and nationality names in the second)
my.url <- getURL("https://raw.githubusercontent.com/imaddowzimet/Misc/master/Nationalities%20Country%20Crosswalk.csv")
nationalities <- read.csv(text=my.url, header=FALSE)
# the code below creates a list of data frames, one for each nationality, which contain all the pieces of artists of that nationality
# (artists with dual or multiple nationalities will appear in multiple data frames, which is what we want for now.)
df$Nationality <- NA # drop the nationality info that we are not using
numrows <- dim(nationalities)[1] # count nationalities
mynationalitysubsets <- list() # initialize list
for (i in 1:numrows) { # for each nationality
  # check the "artist bio" section, and if it has a nationality that matches the nationality list, select it for the dataset
  mynationalitysubsets[[i]] <-
    df[grep(paste0(nationalities$V2[i]," ","|",nationalities$V2[i],","), df$ArtistBio),]
  # if the dataset we just created is not empty
  if (nrow(mynationalitysubsets[[i]])>0) {
    # then create a variable "nationality" with the corresponding nationality
    mynationalitysubsets[[i]]$Nationality <- as.character(nationalities$V2[i])
  }
}
# We now have a list of datasets; one for each nationality.
# Now we are going to append all of the datasets into one giant one
dfFinal <- mynationalitysubsets[[1]] # assign first dataset to dfFinal
for (i in 2:numrows) {
  dfFinal <- rbind(dfFinal, mynationalitysubsets[[i]]) # append the rest
}
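As an aside, since I asked for better ideas: the initialize-and-append step can be collapsed into a single call to base R’s do.call(), which stacks every data frame in the list at once (a sketch that should be equivalent to the loop above):
# Stack all of the per-nationality data frames in one go
dfFinal <- do.call(rbind, mynationalitysubsets)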
So now I have one big dataset, with each piece either taking up one or multiple rows (in the case of dual nationalities/multiple artists). As a gut check, I can look at the N of this new dataset:
dim(dfFinal)[1]
## [1] 124934
Hmm. I started out with an N of 131,266, of which 4,656 had missing data for the artist bio field. 131,266 - 4,656 = 126,610, but then I know some pieces are double counted, so I should have more than that number. Instead, I have fewer. Probably what happened is that some of the artist bio fields had information in them but not valid nationalities, so they weren’t captured in my initial missing data check; since I don’t have a good way to impute these, and the drop in size doesn’t seem alarming, I proceed onwards and upwards.
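If you want to actually eyeball the rows that slipped through, one rough way (a sketch) is to use the data frame’s row names – which survive subsetting – to find the original rows that never made it into any of the nationality subsets:
# Rows of the original data that were never matched by any nationality pattern
matchedrows <- unlist(lapply(mynationalitysubsets, rownames))
unmatched   <- df[!(rownames(df) %in% matchedrows), ]
head(unmatched$ArtistBio, 20) # eyeball the bios that slipped through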
Now I have nationalities but not countries, so I need to merge those in if I want to do anything with maps.
##Merge in country names
nationalities <- rename(nationalities, c("V2"="Nationality","V1"="Country" )) # rename variables in nationality/country list
nationalities <- nationalities[!duplicated(nationalities$Nationality), ] # remove duplicate nationalities
nationalities$Country[nationalities$Country=="Dominica"] <- "Dominican Republic" # Fix an error in spreadsheet (oops!)
orig.length<- dim(dfFinal) # This will be used to make sure the join is working correctly
dfFinal <- join(dfFinal, nationalities, by='Nationality', type='left', match='all') # Merge datasets
orig.length[1]== dim(dfFinal)[1] # Check we didn't lose any rows
dfFinal$Country <- as.factor(dfFinal$Country) # change country to a factor variable
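(If you’d rather have the script fail loudly than rely on eyeballing a printed TRUE, base R’s stopifnot() will do the same row-count check and halt if it fails:)
stopifnot(orig.length[1] == dim(dfFinal)[1]) # stop right away if the join changed the number of rows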
So now we can actually start looking at what we have! I know that eventually I want to look at artists’ country of origin by date of acquisition, but I always want to do a one-way tabulation of my variable first, to get a sense of what I have and make sure that I haven’t made some horrible, horrible mistake.
# create a table of overall counts by nationality
mytable <- table(dfFinal$Country) # create table of counts
mytable # print table of counts! This is going to be wayyy too big.
##
## Afghanistan Albania Algeria
## 1 24 6
## Andorra Angola Argentina
## 0 1 0
## Armenia Australia Austria
## 2 248 1107
## Azerbaijan Bahamas, The Bahrain
## 1 3 0
## Bangladesh Barbados Belarus
## 0 0 0
## Belgium Belize Benin
## 1507 0 0
## Bhutan Bolivia Bosnia and Herzegovina
## 0 4 10
## Botswana Brazil Brunei
## 0 732 0
## Bulgaria Burkina Faso Burundi
## 11 0 0
## Cambodia Cameroon Canada
## 8 13 836
## Cape Verde Islands Chad Chile
## 0 0 534
## China Colombia Congo, Dem. Rep.
## 211 721 3
## Costa Rica Croatia Cuba
## 64 157 190
## Cyprus Czech Republic Denmark
## 1 908 623
## Djibouti Dominican Republic Ecuador
## 0 0 0
## Egypt, Arab Rep. El Salvador Eritrea
## 59 0 0
## Estonia Ethiopia Fiji
## 1 1 0
## Finland France Gabon
## 231 22774 0
## Gambia, The Georgia Germany
## 0 17 10018
## Ghana Greece Grenada
## 4 54 0
## Guatemala Guinea Guyana
## 68 0 1
## Haiti Honduras Hungary
## 17 0 776
## Iceland India Indonesia
## 51 100 2
## Iran, Islamic Rep. Iraq Ireland
## 45 4 52
## Italy Jamaica Japan
## 2938 0 2719
## Jordan Kazakhstan Kenya
## 0 0 12
## Korea, Dem. Rep. Korea, Rep. Kuwait
## 0 0 1
## Lao PDR Latvia Lebanon
## 0 73 50
## Liberia Libya Lithuania
## 0 0 73
## Macedonia, FYR Madagascar Malawi
## 1 0 0
## Malaysia Maldives Mali
## 7 0 10
## Malta Mauritania Mauritius
## 0 2 0
## Mexico Moldova Monaco
## 1305 0 0
## Mongolia Montenegro Morocco
## 0 0 5
## Mozambique Myanmar Namibia
## 1 0 1
## Nepal Netherlands Nicaragua
## 0 1659 1
## Niger Nigeria Norway
## 0 10 183
## Oman Pakistan Panama
## 0 47 4
## Papua New Guinea Paraguay Peru
## 0 3 107
## Philippines Poland Portugal
## 0 488 169
## Qatar Romania Russian Federation
## 0 71 1861
## Rwanda Saudi Arabia Senegal
## 1 1 3
## Serbia Seychelles, the Sierra Leone
## 15 0 0
## Singapore Slovak Republic Slovenia
## 2 77 27
## Somalia South Africa Spain
## 0 394 3118
## Sri Lanka Sudan Suriname
## 0 2 0
## Swaziland Sweden Switzerland
## 0 457 2325
## Syrian Arab Republic Taiwan Tajikistan
## 1 8 1
## Tanzania Thailand Togo
## 1 33 0
## Tunisia Turkey Turkmenistan
## 14 55 0
## Tuvalu Uganda Ukraine
## 0 1 130
## United Kingdom United States of America Uruguay
## 5991 57611 68
## Uzbekistan Vanuatu Venezuela, RB
## 0 0 415
## Vietnam Western Samoa Yemen, Rep.
## 7 0 0
## Yugoslavia Zambia Zimbabwe
## 159 0 15
So already we can tell we have a problem (aside from how gigantic and unwieldy this table is). Almost half of the pieces are from the US, and the proportions of pieces from most other countries are so small that they are going to be impossible to put on any kind of meaningful scale. Luckily, the solution here is simple: we can transform the variable, and plot the counts on a log scale instead. This has the effect of compressing the distribution; we just need to remember that each one-unit increase in the logged count is an exponential increase in the raw count - so unlike a rearview mirror, distances are bigger than they appear. For this case, where we are just mapping values, this kind of logged outcome isn’t a huge deal in my opinion - it will still give you an idea of the general trend, and is actually a nice way to look at change over time - but for other types of visualizations, it could be misleading or hard to interpret (something I’ll maybe get into in another post). I’m actually going to log this outcome in a later part of the code, because it’s a bit easier to do there, but I wanted to include this as a flag that it is always important to look at some simple tabulations, if possible, before getting to more complicated visualizations.
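As a quick illustration of both the skew and what the log transformation does to it, here’s a peek at the five biggest countries (a sketch using only base R):
# The five countries with the most pieces, as raw shares and as logged counts
top5 <- head(sort(mytable, decreasing = TRUE), 5)
round(prop.table(mytable)[names(top5)], 3) # raw shares: the US dwarfs everything else
round(log(top5), 1)                        # logged counts sit on a much more comparable scale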
I said I wanted to look at this over time, so I need to work a bit with the date acquired variable. The code below first converts the variable from string to date format (using the extremely useful anytime function, which saved me even worrying about how R deals with dates), checks for missing data, and then creates five year breaks.
# convert from string to date format
dfFinal$DateAcquired <- anytime(dfFinal$DateAcquired)
# check % missing
length(dfFinal$DateAcquired[is.na(dfFinal$DateAcquired)])/length(dfFinal$DateAcquired)
# only about 4% missing data, which is nice - I'm OK with listwise deletion for now
# Create five year break variable
dfFinal$year <- as.numeric(format(dfFinal$DateAcquired, format="%Y")) # Create year variable (numeric, so the comparisons below behave)
dfFinal <- dfFinal[!is.na(dfFinal$year) & dfFinal$year>=1930 & dfFinal$year<=2015,] # drop missing dates, and exclude 1929 and 2016 to get nicer breakpoints
breaks <- seq(from=1930, to=2015, by=5) # create breaks
dfFinal$fiveyear <- cut(dfFinal$year, breaks) # intervals are right-closed, so the first one is 1931-1935
subsets <- split(dfFinal, dfFinal$fiveyear, drop=TRUE) # and break it into 17 separate datasets -- one for each five year break
Then I can actually do the mapping! You’ll need to download world shape files; I downloaded the ones I use from Natural Earth, a truly awesome site, which you can read more about here.
##################################################
# Map Data prep
#################################################
# Read in shape files
countries <- readShapePoly("//Users//isaacmaddowzimet//Desktop//Experiments//ne_110m_admin_0_countries//ne_110m_admin_0_countries.shp") # Change the file path to wherever these are saved.
# Change projection, so the world map doesn't look bizarre, and save as a new object
proj4string(countries) <- CRS("+proj=longlat")
winkel <- "+proj=wintri"
countries_winkel <- spTransform(countries, CRS(winkel))
# Add some margins to the plot
# par(mar=c(1,1,1,1))
# convert the map object to a dataframe (so ggplot can use it)
artmap <- tidy(countries_winkel, region="name_sort")
artmap <- rename(artmap, c("id"="Country")) # rename the map dataframe's id column to Country, to make merging easier
dim(artmap)
# rename French Guiana (since it's a department of France, right now it's grouped with France, which is accurate, but makes it look like the MoMA was acquiring a whole bunch of art from South America in the early 1930s)
artmap$Country[artmap$group=="France.2"] <- "French Guiana"
# create counts by country, and merge with map dataframe
artmaplist <- list()
for (i in 1:17) {
  data1 <- as.data.frame(table(subsets[[i]]$Country))
  data1 <- rename(data1, c("Var1"="Country"))
  artmaplist[[i]] <- join(artmap, data1, by='Country', type='left', match='all') # Merge datasets
  artmaplist[[i]]$year <- i
}
# Append all of these into one dataset
artmapfull <- artmaplist[[1]]
for (i in 2:17) {
  artmapfull <- rbind(artmapfull, artmaplist[[i]])
}
# Convert frequency into log frequency (I told you I would get to this eventually)
artmapfull$logfreq <- log(artmapfull$Freq)
Now I could just produce these maps individually - here’s 1931-1935:
# Map the first year, to try it out
ggplot(artmapfull[artmapfull$year==1,], aes(long, lat, group=group, fill=logfreq)) +geom_polygon() +
coord_equal() + guides(fill=FALSE) + scale_fill_distiller(palette = "OrRd", direction= 1 ) +
theme(axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),legend.position="none",
panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),plot.background=element_blank())
But just for fun, I thought it would be nice to make a gif using the animation package. Getting this package to work is a bit involved, especially on a Mac (you have to install a program called ImageMagick, which is not as simple as it sounds; Google for details), but the code below should get you started.
# Create list of years (for label)
years <- c("1931-1935", "1936-1940", "1941-1945", "1946-1950", "1951-1955", "1956-1960", "1961-1965", "1966-1970",
"1971-1975", "1976-1980","1981-1985", "1986-1990","1991-1995", "1996-2000", "2001-2005", "2006-2010", "2011-2015")
library(animation)
ani.options(interval=1.5, ani.width = 900, ani.height= 600)
saveGIF(
for (i in 1:length(years)){
g<-ggplot(artmapfull[artmapfull$year==i,], aes(long, lat, group=group, fill=logfreq)) +geom_polygon() +
coord_equal() + guides(fill=FALSE) + scale_fill_distiller(palette = "OrRd", direction= 1 ) +
ggtitle(paste("Permanent collection acquisitions by country of origin:", years[i]))+
theme(axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),legend.position="none",
panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),plot.background=element_blank(),plot.title = element_text(hjust = 0.5))
print(g)
}
)