Playing around with MoMA data (Part I: Mapping)
One of the datasets that I first played around with when I was sloooowly learning to use R was a really beautiful dataset that MoMA compiled and made publicly available on GitHub, which has selected metadata on all 131,266 pieces in their permanent collection. It appealed to me because it looked like a slightly tricky dataset to work with – with variables in a lot of formats, and with kinds of data I wasn’t used to working with – but not so tricky that I wouldn’t get anywhere.
I’m mainly interested in R as a data exploration and visualization tool, since I still do much of my “serious” work in Stata. Because of this, I’ve been using the dataset as a way to experiment with different ways of visualizing data – both for myself, to understand the data better (“exploratory” visualization), and for telling a specific story to others (“explanatory” visualization). At first I thought I would just settle on one visualization and share that here, but I actually haven’t been particularly happy with any of them. Instead, I thought it might be nice to do a series of posts on different ways of visualizing the same data, all motivated by the same central research question. Since this is the first one, I’m also going to share some of my struggles with manipulating the data in the first place - which might make this post a bit long, but feel free to skip to the parts that interest you the most.
I also want to acknowledge that a lot of other people have used this data to do all sorts of interesting things (and maybe the very things I’m doing here!). I’m going to try to link to them when I can, but I’m sure I’ll have missed some things; feel free to let me know if I have. So, without further ado:
The Research Question
It is, I think, pretty well known that many museums’ permanent collections, including MoMA’s, tend to be skewed towards ‘western’ artists, because of historical inequities in what was (or is) considered notable or worthy art (and in the case of MoMA, probably what was/is considered “modern” art. I’m by no means an art historian, but these seem like safe bets). What I’m interested in, however, is how this has shifted over time, as art trends and social contexts have changed. There’s been at least one very nice visualization that has looked at exhibitions by artists’ nation of origin over time, but I thought it would be interesting to instead look at the museum as a collector, whose permanent collection acquisitions reflect institutional priorities and larger sociopolitical trends. Luckily, the MoMA data has information for almost every piece in their permanent collection on who the artist(s) were, their countries of origin (though there were complications with this), and the date the piece was first acquired.
The Visualization
For those less interested in the code/process, I thought I would put the “final” visualization up front (for those interested in code/process, welcome! You will find it in excruciating detail in the next section). For this first crack at visualizing the data, it seemed like it made sense to show countries of origin on a world map, with shading related to how many pieces acquired in a given year were made by an artist with that nationality. Since I wanted to show how this changed over time, I broke the data into 5-year ‘chunks’: the chunk “1931-1935” represents the acquisitions made by the museum in the early 30s, the chunk “1936-1940” represents the same but for the late 30s, and so on. And because I am a child of the internet age, I wanted to make it a GIF.
Now this may go a bit fast, but generally, what you can see (besides the fact that the vast majority of acquisitions in every period are from American artists) is that MoMA started to acquire more art from Latin American artists starting in the late 1930s, and that in the 1960s, it started acquiring much more art from countries in Asia. It’s not really until the 2000s that MoMA started acquiring meaningful amounts of art from any countries in Africa, and even then, the numbers are fairly limited.
There are quite a few more notes about the process and the data below, but generally, I think this visualization approach, while it looks cool, is actually a bit hard to follow. I don’t know that we need this level of detail, for one, and the point that it’s trying to make gets a bit lost. In future posts, I’m going to explore some less fancy ways to show some of the same trends; for now, all you code lovers out there, see below:
Struggling With the Data
As you’ll see, this data was a pain to work with, which led me to load/install so many packages that I’m not actually sure which ones were even necessary in the end. I’m putting them all here, for completeness’ sake, but it’s possible you don’t need all of them - at some point I’ll try to go through and remove the unnecessary ones.
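Here’s a minimal sketch of the load calls, covering just the packages the steps below rely on (one plausible set; sf in particular is just one way to handle the shapefiles):

```r
# A minimal package set for the steps below; the original list was longer
library(dplyr)      # data manipulation (bind_rows, joins, counts)
library(ggplot2)    # plotting the maps
library(anytime)    # painless string-to-date conversion
library(sf)         # reading the Natural Earth shapefiles (one option)
library(animation)  # stitching the maps into a GIF
```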
The data is available as a csv, so I just load it directly from GitHub. The benefit of this is that as MoMA improves the data, the results might improve; the disadvantage is that if they change any of their variable names all of my code might break. C’est la vie, and all that.
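For reference, a sketch of the load call, assuming the repo still lives at MuseumofModernArt/collection with the artworks file named Artworks.csv:

```r
# Read the artworks metadata straight from GitHub; if MoMA moves or
# renames the file, this line is the one that breaks
moma <- read.csv(
  "https://raw.githubusercontent.com/MuseumofModernArt/collection/master/Artworks.csv",
  stringsAsFactors = FALSE
)
```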
The MoMA data has nationality as a field already (!) which I got very excited about at first. But then I looked at it.
Just from these top rows, it’s clear it has issues. In particular, it seemed to have problems with cases where the artists had dual nationalities. In some other cases (not shown here), it just chose one, seemingly arbitrarily, which I didn’t want either.
The artist bio field seemed like a better shot, though a bit harder to work with. The dates were a bit garbled but I wasn’t interested in those, and it seemed like it would be possible to extract the nationalities. What’s more, there were only 4,656 pieces with missing data, which isn’t bad out of 131,266 pieces overall (about 3.5%).
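A quick way to count those missing bios (assuming the column comes in as ArtistBio; check names(moma) if your version of the csv differs):

```r
# Pieces with no usable artist bio (NA or empty string)
sum(is.na(moma$ArtistBio) | moma$ArtistBio == "")
```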
To actually get the nationalities out of the artist bio field, however, I had to use the grep function, which I had never used before (as will be crystal clear from the code below). The rough idea of the code is to cycle through a list of nationalities, and for each, extract the pieces which had that nationality mentioned in the artist bio section. I would then end up with a set of dataframes (one for each nationality), with all of the pieces that were created by an artist of that nationality. A key thing to note here is that this means that pieces are double counted if the artists have dual nationalities or if there are multiple artists from different countries. I actually decided this is what I wanted – if a piece is by an artist who is both French and Algerian, I don’t want to choose which country the piece will be attributed to – but it is something to keep in mind when interpreting the results. It’s also a good example of the kinds of substantive data choices that you have to make even for a simple analysis, which aren’t often as well documented as they should be in the academic literature (which is why code sharing is great!).
The code below does what I just described; it isn’t the most elegant, so if anyone has better ideas for how to do this, I would love to hear them.
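In sketch form, assuming a hand-built character vector of nationality terms (heavily abbreviated here):

```r
# Hypothetical (and heavily abbreviated) vector of nationality adjectives
nationalities <- c("American", "French", "German", "Japanese", "Algerian")

# For each nationality, keep every piece whose artist bio mentions it
pieces_by_nat <- lapply(nationalities, function(nat) {
  matches <- moma[grepl(nat, moma$ArtistBio), ]
  matches$Nationality2 <- rep(nat, nrow(matches))  # tag rows with the match
  matches
})

# Stack into one long dataset: a piece by artists of two nationalities
# now appears twice, once per nationality (deliberately, as noted above)
moma_long <- bind_rows(pieces_by_nat)
```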
So now I have one big dataset, with each piece either taking up one or multiple rows (in the case of dual nationalities/multiple artists). As a gut check, I can look at the N of this new dataset:
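With the moma_long object from the sketch above, that’s just:

```r
# One row per piece-nationality pair
nrow(moma_long)
```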
Hmm. I started out with an N of 131,266, of which 4,656 had missing data for the artist bio field. 131,266 - 4,656 = 126,610, but then I know some pieces are double counted, so I should have more than that number. Instead, I have fewer. Probably what happened is that some of the artist bio fields had information but not valid nationalities, so they weren’t initially captured in my missing data check; since I don’t have a good way to impute these, and the drop in size doesn’t seem alarming, I proceed onwards and upwards.
Now I have nationalities but not countries, so I need to merge those in if I want to do anything with maps.
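A sketch with a hypothetical hand-built crosswalk; in practice this table needs a row for every nationality term, with country names that match the map data used later:

```r
# Hypothetical nationality-to-country crosswalk (abbreviated); the country
# names are chosen to match Natural Earth's NAME field for the join below
nat_to_country <- data.frame(
  Nationality2 = c("American", "French", "German", "Japanese", "Algerian"),
  country = c("United States of America", "France", "Germany",
              "Japan", "Algeria"),
  stringsAsFactors = FALSE
)

# Attach a country name to every piece-nationality row
moma_long <- left_join(moma_long, nat_to_country, by = "Nationality2")
```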
So now we can actually start looking at what we have! I know that eventually I want to look at artists’ country of origin by date of acquisition, but I tend to always just want to do a one-way tabulation of my variable to get a sense of what I have, and make sure that I haven’t made some horrible, horrible mistake.
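The tabulation itself is a one-liner:

```r
# Raw counts by country, biggest first
moma_long %>% count(country, sort = TRUE)
```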
So already we can tell we have a problem (aside from how gigantic and unwieldy this table is). Almost half of the pieces are from the US, and the proportions of pieces from most other countries are so small that they are going to be impossible to put on any kind of meaningful scale. Luckily, the solution here is simple: we can transform the variable, and plot the counts on a log scale instead. This has the effect of compressing the distribution; we just need to remember that each one-unit increase in the logged count is a multiplicative jump in the raw count - so unlike a rearview mirror, distances are bigger than they appear. For this case, when we are just mapping values, this kind of logged outcome isn’t a huge deal in my opinion - it will still give you an idea of the general trend, and is actually a nice way to look at change over time - but for other types of visualizations, it could be misleading or hard to interpret (something I’ll maybe get into in another post). I’m actually going to log this outcome in a later part of the code, because it’s a bit easier to do there, but I wanted to include this as a flag that it is always important to look at some simple tabulations, if possible, before getting to more complicated visualizations.
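To make the compression concrete, with base-10 logs (so one unit on the logged scale is a tenfold jump in the raw count):

```r
# Four orders of magnitude collapse into a 0-4 scale
log10(c(1, 10, 100, 10000))
#> [1] 0 1 2 4
```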
I said I wanted to look at this over time, so I need to work a bit with the date acquired variable. The code below first converts the variable from string to date format (using the extremely useful anytime package, which saved me from even worrying about how R deals with dates), checks for missing data, and then creates five-year breaks.
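A sketch of those steps (the 1931 start and 2020 end for the breaks are assumptions; adjust to the range of your data):

```r
# String -> Date; anydate() is the Date-returning cousin of anytime()
moma_long$DateAcquired <- anydate(moma_long$DateAcquired)

# How many rows have no usable acquisition date?
sum(is.na(moma_long$DateAcquired))

# Bin acquisition years into five-year chunks: 1931-1935, 1936-1940, ...
yr <- as.numeric(format(moma_long$DateAcquired, "%Y"))
moma_long$period <- cut(
  yr,
  breaks = seq(1931, 2021, by = 5), right = FALSE,
  labels = paste(seq(1931, 2016, by = 5), seq(1935, 2020, by = 5), sep = "-")
)
```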
I then can actually do the mapping! You’ll need to download world shape files; I downloaded the ones I use from Natural Earth, a truly awesome site, which you can read more about here.
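Here’s a sketch of the mapping step using the sf package (one way to do it, not necessarily the only one); ne_110m_admin_0_countries is the name of the Natural Earth download, and NAME is its country-name field:

```r
# Natural Earth country boundaries, downloaded and unzipped locally
world <- st_read("ne_110m_admin_0_countries.shp")

# Logged acquisition counts per country per five-year period
acq <- moma_long %>%
  count(country, period) %>%
  mutate(log_n = log10(n))

# Choropleth for one five-year period; the crosswalk's country names
# have to match Natural Earth's NAME field exactly for the join to work
map_one_period <- function(p) {
  dat <- left_join(world, filter(acq, period == p), by = c("NAME" = "country"))
  ggplot(dat) +
    geom_sf(aes(fill = log_n)) +
    scale_fill_viridis_c(na.value = "grey90") +
    ggtitle(paste("MoMA acquisitions,", p)) +
    theme_void()
}
```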
Now I could just produce these maps individually - here’s 1931-1935:
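With the helper above:

```r
# The early-1930s snapshot
map_one_period("1931-1935")
```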
But just for fun, I thought it would be nice to make a gif using the animation package. Getting this package to work is a bit involved, especially on a Mac (you have to install a program called ImageMagick, which is not as simple as it sounds; Google for details), but the code below should get you started.
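A minimal sketch, assuming ImageMagick is installed and on your PATH:

```r
# Draw one map per five-year period and let saveGIF() stitch them together
periods <- levels(moma_long$period)

saveGIF({
  for (p in periods) print(map_one_period(p))
}, movie.name = "moma_acquisitions.gif", interval = 1.5)
```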