360Giving Data Visualisation Challenge

Finding a 2-d notion of theme with techniques from NLP and ML

I chose to answer the question 'Who has funded what themes throughout the years?'. On reading it, I was struck by the word 'theme'. After exploring the data in GrantNav, I wondered whether techniques from Natural Language Processing and Machine Learning could extract a notion of theme from the text associated with each grant.
In brief, I obtained a vector for each grant by taking a weighted average of the vector representations of its words, then reduced the whole set of vectors to two dimensions so that the grants could be visualised in the browser. The aim is that the position of each circle encodes the 'theme' of the grant; its area corresponds to the amount awarded (GBP) and its colour to the funding organisation.
To obtain the vector for each grant, I averaged the pre-trained word vectors for the words in its title, description and recipient organisation, weighting each word by its inverse document frequency (IDF) among grants from the same funding organisation. I chose this weighting to suppress the influence of the funding organisation on the clustering of the grants, because for certain organisations the same words or phrases appeared frequently. To reduce the number of dimensions of this set of vectors from 300 to 2, I applied truncated singular value decomposition (SVD) and t-distributed Stochastic Neighbour Embedding (t-SNE).
Because many of the circles overlap (the amounts awarded vary by orders of magnitude), I decided to 'collide' the circles in D3.js; in effect, each circle is tethered to its computed position but can be nudged away from it by its neighbours.
The final visualisation displays only grants of at least £1m awarded between 2004 and 2017; the vectors, however, were calculated using the full GrantNav data set. (Be warned: on my laptop this took many hours!) Restricting the displayed data improves the load time and real-time performance of the visualisation, both of which still leave something to be desired.
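
The IDF-weighted average described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code behind the visualisation: the function names, the toy three-dimensional word vectors and the example grants are all hypothetical, and the unsmoothed log(N / df) form of IDF is an assumption, since other variants exist. The reduction to two dimensions is not covered here.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Map each word to log(N / df), where df is the number of documents
    that contain it. (Assumed IDF variant; smoothed forms also exist.)"""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def grant_vector(words, word_vectors, idf):
    """IDF-weighted average of the vectors of the words that have one.
    Returns None if no word carries any weight."""
    known = [w for w in words if w in word_vectors and w in idf]
    total = sum(idf[w] for w in known)
    if total == 0:
        return None
    dims = len(next(iter(word_vectors.values())))
    return [
        sum(idf[w] * word_vectors[w][i] for w in known) / total
        for i in range(dims)
    ]


# Hypothetical toy data: tokenised text of two grants from ONE funder,
# so that the IDF is computed among grants of the same organisation.
funder_grants = [
    ["towards", "core", "costs"],
    ["towards", "cancer", "research"],
]

# Hypothetical 3-d stand-ins for 300-d pre-trained word vectors.
word_vectors = {
    "towards": [1.0, 0.0, 0.0],
    "core": [0.0, 1.0, 0.0],
    "costs": [0.0, 1.0, 0.0],
    "cancer": [0.0, 0.0, 1.0],
    "research": [0.0, 1.0, 1.0],
}

idf = idf_weights(funder_grants)
vector = grant_vector(funder_grants[1], word_vectors, idf)
print(vector)
```

In this toy example 'towards' appears in every grant from the funder, so its weight is zero and it does not contribute to the grant's vector; this is the sense in which the weighting suppresses wording that a funder repeats across its grants.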
The question, of course, is whether this representation is meaningful or useful. That is, do the relationships between the grants in 2-d space correspond to our intuition? Medical topics occupy the northernmost area of the page and appear better separated than other grants; I expect this is because of their precise vocabulary and often detailed descriptions. The approach is less successful with terse or boilerplate text, such as 'towards core costs'. This is very much the start of an investigation, and in future I hope to apply these techniques in other ways, e.g. performing classification tasks by evaluating the nearest neighbours of an unknown datapoint. I would encourage you to explore the landscape for yourself and see what semantic relationships you can find.

Tim Lawson

Software Developer