SIN meets LSA

A few weeks back I had the opportunity to spend time with some of the wonderful people at Indiana University. Together we worked on all things Newtonian, and in particular on a computational method called Latent Semantic Analysis (LSA). This is a technique that dates from the 1980s and has some of its roots – surprisingly at first – in psychology. Apparently, as children we learn on average 10-15 words a day, and not just the words we are consciously taught. We pick up words on the fly, because they are used by the people around us in connection with the words that we already know. And although their meaning may not be directly obvious to us, we have some clue as to what they signify because of their connection to the words we are familiar with. Their meaning in a particular context – their semantics – is present in an implicit – latent – way.

So, what has this got to do with Sir Isaac Newton, aka SIN (see my “Was Newton a Man?”) and his manuscripts? Hang on, things will become clear in a minute. Let’s first see LSA in action. Or rather, see it in action again, because without realising it you are already very familiar with it. Ever done a Google search and ended up with a number of hits that did not contain the exact words you were looking for? That’s because Google and other search engines have indexed their databases. They have made those implicit connections in such a way that if you are looking for, let’s say, information on Egyptian gods, you will also be referred to sites that contain stuff on the Zodiac and astrology. These sites do not contain Egyptian and gods as keywords, but Google ‘knows’ that texts on the Zodiac tend to contain information on Egyptian gods. Well, if Google can do it – knowing things about texts without essentially reading them – we can do it. Right?

Yes we can. And we should. Newton’s manuscripts are a dazzling labyrinth of thousands of disorganised folios, paragraphs haphazardly interspersed with lines on other topics, and so on. Part of my research involves recreating the order in which Newton left his manuscripts, or at least intended them to be read. To do so, we first need to identify the pieces of the puzzle. By performing Latent Semantic Analysis, we turn all our manuscripts into a huge table. On the horizontal axis we have all of the manuscripts, and on the vertical axis all the words that those manuscripts contain. When the same word occurs with different spellings, we count them as one word. I will spare you the technical details, but after some wizardry in Python and other code we end up with a bunch of vectors – columns of numbers that each represent a small chunk of a manuscript with its words. We then have our silicon friend compare each of these vectors with the others and assign every pair a value between 0 and 1, based on what is called their ‘cosine similarity’. The more closely two chunks are related, the closer to 1 their cosine similarity is. A typical result would look like this:

Chunk on the right, chunk on the left, and a figure in red telling us how closely they are related. 0.994571 means: pretty close…
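
For the curious, the steps described above (word-by-chunk table, spelling normalisation, vectors, cosine comparison) can be sketched in a few lines of Python. This is only a minimal illustration and not the code we actually used: the four toy chunks, the SPELLING lookup and the helper names are all made up for the example, and the reduction to two dimensions is an arbitrary choice. The one real ingredient is NumPy's singular value decomposition, which does the 'latent' part of the work.

```python
import numpy as np

# Hypothetical spelling table: variant -> normalised form (an assumption).
SPELLING = {"mercurie": "mercury", "dissolveth": "dissolves", "prophecie": "prophecy"}

# Invented toy chunks, not transcriptions of Newton's manuscripts.
CHUNKS = {
    "A": "the philosophers mercury dissolves gold and silver",
    "B": "our mercurie dissolveth gold and silver in the fire",
    "C": "the prophecy of daniel concerns the beast with ten horns",
    "D": "the beast with ten horns in the prophecie of daniel",
}


def tokens(text):
    """Lower-case, split on whitespace, and merge spelling variants."""
    return [SPELLING.get(word, word) for word in text.lower().split()]


def chunk_vectors(chunks, k=2):
    """Build the word-by-chunk table and reduce it to k dimensions with SVD."""
    names = list(chunks)
    vocab = sorted({word for text in chunks.values() for word in tokens(text)})
    row = {word: i for i, word in enumerate(vocab)}
    # Rows are words, columns are chunks, cells are word counts.
    table = np.zeros((len(vocab), len(names)))
    for col, name in enumerate(names):
        for word in tokens(chunks[name]):
            table[row[word], col] += 1
    # Keep only the k strongest dimensions of the decomposition.
    _, s, vt = np.linalg.svd(table, full_matrices=False)
    reduced = (np.diag(s[:k]) @ vt[:k]).T  # one row per chunk
    return dict(zip(names, reduced))


def cosine(a, b):
    """Cosine similarity: values near 1 mean the vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


VECTORS = chunk_vectors(CHUNKS)
for left, right in [("A", "B"), ("A", "C"), ("C", "D")]:
    print(left, right, round(cosine(VECTORS[left], VECTORS[right]), 6))
```

The two alchemical chunks should score much closer to each other than either does to the chunks on prophecy, even though their spelling differs.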

This technique has already been employed successfully by others, notably my colleagues at Indiana who created the Chymistry of Isaac Newton Project website. For example, the following two passages turn out to be two drafts of the same text, as you can see by comparing the highlighted words:

As you can see in their headings, these chunks come from completely different manuscripts. Yet…

They have even developed an online tool, where you – yes, you – can perform Latent Semantic Analysis on all of Newton’s alchemical manuscripts. So, be my guest and get on with it!

Quick instructions: choose Chunk-Chunk, 250 word Chunks, Descending Order and All Above Chosen Value, then click Continue. Select All Chunks, click Add Chunk and then Continue. As the last step, select a threshold value (I suggest 0.9 for now; it will still give you plenty of results).

Click Run and be amazed.

You can click on each of the figures on the left to see the underlying manuscript chunks, with highlighted matches. Don’t be surprised if only a few words are highlighted: these are the important words, for instance words that are very specific to this set of manuscripts and do not recur elsewhere.
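
If you wonder what those settings amount to, here is a rough sketch of the same recipe: cut a text into 250-word chunks, score every pair of chunks, keep the pairs above a chosen value, and list them in descending order. It is not the online tool's code. All function names are invented for the example, and the score is a plain cosine on raw word counts standing in for the real LSA score.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def split_into_chunks(text, size=250):
    """Cut a text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def count_cosine(a, b):
    """Cosine similarity on raw word counts (a stand-in, not the LSA score)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[word] * cb[word] for word in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def pairs_above(chunks, threshold=0.9, score=count_cosine):
    """Score every chunk-chunk pair, keep those above the threshold,
    and return them with the highest score first."""
    scored = [(score(chunks[a], chunks[b]), a, b) for a, b in combinations(chunks, 2)]
    return sorted((pair for pair in scored if pair[0] > threshold), reverse=True)
```

Calling pairs_above on a dictionary that maps chunk names to chunk texts returns a list of (score, name, name) tuples, which is roughly the shape of the result list the tool shows you.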

This technique has helped the Chymistry team to identify related pages from completely separate manuscripts, an identification that would have been very hard without first digitizing and then transcribing all of these manuscripts. It is only now that we are able to intelligently apply tools and methods like LSA to do this sort of comparative research – and man, it is fun.

Next time we go archaic and look for dog-ears…
