
Project Watch: The New Testament Hyper-concordance

I thank Sean Boisen for bringing this project to my attention. There's a lot of interesting material on his site that I feel tempted to blog about, but today I'd like to stick to the New Testament Hyper-concordance. On his blog, Blogos, he describes the project thus:

The basic idea is to navigate the space of Scripture directly using words. Most Scripture websites have a search box where you enter a word to find verses that use that word. For example, searching the English Standard Version New Testament for the word "pots" finds two verses, Mark 7:4 and Revelation 2:27 (you need to use the advanced search and select "Exact matches only"). From the standpoint of connecting information, this provides a link from a single word to one or more verses of Scripture.

Taking this idea one step further, given the text of the verse, you can just embed a hyperlink from the word in question to other verses, preserving the context. Now here's where the idea takes off: instead of just hyperlinking one word, suppose every word is hyperlinked? This more tightly connects the information and gets you directly from the context of one verse to another with similar content (because of similar words). With some special processing to index the words, every word can link to a list of verses, each word of which is in turn hyperlinked to others, each word of which ... you get the idea.
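[Ed.] The "special processing to index the words" can be pictured as an inverted index that maps each word to the verses containing it. The following is only a minimal Python sketch of that idea, not the project's actual scripts. The `Toy.*` references, the `build_index` and `link_words` names, and the `word.html` link scheme are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Toy corpus: reference -> text. Only the first entry comes from the post;
# the "Toy.*" entries are invented placeholders, not Scripture quotations.
VERSES = {
    "2Tim.3.16": "All scripture is inspired by God and profitable for teaching",
    "Toy.1.1": "teaching is profitable",
    "Toy.1.2": "God is good",
}

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(verses):
    """Map every word to the set of references whose text contains it."""
    index = defaultdict(set)
    for ref, text in verses.items():
        for word in tokenize(text):
            index[word].add(ref)
    return index

def link_words(text):
    """Wrap every word in a link to a (hypothetical) per-word page."""
    return re.sub(
        r"[A-Za-z]+",
        lambda m: f'<a href="{m.group(0).lower()}.html">{m.group(0)}</a>',
        text,
    )

INDEX = build_index(VERSES)
```

Each per-word page would then list the verses in `INDEX[word]`, with every verse passed through `link_words` so that the reader can keep jumping from word to word.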

Here's an example from the page for "Scripture" (the links are live into Hyper-concordance):

2Tim.3.16
All scripture is inspired by God and profitable for teaching, for reproof, for correction, and for training in righteousness,

The word "scripture" isn't hyper-linked ([Ed.] It is now), since that would take you back where you already are. This is the only occurrence of the words "reproof" and "correction" in the New Testament, so there's no benefit in linking these: you'd only get to the same verse. The other unlinked words are high-frequency function words: they could be linked, but there would be little added value, and it would take a lot more space (the entire hyper-concordance as static HTML only amounts to about 30Mb).
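[Ed.] The three reasons given above for leaving a word unlinked (it is the word of the current page, it occurs in only one verse, or it is a high-frequency function word) can be written as a single predicate. This is a hedged sketch under my own assumptions: the `should_link` name, the tiny stopword set, and the sample index are hypothetical and are not taken from the project's code or lists.

```python
# Hypothetical stopword list; the project ships its own, longer list.
STOPWORDS = {"all", "is", "by", "and", "for", "in"}

def should_link(word, page_word, index, stopwords=STOPWORDS):
    """Decide whether `word` deserves a hyperlink on the page for `page_word`.

    `index` maps each word to the set of verse references that contain it.
    """
    word = word.lower()
    if word == page_word:
        return False  # the link would lead back to the current page
    if word in stopwords:
        return False  # function word: little value, lots of space
    if len(index.get(word, ())) <= 1:
        return False  # single occurrence: the link would return to this verse
    return True

# Invented sample index for demonstration only.
SAMPLE_INDEX = {
    "scripture": {"2Tim.3.16", "Toy.1.1"},
    "teaching": {"2Tim.3.16", "Toy.1.2"},
    "reproof": {"2Tim.3.16"},
    "for": {"2Tim.3.16", "Toy.1.1", "Toy.1.2"},
}
```

With this sample data, `should_link("teaching", "scripture", SAMPLE_INDEX)` is true, while "scripture", "reproof", and "for" are all left unlinked on the "scripture" page.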

Inflected verbs and plural nouns are linked to their base forms (in this example, "inspired" -> "inspire", "training" -> "train"). Most other Scripture search engines I've seen either match exactly (treating "inspired" and "inspire" as two different words), match substrings (for "pot", this has the peculiar result of matching "spots" and "Mesopotamia"!), or match from the beginning of the word ("pot" matches "pots", "potter", etc.). I wanted to try to do a better job of matching the real dictionary form of words...
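To make the difference between these four matching strategies concrete, here is a small Python sketch that runs each of them over the same word list. The function names and the `BASEFORMS` table are my own assumptions for illustration; the table contains only the mappings mentioned above plus "pots" -> "pot", and it is not the project's real list.

```python
WORDS = ["pot", "pots", "potter", "spots", "Mesopotamia", "inspired", "inspire"]

# Hypothetical base-form table (inflected form -> dictionary form).
BASEFORMS = {"pots": "pot", "inspired": "inspire", "training": "train"}

def match_exact(query, words):
    """Only words identical to the query."""
    return [w for w in words if w.lower() == query]

def match_substring(query, words):
    """Any word containing the query anywhere."""
    return [w for w in words if query in w.lower()]

def match_prefix(query, words):
    """Any word that starts with the query."""
    return [w for w in words if w.lower().startswith(query)]

def match_base_form(query, words):
    """Words whose dictionary form equals the query."""
    return [w for w in words if BASEFORMS.get(w.lower(), w.lower()) == query]
```

For the query "pot", exact matching finds only "pot", substring matching also drags in "spots" and "Mesopotamia", prefix matching adds "potter", and base-form matching returns just "pot" and "pots".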

The output looks really nice, as you can see for yourself if you click on the hyperlinks found in the verse above. I like his general approach, and since it is an open-source project, you can actually access the code, check his scripts, modify his baseforms lists (with sections on past tenses, regular plurals, irregular plurals, -ing forms, and irregular pasts) and his list of stopwords, or even choose a different Bible version (instead of the original RSV). The author welcomes suggestions and bug reports, and hopes to fix the reported problems in future versions.

One of the things that I found particularly interesting was his explanation of some of the linguistic issues he had to wrestle with in the course of his work. Although we are dealing here with an English version of the Bible, and therefore with a language that has a rather limited amount of inflection, this is a good test case for the kinds of issues scholars have to face when they "tag" Hebrew or Greek texts. Sean tells of the list he created in order to map inflected forms and plurals to their bases (root or dictionary forms), and how complex this turned out to be. Apart from clear oversights (like keeping "child" and "children" as two separate forms), there is a whole philosophical framework that precedes the actual tagging of the text. I'll have more to blog about this issue one of these days, but suffice it to say here that it is very important to make explicit what our presuppositions are.
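A rough way to picture such a list is as a set of lookup tables, one per section, consulted in turn. The sketch below is only my own illustration in Python: the section names follow the ones mentioned earlier, but every entry, the `base_form` function, and the lookup order are assumptions, not the contents or logic of Sean's files.

```python
# Hypothetical sections of a base-form list (inflected form -> base form).
BASEFORM_SECTIONS = {
    "regular plurals": {"pots": "pot"},
    "irregular plurals": {"children": "child"},
    "past tenses": {"inspired": "inspire"},
    "-ing forms": {"training": "train"},
    "irregular pasts": {"went": "go"},
}

def base_form(word, sections=BASEFORM_SECTIONS):
    """Return (base form, section name); the section is None if no table matched."""
    w = word.lower()
    for name, table in sections.items():
        if w in table:
            return table[w], name
    return w, None  # unknown words are treated as their own base form
```

Note what happens when an entry is missing: without "children" in the irregular plurals table, `base_form("children")` would simply return "children", which is exactly how "child" and "children" end up as two separate forms.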

Whatever course we decide to take, certain searches will be clearly word-based, while others will have to be based on related meanings (technically speaking, semantic domains). This dilemma is a very simple example of the creative tension that exists between form and meaning, between morphology and function. It is very clear to me that a single database cannot be all things to all people. Maybe we should stop asking for the impossible (i.e., absolute consistency in tagging) and start considering alternatives like multi-level tagging and specialized databases for specific types of uses. IMO, this is the way to go. Admittedly, this would be a quantum leap for Bible software, but a very necessary and stimulating one. Rubén dixit ;-)
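As a postscript, here is a very small Python sketch of what I mean by multi-level tagging: each token carries several independent layers, and a search picks the layer it needs. Everything in it is an assumption made for illustration. The `Token` fields, the `search` function, and especially the semantic domain labels are invented and do not come from any existing database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    ref: str      # verse reference
    surface: str  # the word as it appears in the text (form)
    lemma: str    # dictionary form
    domain: str   # semantic domain (hypothetical label)

TOKENS = [
    Token("2Tim.3.16", "teaching", "teaching", "instruction"),
    Token("2Tim.3.16", "training", "train", "instruction"),
    Token("2Tim.3.16", "inspired", "inspire", "communication"),
]

def search(tokens, level, value):
    """Return the surface forms of tokens whose `level` layer equals `value`."""
    return [t.surface for t in tokens if getattr(t, level) == value]
```

A word-based search such as `search(TOKENS, "lemma", "train")` finds only "training", whereas a meaning-based search such as `search(TOKENS, "domain", "instruction")` finds both "teaching" and "training". Neither layer has to be bent to serve the other.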

About

This page contains a single entry from the blog posted on April 6, 2004 6:52 PM.
