A Personal Exploration of Digital History

A unique perspective on the expanding topic of the digital past.
A Way with Words

One of the most striking insights I gained from reading Dan Cohen’s online essay, From Babel to Knowledge: Data Mining Large Digital Collections, is how much it matters which words are included and which are excluded, both when executing a search and when building search results. For background, Cohen’s central task in the essay is creating a search engine for course syllabi that consistently filters out other content, giving a better grasp of what is being taught and how. To achieve this, Cohen relies on both the inclusion and the exclusion of certain words.

A “wordle” visualization showing common words in this very blog post.

First, it is important to screen out common words in whichever language you are working with. In the case of English, “and,” “the,” and “a,” among others, are all good candidates to screen out when searching for the root of what a document is about. Once these words are eliminated, it is easy to see which types of words make up the bulk of the document. While word cloud generators can do this in a way that results in a nice visual, other programs like Yahoo’s Term Extraction Application Programming Interface (API) can give a more “plain view” summary of the most used words in a piece of text.
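To make the screening step concrete, here is a minimal sketch in Python of counting words after common ones are removed. The stop-word list and the name top_terms are my own assumptions for illustration; this is not how Yahoo’s API or any particular word cloud generator works internally.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list (an assumption for illustration;
# real tools use much longer lists).
STOP_WORDS = {"and", "the", "a", "an", "of", "to", "in", "is", "it", "for"}

def top_terms(text, n=5):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lowercase the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)

if __name__ == "__main__":
    sample = "The cat and the dog and the cat"
    print(top_terms(sample, 2))
```

What remains after the filter is the same information a word cloud shows, just as a plain ranked list instead of a picture.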

These highly used words are important because a pattern emerges in which words appear frequently across documents of the same type. In the case of syllabi, a season (Spring or Fall) is often followed by four digits, and terms such as “office hours” appear frequently. By having the search engine look for these common terms when searching only for syllabi, it is likely that the majority of the results returned will in fact be syllabi. These words can also be excluded from a search (similar to the way “and” and “a” are excluded from Google searches), setting aside the known commonalities of syllabi so that the search can focus on their true purpose and content.
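The two markers mentioned above can be checked with a few lines of Python. This is only a sketch under my own assumptions: the scoring scheme, the threshold, and the names syllabus_score and looks_like_syllabus are hypothetical, and Cohen’s actual search engine is certainly more involved.

```python
import re

# A season name followed by a four-digit year, e.g. "Fall 2005".
SEMESTER = re.compile(r"\b(?:spring|fall)\s+\d{4}\b", re.IGNORECASE)

# Marker phrases taken from the post; a real list would be longer.
MARKER_PHRASES = ("office hours",)

def syllabus_score(text):
    """Count how many syllabus markers appear in text."""
    score = 1 if SEMESTER.search(text) else 0
    lowered = text.lower()
    score += sum(1 for phrase in MARKER_PHRASES if phrase in lowered)
    return score

def looks_like_syllabus(text, threshold=2):
    """Guess whether text is a syllabus (threshold is an assumed cutoff)."""
    return syllabus_score(text) >= threshold

if __name__ == "__main__":
    page = "History 101, Fall 2005. Office hours: Tuesdays 2-4."
    print(looks_like_syllabus(page))
```

Requiring both markers keeps a sentence like “the leaves fall in autumn” from counting, since “fall” only matches when four digits follow it.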

Another way that frequently used words can help is in determining which of two similarly spelled terms a document refers to. The example used in the essay is George H. W. Bush and George W. Bush. The problem lies in the fact that the names are only one initial apart, so a typical search would not differentiate much between the two unless quotation marks are used to force the search engine to display only exact matches. By finding words whose popularity is specific to each presidency and adding them to the search, it is more likely that an accurate result will be obtained.
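Here is a rough Python sketch of that idea, both for expanding a query and for guessing which president a passage is about. The context terms (each vice president’s surname and two years from each term in office) are hand-picked by me for illustration and are not taken from Cohen’s essay; the function names are hypothetical as well.

```python
import re

# Hand-picked context terms for each name (an assumption for illustration).
CONTEXT_TERMS = {
    "George H. W. Bush": {"quayle", "1989", "1991"},
    "George W. Bush": {"cheney", "2001", "2003"},
}

def expand_query(name):
    """Build a query: the exact name in quotes plus its context terms."""
    terms = " ".join(sorted(CONTEXT_TERMS[name]))
    return f'"{name}" {terms}'

def guess_president(text):
    """Return the name whose context terms appear most, or None if unclear."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {name: len(words & terms) for name, terms in CONTEXT_TERMS.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up) = ranked[0], ranked[1]
    # No evidence, or a tie between the two names: decline to guess.
    if best_score == 0 or best_score == runner_up:
        return None
    return best

if __name__ == "__main__":
    print(expand_query("George W. Bush"))
    print(guess_president("President Bush and Vice President Quayle met in 1989."))
```

Declining to guess on a tie seemed safer to me than picking one name arbitrarily, since a passage that mentions only “Bush” gives no real evidence either way.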

So, what is the moral of this story? Don’t overlook those seemingly silly word clouds. The frequency with which words are used in a given piece of text provides invaluable clues as to its content and purpose and can greatly improve the quality of a search. While the focus on design in word clouds may be too extravagant for this purpose, there are a variety of programs available for analyzing word usage that can maximize the ability to find and categorize documents. Furthermore, these methods can also improve accuracy by establishing a sort of standard for words that allows them to be differentiated from similar words. Overall, it is clear that those with a way with words hold a serious advantage in creating the next subject-specific search engine.
