Daniel Hopkins and Gary King. "A Method of Automated Nonparametric Content Analysis for Social Science," forthcoming American Journal of Political Science, copy at http://gking.harvard.edu/files/abs/words-abs.shtml.

Abstract

The massive increase in text available in digital formats presents enormous opportunities for social scientists. Yet systematically hand coding a significant share of the available blogs, speeches, emails, web pages, government records, newspapers, or other digitized texts is infeasible. Although computer scientists have developed effective methods for automated content analysis, those methods aim to classify individual documents correctly, whereas social scientists are usually interested in generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even classifiers that categorize individual documents with high accuracy can be hugely biased when estimating category proportions. By directly optimizing for the broader goal of many social scientists, we develop a method that gives approximately unbiased estimates of the category proportions. We illustrate the method with several diverse data sources, including the daily expressed opinions of hundreds of thousands of people about the U.S. presidency. We also make available easy-to-use software that implements our methods and large corpora of text for further analysis.
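
The claim that an accurate classifier can still misestimate category proportions can be illustrated with simple arithmetic. The sketch below is not the authors' method or software. It assumes a hypothetical two-category classifier whose sensitivity and specificity are known and fixed, and the numbers (a 10 percent true share, 90 percent sensitivity and specificity) are invented for illustration. The function names are likewise hypothetical. The final correction is a generic textbook adjustment for known error rates, included only to show that the bias is systematic.

```python
def classify_and_count(true_share, sensitivity, specificity):
    """Expected share of documents the classifier labels as in-category."""
    true_positives = sensitivity * true_share
    false_positives = (1.0 - specificity) * (1.0 - true_share)
    return true_positives + false_positives


def accuracy(true_share, sensitivity, specificity):
    """Expected share of individual documents classified correctly."""
    return sensitivity * true_share + specificity * (1.0 - true_share)


def corrected_share(observed_share, sensitivity, specificity):
    """Invert the misclassification relationship (generic adjustment,
    assumes the error rates are known and sensitivity != 1 - specificity)."""
    false_positive_rate = 1.0 - specificity
    return (observed_share - false_positive_rate) / (sensitivity - false_positive_rate)


# Illustrative values only; none of these come from the paper.
true_share = 0.10
sensitivity = 0.90
specificity = 0.90

observed = classify_and_count(true_share, sensitivity, specificity)
acc = accuracy(true_share, sensitivity, specificity)
recovered = corrected_share(observed, sensitivity, specificity)

print(f"individual-document accuracy: {acc:.2f}")
print(f"true category share:          {true_share:.2f}")
print(f"classify-and-count estimate:  {observed:.2f}")
print(f"share after correction:       {recovered:.2f}")
```

Under these assumed values the classifier labels 90 percent of documents correctly, yet counting its labels gives a category share of about 18 percent when the true share is 10 percent, because false positives from the much larger out-of-category group outnumber the missed in-category documents.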
