A frequency list is a tool for quantitative linguistic analysis, a listing of everything that appears in a chosen block of text and how frequently it occurs. Linguistic analysis is a cross-disciplinary field that studies the structure of language and how it is used. Combining elements of anthropology, mathematics, computer science and logic, linguistic analysis is used for projects such as mechanical translation, cryptography and deciphering ancient writings.
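As a rough illustration of the idea, a word frequency list can be built in a few lines of Python. This is a minimal sketch, not a standard tool: the function name `word_frequencies`, the simple rule for what counts as a word, and the sample sentence are all assumptions made for the example.

```python
from collections import Counter
import re


def word_frequencies(text):
    """Return (word, count) pairs, most frequent first.

    A "word" here is any run of letters or apostrophes, which is a
    simplifying assumption; real studies use more careful tokenizing.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


# Hypothetical sample text, chosen only to show the output shape.
sample = "the cat sat on the mat and the dog sat too"
for word, count in word_frequencies(sample)[:3]:
    print(word, count)
# prints:
# the 3
# sat 2
# cat 1
```

Words with equal counts come out in the order they first appear, so the tail of a list built from a small sample is not meaningfully ranked.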
Frequency lists can be listings of words or of letters. Letter frequencies typically are used in cryptography. One of the simplest encryption schemes is the substitution cipher, in which each letter is replaced with another letter or symbol. For example, the message "attack at dawn" might be encoded as "zoozhl zo azqp". The benefit of substitution ciphers is that they don’t require a code book, but the weakness is that they can be cracked by comparing the frequency of letters and letter combinations within the message to a frequency list compiled from ordinary text in the same language.
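The weakness can be seen by counting letters in the "attack at dawn" example. The sketch below is illustrative only: the helper names are invented for this example, and `ENGLISH_ORDER` is one approximate ordering of English letters by frequency (published orderings differ slightly), so it should be read as an assumption, not a reference table.

```python
from collections import Counter

# Approximate ordering of English letters from most to least common.
# Published orderings differ slightly; treat this one as an assumption.
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"


def letter_counts(message):
    """Count alphabetic characters, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in message.lower() if ch.isalpha())


def guess_key(ciphertext, reference=ENGLISH_ORDER):
    """Pair ciphertext letters with reference letters by frequency rank."""
    ranked = [letter for letter, _ in letter_counts(ciphertext).most_common()]
    return dict(zip(ranked, reference))


plain = "attack at dawn"
cipher = "zoozhl zo azqp"

# Substitution renames letters but leaves the pattern of counts untouched.
print(sorted(letter_counts(plain).values(), reverse=True))   # [4, 3, 1, 1, 1, 1, 1]
print(sorted(letter_counts(cipher).values(), reverse=True))  # [4, 3, 1, 1, 1, 1, 1]

# A first guess from rank alone; on a message this short it is unreliable.
print(guess_key(cipher))
```

Because the counts survive encryption, a long enough message lets the most common cipher symbols be matched to the most common letters of the language. On a twelve-letter message the rank-based guess is mostly wrong, since "a" rather than "e" happens to dominate the plaintext.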
In Arthur Conan Doyle’s The Adventure of the Dancing Men, the fictional detective Sherlock Holmes uses frequency analysis to crack a substitution cipher. Historically, codemakers tried various tricks to make their ciphers more difficult to crack with a frequency list: rolling ciphers, in which the substitution used depended on a letter’s position within the message; eliminating or encoding spaces so that word frequencies couldn’t be used; and keeping messages short and avoiding expected words so that code-breakers wouldn’t have a large enough sample for frequency analysis. Ultimately, ciphers of this kind can be broken given a large enough sample, which is why more sophisticated encryption protocols have become standard.
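A position-dependent cipher of the kind described above might look like the following sketch. It is one simple way to make the substitution depend on position, using a repeating list of alphabet shifts; the function names and the shift values are assumptions for illustration and do not reproduce any particular historical system.

```python
import string

ALPHABET = string.ascii_lowercase


def rolling_encrypt(message, shifts):
    """Shift each letter by an amount chosen by its position among the letters.

    Non-letters (such as spaces) pass through unchanged and do not
    advance the position counter.
    """
    out = []
    position = 0
    for ch in message.lower():
        if ch in ALPHABET:
            shift = shifts[position % len(shifts)]
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
            position += 1
        else:
            out.append(ch)
    return "".join(out)


def rolling_decrypt(message, shifts):
    """Undo rolling_encrypt by applying the opposite shifts."""
    return rolling_encrypt(message, [-s for s in shifts])


secret = rolling_encrypt("attack at dawn", [3, 1])
print(secret)                           # duwbfl du gbzo
print(rolling_decrypt(secret, [3, 1]))  # attack at dawn
```

Here the letter "a" is written as "d" in some positions and "b" in others, so a single letter count over the whole message no longer mirrors the language's letter frequencies. With enough text, though, the letters at each position in the repeating cycle can be counted separately, which brings frequency analysis back into play.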
Frequency lists of words and word types are also used in ancient language studies. When Jean-Francois Champollion deciphered the hieroglyphic text of the Rosetta Stone in the 1820s, his process used a mixture of comparing frequencies and transliterations to piece together the hieroglyphic language. Studies have shown that for ancient languages, as for modern English, a core vocabulary of 1,500 to 2,000 words covers 85 to 90 percent of common texts, a level that allows the reader to expand his or her vocabulary from context.
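Coverage figures like these come from asking what share of the running words in a text belongs to the most frequent entries of its frequency list. The sketch below shows that calculation under simple assumptions: the function name `coverage` is invented here, and the tiny sample is only meant to show the arithmetic, not to reproduce the percentages quoted above.

```python
from collections import Counter


def coverage(tokens, vocabulary_size):
    """Fraction of tokens accounted for by the most frequent words.

    tokens is a list of words in running order; vocabulary_size is how
    many of the top entries in the frequency list count as "known".
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(vocabulary_size))
    return covered / len(tokens)


# Hypothetical sample: 11 running words, 8 distinct words.
tokens = "the cat sat on the mat and the dog sat too".split()
print(coverage(tokens, 2))  # "the" (3) + "sat" (2) out of 11 tokens
print(coverage(tokens, 8))  # every distinct word known, so 1.0
```

On a real corpus the same function would be called with a vocabulary size in the thousands, and the result depends heavily on how words are split and whether inflected forms are grouped together.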
Zipf’s law, named for Harvard linguistics professor George Kingsley Zipf, is an empirical observation on the behavior of frequency rankings. It states that the frequency of an event is inversely proportional to its rank, so the second most common item occurs about half as often as the most common one, the third about a third as often, and so on. The event is generally a word or letter in a linguistic frequency list, but Zipf’s law has been generalized to cover other phenomena such as city populations and corporate earnings.
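One informal way to check a frequency list against Zipf’s law is to multiply each count by its rank: if frequency is proportional to 1/rank, the products should all be roughly equal. The sketch below assumes that simple form of the law. The function names are invented for the example, and the counts are idealized numbers made up to show the pattern, not measurements from any corpus.

```python
def zipf_expected(top_count, rank):
    """Count predicted at a given rank if frequency is proportional to 1/rank."""
    return top_count / rank


def rank_frequency_products(counts):
    """Multiply each count by its rank (1 = most frequent).

    Under the simple form of Zipf's law these products are roughly constant.
    """
    ranked = sorted(counts, reverse=True)
    return [rank * count for rank, count in enumerate(ranked, start=1)]


# Idealized, made-up counts that follow the law exactly.
ideal_counts = [1200, 600, 400, 300]
print(rank_frequency_products(ideal_counts))  # [1200, 1200, 1200, 1200]
print(zipf_expected(1200, 4))                 # 300.0
```

Real frequency lists only approximate this pattern, so in practice the products drift instead of staying fixed, particularly at the very top and in the long tail of rare items.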
A frequency list is an important tool in projects to help computers make sense of spoken and written language. Mechanical translation, the use of computers to translate documents from one language to another, is one example. Another example is Watson, the natural language supercomputer that was showcased as a contestant on the television game show Jeopardy! in February 2011. Frequencies both of words and of usage types are incorporated into the programming of such systems as a tool for finding meaning.