What Is a Speech Corpus?

T. Carrier

A speech corpus, also known as a spoken corpus, is a collection of recorded speech preserved as audio, as text transcriptions, or both. These collections are useful in developing speech software and in conducting linguistic studies. Speech corpora are generally divided into two varieties: spontaneous speech and read speech.

It helps to define the words "speech" and "corpus." In everyday use, a speech is a set of thoughts and facts delivered orally, but here speech covers any spoken utterance. A corpus, in turn, refers to a structured collection of language samples.

A speech corpus is generally built from audio recordings, text-based transcriptions, or both. Recordings are captured with audio equipment and stored as digital audio files, such as MP3 or WAV files, in electronic databases. A transcriber, on the other hand, converts spoken language into a written form, which is then compiled with other transcriptions.
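As a rough illustration of how recordings and transcriptions can sit side by side in one collection, the following Python sketch models a corpus as a list of entries. The `Utterance` class, its field names, the speaker labels, and the file paths are all hypothetical and chosen only for this example; real corpora use many different layouts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    """One corpus entry. All field names are illustrative assumptions."""
    speaker_id: str
    audio_path: Optional[str]   # location of the recording, if one exists
    transcript: Optional[str]   # written form, if it has been transcribed


def transcribed_only(corpus):
    """Return the entries that already have a written transcription."""
    return [u for u in corpus if u.transcript is not None]


# A made-up three-entry corpus mixing audio-only and text-only items.
corpus = [
    Utterance("spk01", "recordings/spk01_001.wav", "good morning everyone"),
    Utterance("spk02", "recordings/spk02_001.wav", None),  # not yet transcribed
    Utterance("spk03", None, "once upon a time"),          # transcript only
]

print(len(transcribed_only(corpus)))
```

Keeping the audio reference and the transcript in the same record makes it easy to see which recordings still need a transcriber.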

Any type of speech may be found in a speech corpus, but such databases are generally divided into two categories. The first, spontaneous speech, contains unscripted speech, such as that found in conversations or in oral storytelling. Read speech, by contrast, has a more formal, pre-planned structure. Examples include political speeches, news broadcasts, and audiobook readings. Some types, such as interviews, may fall into either category depending on the context.

One major advantage of a speech corpus is its practical usefulness in developing speech-based software. For example, many computers and other electronic devices offer speech features such as reading typed text aloud (speech synthesis), transforming spoken words into text (speech recognition), or identifying a speaker by unique vocal traits (speaker recognition). Data extracted from a speech corpus can improve this technology by serving as training material for acoustic models, which are statistical representations of the relationship between audio signals and individual speech sounds. In addition, the databases can assist with developing audio materials for language learning.
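The idea of a statistical model per speech sound can be shown with a deliberately tiny sketch. It assumes each labeled sound in a corpus has been reduced to a single numeric measurement and fits one Gaussian (a mean and a variance) per sound. The numbers, the sound labels, and the function names are invented for illustration; production acoustic models use far richer features and methods.

```python
import math
from statistics import mean, pvariance

# Hypothetical training data: one made-up acoustic measurement per example,
# grouped by the speech sound it was labeled with in the corpus.
training = {
    "a": [700.0, 720.0, 690.0, 710.0],
    "i": [300.0, 280.0, 310.0, 290.0],
}


def fit(samples_by_sound):
    """Estimate a mean and variance for each sound (one Gaussian per sound)."""
    return {s: (mean(x), pvariance(x)) for s, x in samples_by_sound.items()}


def log_likelihood(x, mu, var):
    """Log density of a one-dimensional Gaussian evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def classify(x, model):
    """Pick the sound whose Gaussian assigns x the highest likelihood."""
    return max(model, key=lambda s: log_likelihood(x, *model[s]))


model = fit(training)
print(classify(695.0, model))  # a
print(classify(320.0, model))  # i
```

The more labeled examples a corpus supplies for each sound, the more reliable the estimated statistics become, which is one reason large corpora matter for speech software.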

These functions tie in with another application for a speech corpus: scholars can take the preserved audio or written files and study the subtle variations in how language is actually used. A speech corpus can therefore serve as a valuable tool for learning about pronunciation, word order, and other linguistic features. Researchers can also compare regional dialects and languages if they create a collection that spans several of them, known as a multilingual corpus. The study of language through corpora, whether spoken or written, is known as corpus linguistics. It draws heavily on methods from computational linguistics, the broader field concerned with processing language by computer.

Many transcript databases include notations or tags that carry information about the individual components of a text, such as a word's part of speech or the identity of the speaker. Adding these tags is called annotation. In the next step, abstraction, linguists document the annotated terms and map them onto the terms of a theoretical model. Such documentation may be useful to someone who wishes to learn about an unfamiliar civilization through its texts. The final step of corpus study is analysis: deriving comparisons and generalizations from a collection of speech components.
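A small Python sketch can make annotation and analysis concrete. It assumes a transcript has already been split into words and that each word carries a part-of-speech tag. The sentence, the tag labels, and the helper names are hypothetical, and real corpora follow their own annotation schemes.

```python
from collections import Counter

# Hypothetical annotated transcript: each word paired with a part-of-speech tag.
annotated = [
    ("we", "PRON"), ("will", "AUX"), ("build", "VERB"),
    ("new", "ADJ"), ("roads", "NOUN"), ("and", "CCONJ"),
    ("new", "ADJ"), ("schools", "NOUN"),
]


def tag_counts(tokens):
    """Count how often each tag occurs in the annotated transcript."""
    return Counter(tag for _, tag in tokens)


def words_with_tag(tokens, tag):
    """List the words carrying a given tag, in their original order."""
    return [word for word, t in tokens if t == tag]


print(tag_counts(annotated)["NOUN"])
print(words_with_tag(annotated, "NOUN"))
```

Counting tags is a very simple form of analysis; comparing such counts across speakers or dialects is the kind of comparison a larger study might build on.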
