Corpora

ViMELF and CASE corpora

224,000 word corpus transcribed from video-mediated conversations in an international English-language context. Created as part of the CASE Project.

This corpus consists of 224,487 words from 35 transcripts of online video conversations. Two sub-corpora are available:

ViMELF: a corpus of Video-Mediated English as a Lingua Franca conversations

ViMELF contains 20 Skype conversations (114,946 tokens) between 40 speakers from Germany (20 speakers), Spain (5), Italy (5), Finland (5), and Bulgaria (5), totalling 744.5 minutes (ca. 12.5 hours), with an average conversation length of 37.23 minutes.

TaCoCASE: Transatlantic component of the Corpus of Academic Spoken English

TaCoCASE consists of 15 computer-mediated conversations (109,541 tokens) between 26 international university students from Germany (8 speakers), the United Kingdom (10), and the United States (8), totalling 650 minutes (ca. 10.5 hours), with an average conversation length of 43 minutes.

This interface combines plain text and annotated versions of the transcripts. Corpus analysis (e.g. word/lemma frequency lists, collocation and concordancing) can be performed on the plain text. The text is in lowercase, speaker IDs are displayed to mark breaks between utterances, and utterances without words are marked by ␣. Concordance results also include popups which show the original markup from the transcripts, as well as the video timestamp and speaker ID.

More information about the corpus, including the transcription conventions, can be found in the CoRD corpus description or on the CASE project website. The latter also provides details for obtaining the source transcripts and video files.

Birmingham Blog Corpus

500 million word corpus built from English language blogging websites, including a 180 million word sub-corpus separated into posts and comments.

This corpus consists of 493,840,018 words extracted from blog texts. The corpus is split into sections according to how the texts were discovered and downloaded:

Technorati: 297,673,541 words
Crawled the top blogs ranked by Technorati.

Google Blog Search: 18,419,018 words
Downloaded new posts daily as identified by Google Blog Search.

Blogspot and Wordpress: 177,747,459 words
Crawled blogs hosted on blogspot.com and wordpress.com. Extracted date information and separated posts from comments:

  • posts: 93,347,696 words
  • comments: 84,399,763 words

Synchronic English Web Corpus

470 million word corpus built from web-extracted texts, including a randomly selected 'mini-web' and high-level subject classification.

This corpus consists of 468,901,590 words (tokens) from web-extracted texts. It covers the period 2000-2010 split into the sub-corpora below.

Mini-Web

341,592,476 words from 100,000 randomly selected web pages, forming a sample of the distribution of texts throughout the web.

Domains

127,309,114 words from 56,000 pages selected based on the Open Directory classification of web pages. Each domain consists of 4,000 pages.

  • Arts: 7,679,629 words
  • Business: 7,959,532 words
  • Computers: 9,236,133 words
  • Games: 9,255,064 words
  • Health: 10,364,577 words
  • Home: 6,992,879 words
  • Kids and Teens: 9,737,110 words
  • News: 6,820,547 words
  • Recreation: 7,417,516 words
  • Reference: 12,096,167 words
  • Science: 14,812,032 words
  • Shopping: 5,157,248 words
  • Society: 11,247,618 words
  • Sport: 8,533,062 words

Diachronic English Web Corpus

130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month.

This corpus consists of 129,705,810 words (tokens) from web-extracted texts. It covers the period Jan 2000 - Dec 2010. Each month contains approximately 1 million words.

Webpages are dated using the method described in Kehoe (2006) below. Only pages where the date was discovered from the server header, HTML metadata or in the textual body near "last modified" or similar are included.
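The monthly balancing described above can be illustrated with a short sketch. The exact sampling procedure is not documented here, so everything in the code is an assumption made for illustration: the `(month, word_count, doc_id)` tuple layout, the function name `balance_by_month`, and the rule of adding randomly ordered pages until a month reaches its word target.

```python
import random
from collections import defaultdict


def balance_by_month(documents, words_per_month, seed=None):
    """Pick random documents for each month until a word target is reached.

    `documents` is an iterable of (month, word_count, doc_id) tuples.
    This layout is hypothetical and not taken from the corpus itself.
    Returns {month: (list_of_doc_ids, total_words)}.
    """
    rng = random.Random(seed)

    # Group the dated documents by month.
    by_month = defaultdict(list)
    for month, word_count, doc_id in documents:
        by_month[month].append((word_count, doc_id))

    selected = {}
    for month, docs in by_month.items():
        rng.shuffle(docs)  # random selection from the larger collection
        chosen, total = [], 0
        for word_count, doc_id in docs:
            if total >= words_per_month:
                break  # target reached, stop adding pages for this month
            chosen.append(doc_id)
            total += word_count
        selected[month] = (chosen, total)
    return selected


# Toy input: five 400-word pages in one month, two 300-word pages in another.
toy_documents = [("2000-01", 400, f"a{i}") for i in range(5)]
toy_documents += [("2000-02", 300, f"b{i}") for i in range(2)]
balanced = balance_by_month(toy_documents, words_per_month=1000, seed=0)
```

Because whole pages are added, a month can overshoot the target slightly, or fall short when too few dated pages exist. That is consistent with each month holding only approximately the same number of words.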

The web corpora available in WebCorp LSE were constructed as outlined below.

Miniweb

The miniweb section of the corpus is intended to be a microcosm of the web. In other words, as few restrictions as possible were placed on the selection of pages with the goal of obtaining a distribution of pages within the corpus that is roughly similar to the distribution of pages from sites across the web itself. Our method for selecting the pages takes inspiration from Baroni and Bernardini (2004):

  1. Select high frequency words from the British National Corpus and our existing newspaper corpora, excluding grammatical words.
  2. Create 100 combinations of three words by choosing words at random from the high frequency word list.
  3. Submit these combinations of words to the Google Search API (no longer in operation) and retrieve the top 5 hits.
  4. Use the full set of Google hits as the seeds for a 'broad' web crawl. A broad crawl means that any link to any web page found may be followed by the crawler. The Heritrix crawler (version 1.12) was used.
  5. Run the crawl for one day.

The above steps were repeated every 6 days between August 2008 and February 2011.
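Step 2 of the procedure above can be sketched as follows. This is a minimal illustration and not the code used to build the corpus: the word list `HIGH_FREQ_WORDS` is a made-up stand-in for the real high-frequency list, and the function name and parameters are hypothetical.

```python
import random


def make_seed_queries(words, n_queries=100, words_per_query=3, seed=None):
    """Build random word combinations to submit as search-engine queries."""
    rng = random.Random(seed)
    vocabulary = sorted(set(words))
    if len(vocabulary) < words_per_query:
        raise ValueError("need at least as many distinct words as words per query")
    queries = []
    for _ in range(n_queries):
        # sample() draws without replacement, so no word repeats within a query.
        queries.append(" ".join(rng.sample(vocabulary, words_per_query)))
    return queries


# Hypothetical stand-in for the high-frequency, non-grammatical word list.
HIGH_FREQ_WORDS = [
    "time", "people", "government", "work", "year",
    "house", "school", "money", "water", "music",
]

queries = make_seed_queries(HIGH_FREQ_WORDS, seed=1)
```

With 100 queries and the top 5 hits retained for each, a single cycle yields at most 500 seed URLs for the broad crawl.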

Domain specific corpora

The domains used in this section of the corpus are based on the Open Directory project. Open Directory was a manually curated list of websites split into a number of subject domains. Unfortunately, the Open Directory is no longer in operation. The method used was:

  1. Choose a top level domain from the Open Directory project.
  2. Extract all URLs for sites listed under that domain.
  3. Choose 50 URLs at random and use these as seeds for a limited web crawl. The crawl is limited to the path of the URL, so only links to pages in the same sub-path may be followed by the crawler. The Heritrix crawler (version 1.12) was used.
  4. Run the crawl for one day.

The above steps were repeated for each domain over a cycle of 17 days (interspersed with the miniweb crawls from above) between August 2008 and February 2011. The crawl of each domain was run on a different day of the week each cycle.
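The seed selection and path restriction in step 3 can be sketched as below. This is an illustration of the rule only, not Heritrix configuration. The function names `choose_seeds` and `in_scope` are hypothetical, and treating "same sub-path" as "same host, under the seed URL's directory" is an assumption.

```python
import random
from urllib.parse import urlsplit


def choose_seeds(urls, k=50, seed=None):
    """Pick up to k seed URLs at random from a domain's URL list."""
    rng = random.Random(seed)
    urls = list(urls)
    return rng.sample(urls, min(k, len(urls)))


def in_scope(seed_url, candidate_url):
    """Return True if candidate_url lies under the seed URL's directory.

    Assumed reading of the path limit: the host must match and the
    candidate path must start with the directory part of the seed path.
    """
    seed, candidate = urlsplit(seed_url), urlsplit(candidate_url)
    if seed.hostname != candidate.hostname:
        return False
    # Directory part of the seed path: "/a/b/index.html" -> "/a/b/".
    seed_dir = seed.path.rsplit("/", 1)[0] + "/"
    candidate_path = candidate.path or "/"
    return candidate_path.startswith(seed_dir)


seed_page = "http://example.org/arts/music/index.html"
same_branch = in_scope(seed_page, "http://example.org/arts/music/jazz/page.html")
other_branch = in_scope(seed_page, "http://example.org/arts/film/")
```

Under this reading, a link into `/arts/music/jazz/` would be followed, while a link to `/arts/film/` or to another host would be ignored.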

Text clean-up

The following modules were used to clean up the textual data.

  • Language detection based on Cavnar and Trenkle (1994).
  • Boilerplate removal based on the Body Text Extraction algorithm. Boilerplate is content outside of the main textual body of the page, such as menus, headers and footers (see Kehoe and Gee 2007).
  • Duplicate detection using a document fingerprinting technique. The textual content of a page is reduced to a small number of hashes which can be checked against hashes already known by the system.
  • Date detection using the method described in Kehoe (2006).
  • Tokenising, POS-tagging and lemmatisation using the Stanford CoreNLP tools (version 3.8) (Manning et al. 2014).
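The duplicate detection step in the list above can be illustrated with a small sketch. The actual fingerprinting scheme is not specified here, so the approach below is an assumption chosen for illustration: hash overlapping word sequences, keep the few smallest hashes as the fingerprint, and compare against fingerprints already seen. The names `fingerprint` and `DuplicateIndex` and the values `shingle_size`, `keep` and `threshold` are all hypothetical.

```python
import hashlib


def fingerprint(text, shingle_size=5, keep=8):
    """Reduce a page's text to a small set of hashes (assumed scheme)."""
    words = text.lower().split()
    if len(words) < shingle_size:
        shingles = [" ".join(words)]
    else:
        shingles = [
            " ".join(words[i:i + shingle_size])
            for i in range(len(words) - shingle_size + 1)
        ]
    hashes = {hashlib.sha1(s.encode("utf-8")).hexdigest()[:16] for s in shingles}
    # Keep only the smallest hashes so every page has a compact fingerprint.
    return frozenset(sorted(hashes)[:keep])


class DuplicateIndex:
    """Remember fingerprints of accepted pages and flag near-identical ones."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold
        self.known = []

    def is_duplicate(self, text):
        fp = fingerprint(text)
        for other in self.known:
            # Share of hashes the two fingerprints have in common.
            overlap = len(fp & other) / len(fp | other)
            if overlap >= self.threshold:
                return True
        self.known.append(fp)
        return False


index = DuplicateIndex()
page = "the quick brown fox jumps over the lazy dog near the quiet river bank"
first_seen = index.is_duplicate(page)
second_seen = index.is_duplicate(page.upper())
```

Here the first call stores the page's fingerprint, and the second call, on a copy differing only in letter case, is reported as a duplicate. A linear scan over known fingerprints is used for brevity; a real system would more plausibly look hashes up in an index.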

Blogs

The Birmingham Blog Corpus was constructed using a combination of the methods above for the Technorati and Google Blog Search sub-corpora. The Blogspot and Wordpress sub-corpus was processed with specific HTML parsing rules to enable the accurate extraction of blog posts and comments. See Kehoe and Gee (2012) for more details.

References

Baroni, M. and S. Bernardini. 2004. BootCaT: Bootstrapping Corpora and Terms from the Web. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), 1313-1316. Lisbon, Portugal.

Cavnar, W. and J. Trenkle. 1994. N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, 161-175. Las Vegas, USA.

Kehoe, A. 2006. Diachronic Linguistic Analysis on the Web with WebCorp. In The Changing Face of Corpus Linguistics, edited by A. Renouf and A. Kehoe, 297-308. Amsterdam: Rodopi.

Kehoe, A. and M. Gee. 2007. New corpora from the web: making web text more 'text-like'. In Towards Multimedia in Corpus Studies, edited by P. Pahta, I. Taavitsainen, T. Nevalainen and J. Tyrkkö. Electronic publication, University of Helsinki.

Kehoe, A. and M. Gee. 2012. Reader comments as an aboutness indicator in online texts: introducing the Birmingham Blog Corpus. In Aspects of Corpus Linguistics: Compilation, Annotation, Analysis, edited by S. Oksefjell Ebeling, J. Ebeling and H. Hasselgård. Electronic publication, University of Helsinki.

Manning, C. D., M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard and M. McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60.