Corpora
ViMELF and CASE corpora
224,000 word corpus transcribed from video-mediated conversations in an international English-language context. Created as part of the CASE Project.
This corpus consists of 224,487 words from 35 transcripts of online video conversations. Two sub-corpora are available:
ViMELF: a corpus of Video-Mediated English as a Lingua Franca conversations
ViMELF contains 20 Skype conversations (114,946 tokens) between 40 speakers from Germany (20 speakers), Spain (5), Italy (5), Finland (5), and Bulgaria (5), totalling 744.5 minutes (ca. 12.5 hours), with an average conversation length of 37.23 minutes.
TaCoCASE: Transatlantic component of the Corpus of Academic Spoken English
TaCoCASE consists of 15 computer-mediated conversations (109,541 tokens) between 26 international university students from Germany (8 speakers), the United Kingdom (10), and the United States (8), totalling 650 minutes (ca. 10.5 hours), with an average conversation length of 43 minutes.
This interface combines plain text and annotated versions of the transcripts. Corpus analysis (e.g. word/lemma frequency lists, collocation and concordancing) can be performed on the plain text. The text is in lowercase, speaker IDs are displayed to mark breaks between utterances, and utterances without words are marked by ␣. Concordance results also include popups which show the original markup from the transcripts, as well as the video timestamp and speaker ID.
More information about the corpus, including the transcription conventions, can be found in the CoRD corpus description or on the CASE project website. The latter also provides details for obtaining the source transcripts and video files.
Birmingham Blog Corpus
500 million word corpus built from English language blogging websites, including a 180 million word sub-corpus separated into posts and comments.
This corpus consists of 493,840,018 words extracted from blog texts. The corpus is split into sections according to how the texts were discovered and downloaded:
| Technorati | Crawled the top blogs ranked by Technorati. |
| Google Blog Search | Downloaded new posts daily as identified by Google Blog Search. |
| Blogspot and Wordpress | Crawled blogs hosted on blogspot.com and wordpress.com. Extracted date information and separated posts from comments. |
Synchronic English Web Corpus
470 million word corpus built from web-extracted texts, including a randomly selected 'mini-web' and a high-level subject classification.
This corpus consists of 468,901,590 words (tokens) from web-extracted texts. It covers the period 2000-2010 split into the sub-corpora below.
Mini-Web
341,592,476 words from 100,000 randomly selected web pages, forming a sample of the distribution of texts throughout the web.
Domains
127,309,114 words from 56,000 pages selected based on the Open Directory classification of web pages. Each domain consists of 4,000 pages.
| Arts | 7,679,629 words |
| Business | 7,959,532 words |
| Computers | 9,236,133 words |
| Games | 9,255,064 words |
| Health | 10,364,577 words |
| Home | 6,992,879 words |
| Kids and Teens | 9,737,110 words |
| News | 6,820,547 words |
| Recreation | 7,417,516 words |
| Reference | 12,096,167 words |
| Science | 14,812,032 words |
| Shopping | 5,157,248 words |
| Society | 11,247,618 words |
| Sport | 8,533,062 words |
Diachronic English Web Corpus
130 million word corpus randomly selected from a larger collection and balanced to contain the same number of words per month.
This corpus consists of 129,705,810 words (tokens) from web-extracted texts. It covers the period Jan 2000 - Dec 2010. Each month contains approximately 1 million words.
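The balancing step can be pictured with a minimal sketch. It assumes that each document in the larger collection already carries a month label and a token count. The function name, the tuple layout and the stopping rule are illustrative assumptions, not the procedure actually used to build the corpus.

```python
import random
from collections import defaultdict


def balance_by_month(docs, target_tokens, seed=0):
    """Randomly pick documents for each month until roughly target_tokens is reached.

    docs: iterable of (doc_id, month, n_tokens) tuples, e.g. ("d1", "2000-01", 850).
    Returns a dict mapping each month to the list of selected doc_ids.
    """
    rng = random.Random(seed)

    # Group the candidate documents by month.
    by_month = defaultdict(list)
    for doc_id, month, n_tokens in docs:
        by_month[month].append((doc_id, n_tokens))

    selection = {}
    for month, items in sorted(by_month.items()):
        rng.shuffle(items)  # random selection within the month
        chosen, total = [], 0
        for doc_id, n_tokens in items:
            if total >= target_tokens:
                break  # this month has reached its quota
            chosen.append(doc_id)
            total += n_tokens
        selection[month] = chosen
    return selection
```

Because whole documents are added until the quota is met, each month ends up at or slightly above the target, which is consistent with "approximately 1 million words" per month.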
Webpages are dated using the method described in Kehoe (2006). Only pages whose date was found in the server header, in the HTML metadata, or in the body text near "last modified" or a similar phrase are included.
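As a rough illustration of checking these three date sources, the sketch below looks at the `Last-Modified` response header, then a `<meta>` tag, then the body text. It does not reproduce Kehoe (2006): the order of the checks, the regular expressions, the restriction to ISO-style dates in the page itself, and the name `detect_date` are all assumptions made for the example.

```python
import re
from datetime import date
from email.utils import parsedate_to_datetime

# Hypothetical patterns; real pages use many more date formats than these.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
META_DATE = re.compile(
    r"""<meta[^>]+name=["'](?:date|last-modified)["'][^>]+content=["']([^"']+)["']""",
    re.I,
)
BODY_DATE = re.compile(r"last\s+(?:modified|updated)\D{0,20}(\d{4}-\d{2}-\d{2})", re.I)


def _iso(text):
    """Parse the first YYYY-MM-DD date in text, or return None."""
    m = ISO_DATE.search(text)
    if not m:
        return None
    try:
        return date(*map(int, m.groups()))
    except ValueError:  # e.g. month 13
        return None


def detect_date(headers, html):
    """Return (date, source) for a page, or (None, None) if no date is found."""
    value = headers.get("Last-Modified")
    if value:
        try:
            return parsedate_to_datetime(value).date(), "header"
        except (TypeError, ValueError):
            pass  # unparseable header: fall through to the page itself
    m = META_DATE.search(html)
    if m and _iso(m.group(1)):
        return _iso(m.group(1)), "metadata"
    m = BODY_DATE.search(html)
    if m and _iso(m.group(1)):
        return _iso(m.group(1)), "body"
    return None, None
```

Under the inclusion rule described above, a page for which this function returns `(None, None)` would be left out of the corpus.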
The web corpora available in WebCorp LSE were constructed as outlined below.
Miniweb
The miniweb section of the corpus is intended to be a microcosm of the web. In other words, as few restrictions as possible were placed on the selection of pages with the goal of obtaining a distribution of pages within the corpus that is roughly similar to the distribution of pages from sites across the web itself. Our method for selecting the pages takes inspiration from Baroni and Bernardini (2004):
- Select high frequency words from the British National Corpus and our existing newspaper corpora, excluding grammatical words.
- Create 100 combinations of three words by choosing words at random from the high frequency word list.
- Submit these combinations of words to the Google Search API (no longer in operation) and retrieve the top 5 hits.
- Use the full set of Google hits as the seeds for a 'broad' web crawl. A broad crawl means that any link to any web page found may be followed by the crawler. The Heritrix crawler (version 1.12) was used.
- Run the crawl for one day.
The above steps were repeated every 6 days between August 2008 and February 2011.
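The seed-selection steps can be sketched as follows. This is an illustration only: the word lists are placeholders, the helper names are hypothetical, and the search-engine request is omitted because the API used is no longer available. The sketch also assumes that the three words within a query are distinct, which the description above does not state.

```python
import random


def content_words(ranked_words, grammatical_words, top_n):
    """Keep the top_n highest-frequency words that are not grammatical words.

    ranked_words is assumed to be sorted from most to least frequent.
    """
    stop = set(grammatical_words)
    return [w for w in ranked_words if w not in stop][:top_n]


def make_seed_queries(words, n_queries=100, words_per_query=3, seed=None):
    """Build n_queries search strings, each of words_per_query random words."""
    rng = random.Random(seed)
    return [
        " ".join(rng.sample(words, words_per_query))  # distinct words per query
        for _ in range(n_queries)
    ]
```

With 100 queries and the top 5 hits retrieved for each, a single round yields at most 500 seed URLs for the broad crawl.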
Domain specific corpora
The domains used in this section of the corpus are based on the Open Directory project. Open Directory was a manually curated list of websites split into a number of subject domains; it is no longer in operation. The method used was:
- Choose a top level domain from the Open Directory project.
- Extract all URLs for sites listed under that domain.
- Choose 50 URLs at random and use these as seeds for a limited web crawl. The crawl is limited to the path of the URL, so only links to pages in the same sub-path may be followed by the crawler. The Heritrix crawler (version 1.12) was used.
- Run the crawl for one day.
The above steps were repeated for each domain over a cycle of 17 days (interspersed with the miniweb crawls from above) between August 2008 and February 2011. The crawl of each domain was run on a different day of the week each cycle.
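The path restriction on the limited crawl can be illustrated with a small scope check. This is not Heritrix configuration; it is a hypothetical Python function that shows one way to read "only links to pages in the same sub-path", assuming the permitted prefix is the directory of the seed URL on the same host.

```python
from urllib.parse import urlsplit


def in_scope(seed_url, link_url):
    """Return True if link_url lies on the seed's host and under the seed's directory."""
    seed, link = urlsplit(seed_url), urlsplit(link_url)
    if seed.netloc.lower() != link.netloc.lower():
        return False
    # The seed path up to and including its last "/" is the permitted prefix,
    # e.g. "/recipes/index.html" gives "/recipes/".
    prefix = seed.path[: seed.path.rfind("/") + 1] or "/"
    return (link.path or "/").startswith(prefix)
```

For a seed of `http://example.org/recipes/index.html`, a link to `http://example.org/recipes/soups/leek.html` would be followed, while links to `http://example.org/about.html` or to any other host would not. A broad crawl, by contrast, applies no such check.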
Text clean-up
The following modules were used to clean up the textual data.
- Language detection based on Cavnar and Trenkle (1994).
- Boilerplate removal based on the Body Text Extraction algorithm. Boilerplate is content outside of the main textual body of the page, such as menus, headers and footers (see Kehoe and Gee 2007).
- Duplicate detection using a document fingerprinting technique. The textual content of a page is reduced to a small number of hashes which can be checked against hashes already known by the system.
- Date detection using the method described in Kehoe (2006).
- Tokenising, POS-tagging and lemmatisation using the Stanford CoreNLP tools (version 3.8) (Manning et al. 2014).
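To give a flavour of the language detection step, the sketch below follows the general idea of Cavnar and Trenkle (1994): each text is reduced to a ranked profile of its most frequent character n-grams, and a document is assigned to the language whose profile is closest under an "out-of-place" rank distance. The function names, the profile size of 300 and the tokenisation are assumptions for this example and are not taken from the WebCorp LSE implementation.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    for token in re.findall(r"[^\W\d_]+", text.lower()):  # runs of letters only
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def detect_language(text, lang_profiles):
    """Return the key of the language profile closest to the text's own profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

In use, `lang_profiles` would be a dictionary such as `{"en": ngram_profile(english_sample), "de": ngram_profile(german_sample)}` built from training text for each language. A page whose closest profile is not English could then be discarded before the later clean-up steps.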
Blogs
The Birmingham Blog Corpus was constructed using a combination of the methods above for the Technorati and Google Blog Search sub-corpora. The Blogspot and Wordpress sub-corpus was processed with specific HTML parsing rules to enable the accurate extraction of blog posts and comments. See Kehoe and Gee (2012) for more details.
References
Baroni, M. and S. Bernardini. 2004. BootCaT: Bootstrapping Corpora and Terms from the Web. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), 1313-1316. Lisbon, Portugal.
Cavnar, W. and J. Trenkle. 1994. N-Gram-Based Text Categorization. In Proceedings of Third Annual Symposium on Document Analysis and Information Retrieval, 161-175. Las Vegas, USA.
Kehoe, A. 2006. Diachronic Linguistic Analysis on the Web with WebCorp. In The Changing Face of Corpus Linguistics, edited by A. Renouf and A. Kehoe, 297-308. Amsterdam: Rodopi.
Kehoe, A. and M. Gee. 2007. New corpora from the web: making web text more 'text-like'. In Towards Multimedia in Corpus Studies, edited by P. Pahta, I. Taavitsainen, T. Nevalainen and J. Tyrkkö. Electronic publication, University of Helsinki.
Kehoe, A. and M. Gee. 2012. Reader comments as an aboutness indicator in online texts: introducing the Birmingham Blog Corpus. In Aspects of Corpus Linguistics: Compilation, Annotation, Analysis, edited by S. Oksefjell Ebeling, J. Ebeling and H. Hasselgård. Electronic publication, University of Helsinki.
Manning, C. D., M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard and M. McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55-60.