WebCorp Learn - Corpus Tool
Use this tool to quickly get started analysing your own texts using methods from Corpus Linguistics. This text analysis tool runs entirely in your browser. This means you don't need to install any software and the data you load stays on your computer (none of it is sent to our servers). Example texts can be used before providing your own.
What is Corpus Linguistics?
Corpus Linguistics is the study of language based on real-world text data, often referred to as a "corpus" (plural: "corpora"). A corpus is essentially a collection of digital texts. These texts can come from various sources like books, newspapers, websites, or spoken dialogue.
How can I analyse my texts in this tool?
This tool allows you to quickly get started analysing your own texts (documents or spreadsheets). You can use it to find the words occurring most frequently in your texts, find words that tend to occur near each other (providing an overview of how a word is used), and extract examples of words from the text. This tool is designed to be an introduction to these methods, favouring the use of simple frequency counts over more complex statistics.
This tool is being developed by the Research and Development Unit for English Studies. We welcome your feedback, which can be submitted using our feedback form.
Use the options below to add files to analyse. Valid file types are plain text (.txt), PDF (.pdf), Word documents (.docx) and spreadsheets in comma-separated values format (.csv).
Drag and drop files:
Drag and drop your files here
Or choose files:
Or paste in some text:
Or use a pre-built corpus:
Transhistorical Corpus of Written English (TCWE)
The Transhistorical Corpus of Written English (TCWE) is a 500,000-word diachronic text corpus. The texts within the corpus range in date from the fifteenth to the twenty-first century and cover five text types: sermons, statutes, letters, emails and instant messages. The TCWE was developed at Edge Hill University as part of a project that ran from 2019 to 2021, directed by Dr Imogen Marcus with the assistance of Dr Ursula Maden-Weinberger.
The TCWE was designed to investigate innovation in digital written language in a historical context, in particular the way such language has previously been conceptualised as a hybrid of speech and writing. For this reason, the corpus contains sermons (towards the speech end of a conceptual speech-writing continuum) and statutes (towards the writing end of that continuum), as well as letters, emails and instant messages. However, the corpus does not need to be used for just this purpose.
Table 1 below shows the number of words for each text type. Sermons, statutes, and letters stretch back to the medieval period, whilst email and instant messaging are confined to the twenty-first century.
| Century | Sermons | Letters | Email | Instant messaging | Statutes |
|---|---|---|---|---|---|
| 15th C | 20456 | 20819 | 0 | 0 | 20307 |
| 16th C | 19846 | 20139 | 0 | 0 | 20666 |
| 17th C | 19930 | 22718 | 0 | 0 | 20599 |
| 18th C | 20336 | 21657 | 0 | 0 | 20182 |
| 19th C | 19393 | 19880 | 0 | 0 | 18949 |
| 20th C | 19823 | 20122 | 0 | 0 | 21225 |
| 21st C | 19966 | 0 | 50990 | 44354 | 19273 |
The corpus files are in CSV format, with one file per text type and century. Table 2 shows which file contains each combination of text type and century.
| Text ID label | Century and text type | CSV file |
|---|---|---|
| S15, e.g. S15_001 | 15th Century sermon, first text file (text file number included for information) | TCWE_S15.csv |
| S16 | 16th Century sermon | TCWE_S16.csv |
| S17 | 17th Century sermon | TCWE_S17.csv |
| S18 | 18th Century sermon | TCWE_S18.csv |
| S19 | 19th Century sermon | TCWE_S19.csv |
| S20 | 20th Century sermon | TCWE_S20.csv |
| S21 | 21st Century sermon | TCWE_S21.csv |
| T15 | 15th Century statute | TCWE_T15.csv |
| T16 | 16th Century statute | TCWE_T16.csv |
| T17 | 17th Century statute | TCWE_T17.csv |
| T18 | 18th Century statute | TCWE_T18.csv |
| T19 | 19th Century statute | TCWE_T19.csv |
| T20 | 20th Century statute | TCWE_T20.csv |
| T21 | 21st Century statute | TCWE_T21.csv |
| L15 | 15th Century letter | TCWE_L15.csv |
| L16 | 16th Century letter | TCWE_L16.csv |
| L17 | 17th Century letter | TCWE_L17.csv |
| L18 | 18th Century letter | TCWE_L18.csv |
| L19 | 19th Century letter | TCWE_L19.csv |
| L20 | 20th Century letter | TCWE_L20.csv |
| E21 | 21st Century email | TCWE_E21.csv |
| I21 | 21st Century instant message | TCWE_I21.csv |
Additional metadata (properties of the texts) is available for letters, including author name, author gender, recipient name and recipient gender. The TCWE was compiled from a range of sources that have their own approaches to historical spelling and abbreviation, which have been modified for consistency. More information can be found in the TCWE metadata and annotation spreadsheet.
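The text ID labels in Table 2 follow a regular pattern, so they can be split into their parts programmatically. The following is a minimal sketch in Python, not part of the tool itself. The function name `parse_text_id` and the returned field names are hypothetical, and the pattern assumes that every label looks like the `S15_001` example, with an optional text file number after an underscore.

```python
import re

# Letter codes taken from Table 2.
TEXT_TYPES = {
    "S": "sermon",
    "T": "statute",
    "L": "letter",
    "E": "email",
    "I": "instant message",
}

# Assumed label shape: type letter, two-digit century, optional "_<number>".
ID_PATTERN = re.compile(r"^([STLEI])(\d{2})(?:_(\d+))?$")


def parse_text_id(text_id):
    """Split a label such as 'S15_001' into text type, century and file number."""
    match = ID_PATTERN.match(text_id)
    if match is None:
        raise ValueError(f"Unrecognised text ID: {text_id!r}")
    letter, century, number = match.groups()
    return {
        "text_type": TEXT_TYPES[letter],
        "century": int(century),
        "file_number": int(number) if number else None,
        # File naming follows the 'CSV file' column of Table 2.
        "csv_file": f"TCWE_{letter}{century}.csv",
    }


print(parse_text_id("S15_001"))
```

The sketch only checks the shape of the label. It does not check that the combination exists in the corpus (for example, Table 2 lists no file for 21st century letters).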
Novels of Charles Dickens
This corpus contains the full text of 16 novels written by Charles Dickens with a total word count of 3.8 million words.
The texts you have imported are listed in the table below. The filename and size in words are displayed for each text.
Choose files to analyse using the Import Files tab.
Choose a word from the Frequent Words tab using the buttons.
Choose a word from the Frequent Words or Collocates tabs using the buttons.
The amount of data this tool can process is limited by the browser's capabilities. In testing, we've found it can comfortably process 5 million words in the latest versions of popular browsers, and it is capable of processing more. 5 million words is about the same as:
- 1,000 novel chapters, or
- 6,000 news articles, or
- 500 academic articles, or
- 200,000 tweets, or
- 125,000 open-text survey comments
There are also limits on maximum file sizes in all browsers (which tend to change). This means importing the number of items suggested above might need to be done in multiple stages or using multiple files (a consideration for huge spreadsheets). Due to limits in browser storage, refreshing the page will empty the list of files being analysed.
Using any of the options on this page will extract the text from the files and add the text to the corpus.
If importing a CSV file, the full text of the file will be imported by default, then specific columns can be selected on the Corpus Files tab.
If copying and pasting in text, you can also specify a name. If the name is left blank, one will be generated automatically.
This page shows a list of files you have imported and the progress made in extracting the text. Once the files are ready, you can click the Analyse word frequency button to continue. Texts can be removed by clicking on the remove buttons on the right hand side of the table.
CSV files have some extra options. Once a CSV file is imported you can click the Choose Columns button to limit the analysis to specific columns. This opens a dialog box containing a sequence of checkboxes. Use the first checkbox to specify whether the first row of the CSV file should be treated as a header row (a row containing the names of the columns). The other checkboxes allow you to choose the columns to include in the analysis, with the default selection being the column containing the most text. A short snippet is presented from each column so you can make your selection (scroll down if not all snippets are visible).
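The column selection described above can be illustrated with a short Python sketch. This is not the tool's own code. The function names `read_columns` and `default_column` are hypothetical, and "the most text" is assumed here to mean the largest total number of characters, which may differ from the measure the tool uses.

```python
import csv
import io


def read_columns(csv_text, has_header=True):
    """Return a list of (column name, list of cell values) pairs."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header and rows:
        # The first row supplies the column names and is not analysed.
        names, rows = rows[0], rows[1:]
    else:
        # No header row: generate placeholder names.
        width = max((len(row) for row in rows), default=0)
        names = [f"Column {i + 1}" for i in range(width)]
    columns = []
    for index, name in enumerate(names):
        values = [row[index] for row in rows if index < len(row)]
        columns.append((name, values))
    return columns


def default_column(columns):
    """Pick the column with the most text (assumed: most characters)."""
    name, _values = max(columns, key=lambda column: sum(len(v) for v in column[1]))
    return name


csv_text = "id,gender,comment\n1,F,The course was very helpful\n2,M,Too short\n"
print(default_column(read_columns(csv_text)))
```

With the header checkbox unticked (`has_header=False`), the first row is counted as data and the columns receive generated names instead.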
Frequent words are displayed in a wordlist, which shows each word alongside the number of times it occurs in the corpus (i.e. its frequency). The top 1000 words are shown. The list can be filtered using the Wordlist Options and regenerated by clicking the Generate Wordlist button.
Each word in the list can be analysed in more detail by using the three buttons on the right hand side of the table. One button shows other words which occur near the word, called Collocates. Another shows examples of the word in context, called Concordances. The third shows the frequency distribution of the word across texts.
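A wordlist of this kind is a simple frequency count. The Python sketch below shows the idea under stated assumptions: the tokenisation rule (lowercased runs of word characters, allowing internal apostrophes) is an assumption, as the tool's own rules are not described here, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a word is a run of word characters, optionally joined by apostrophes.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*")


def tokenise(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def wordlist(texts, top_n=1000):
    """Count every word across all texts and return the most frequent ones."""
    counts = Counter()
    for text in texts:
        counts.update(tokenise(text))
    return counts.most_common(top_n)


corpus = ["The cat sat on the mat.", "The dog sat down."]
for word, frequency in wordlist(corpus, top_n=3):
    print(word, frequency)
```

Here `top_n=1000` mirrors the number of words shown in the list. Different tokenisation choices (for example, how hyphens or numbers are handled) would give slightly different counts.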
Collocation is all about finding words that frequently occur together. When you select a word from the Frequent Words tab, a list of all the words in the corpus which occur near it (its 'collocates') will be displayed here. By default, a span of 4 words to the left and right is used, so every word that occurs within that span of the selected word will be counted and presented in the collocate list.
The collocation span can be changed using the Collocation Options. For example, to only count words occurring immediately after the selected word, set Left span to None and Right span to 1 word. Or to only count words occurring 1 or 2 places before the selected word, set Left span to 2 words and Right span to None.
Each row of the list has a button on the right hand side. Clicking this will show examples of the pair of words in context.
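The span-based counting described above can be sketched in a few lines of Python. This is an illustration only: the function name `collocates` and its parameters are hypothetical, the input is assumed to be an already tokenised text, and details such as how the tool treats text boundaries are not covered.

```python
from collections import Counter


def collocates(tokens, node, left_span=4, right_span=4):
    """Count words occurring within the given span of each occurrence of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Up to left_span words before the node (none if left_span is 0).
        counts.update(tokens[max(0, i - left_span):i])
        # Up to right_span words after the node (none if right_span is 0).
        counts.update(tokens[i + 1:i + 1 + right_span])
    return counts


tokens = "the cat sat on the mat and the dog sat on the rug".split()

# Only words immediately after "sat": Left span None, Right span 1 word.
print(collocates(tokens, "sat", left_span=0, right_span=1))

# Only words 1 or 2 places before "sat": Left span 2 words, Right span None.
print(collocates(tokens, "sat", left_span=2, right_span=0))
```

The two calls correspond to the two Collocation Options examples given above, with a span of None represented as 0.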
Concordances are a way of presenting examples of word use along with the immediate context. This tool displays concordances in KWIC (Key Word In Context) format, where the selected word is displayed in a column down the centre of the page and the words occurring to the left and right are displayed in adjacent columns. This format makes it easier to spot patterns in the text surrounding the search term, which is useful for linguistic study. The name of the text in which the example was found is also displayed, including the column and row number for CSV files.
The Concordance Options allow you to change how the concordance lines are sorted. The default is a random selection of examples from the corpus. Sorting alphabetically by the left or right context can help reveal repeated patterns of word use. The size of the context being displayed can also be changed.
Initially, 1000 concordances will be displayed in the table. Scrolling to the bottom of the page will load the next 1000.
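To make the KWIC layout concrete, here is a minimal Python sketch that builds concordance lines from a tokenised text and sorts them by right context. The function names and the `context` parameter are hypothetical, and the sketch assumes the context size is measured in words, which may not match the tool.

```python
def kwic(tokens, node, context=5):
    """Return (left context, node, right context) for each occurrence of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append((left, token, right))
    return lines


def sort_by_right(lines):
    """Sort concordance lines alphabetically by their right context."""
    return sorted(lines, key=lambda line: line[2])


def format_kwic(lines):
    """Right-align the left context so the node forms a central column."""
    width = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{width}}  {node}  {right}" for left, node, right in lines]


tokens = "the cat sat on the mat and the dog sat on the rug".split()
for line in format_kwic(sort_by_right(kwic(tokens, "the", context=2))):
    print(line)
```

Sorting by right context groups lines that continue in the same way, which is how repeated patterns become visible.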
The bar charts and tables below show how much the selected word occurs across texts in the corpus or across rows of a CSV file. This can be used to find out if a word is more frequent in some texts than others. For CSV files, it can be used to see how often a word occurs across the values of each column (i.e. across categories). The summary data for each column is plotted in a bar chart and presented in a table below.
Either the frequency, frequency per million words, number of texts, or percentage of texts per category can be plotted on the bar charts. Frequency per million words, also called relative frequency or normalised frequency, is the default. This is useful when calculating frequency across texts of different lengths, as raw frequencies can be skewed by longer texts. Number/percentage of texts shows, for each value of the column, how many of the texts contain the selected word, but not how many times within the texts. This may be useful when working with lots of small texts, like you might have in a CSV file.
The orientation and order of the bar charts will change based on whether the values are numbers or not. Note that numbers are still treated as individual categories.
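The four measures that can be plotted are straightforward to compute. The Python sketch below is an illustration rather than the tool's implementation: the function name `distribution`, the field names, and the input shape (a list of category and token-list pairs, one per text or CSV row) are all assumptions.

```python
from collections import defaultdict


def distribution(rows, word):
    """Summarise how often word occurs in each category.

    rows is a list of (category, tokens) pairs, one pair per text or CSV row.
    """
    summary = defaultdict(
        lambda: {"frequency": 0, "words": 0, "texts": 0, "texts_with_word": 0}
    )
    for category, tokens in rows:
        entry = summary[category]
        hits = tokens.count(word)
        entry["frequency"] += hits
        entry["words"] += len(tokens)
        entry["texts"] += 1
        if hits:
            entry["texts_with_word"] += 1
    for entry in summary.values():
        # Frequency per million words (relative or normalised frequency).
        entry["per_million"] = (
            entry["frequency"] * 1_000_000 / entry["words"] if entry["words"] else 0.0
        )
        # Percentage of texts in the category that contain the word at least once.
        entry["percent_texts"] = entry["texts_with_word"] * 100 / entry["texts"]
    return dict(summary)


rows = [
    ("letter", "i hope you are well".split()),
    ("letter", "you are kind and you are missed".split()),
    ("email", "thanks see you soon".split()),
    ("email", "ok".split()),
]
for category, stats in distribution(rows, "you").items():
    print(category, stats)
```

In this made-up example "you" occurs 3 times in 12 words of letters and once in 5 words of email, so the raw frequencies differ by a factor of three while the normalised frequencies are much closer. This is the effect of text length that frequency per million words corrects for.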