WebCorp Learn - Corpus Tool

Use this tool to quickly get started analysing your own texts using methods from Corpus Linguistics. This text analysis tool runs entirely in your browser. This means you don't need to install any software and the data you load stays on your computer (none of it is sent to our servers). Example texts can be used before providing your own.

What is Corpus Linguistics?

Corpus Linguistics is the study of language based on real-world text data, often referred to as a "corpus" (plural: "corpora"). A corpus is essentially a collection of digital texts. These texts can come from various sources like books, newspapers, websites, or spoken dialogue.

How can I analyse my texts in this tool?

This tool allows you to quickly get started analysing your own texts (documents or spreadsheets). You can use it to find the words occurring most frequently in your texts, find words that tend to occur near each other (providing an overview of how a word is used), and extract examples of words from the text. This tool is designed to be an introduction to these methods, favouring the use of simple frequency counts over more complex statistics.

This tool is being developed by the Research and Development Unit for English Studies. We welcome your feedback, which can be submitted using our feedback form.

Use the options below to add files to analyse. Valid file types are plain text (.txt), PDF (.pdf), Word documents (.docx) and spreadsheets in comma-separated values format (.csv).

Drag and drop files:

Or choose files:

Or paste in some text:

Or use a pre-built corpus:


Transhistorical Corpus of Written English (TCWE)

The Transhistorical Corpus of Written English (TCWE) is a 500,000-word diachronic text corpus. The texts within the corpus range in date from the fifteenth to the twenty-first century and cover five text types: sermons, statutes, letters, emails and instant messages. The TCWE was developed at Edge Hill University as part of a project that ran from 2019 to 2021, directed by Dr Imogen Marcus with the assistance of Dr Ursula Maden-Weinberger.

The TCWE was designed to investigate innovation in digital written language in a historical context, in particular the way such language has previously been conceptualised as a hybrid of speech and writing. For this reason, the corpus contains sermons (towards the speech end of a conceptual speech-writing continuum) and statutes (towards the writing end of that continuum), as well as letters, emails and instant messages. However, the corpus does not need to be used for just this purpose.

Table 1 below shows the number of words for each text type in each century. Sermons, statutes, and letters stretch back to the medieval period, whilst email and instant messaging are confined to the twenty-first century.

Table 1: Number of tokens per century in individual text type sub-corpora.

Century   Sermons   Letters   Instant messaging   Email   Statutes
15th C    20456     20819     0                   0       20307
16th C    19846     20139     0                   0       20666
17th C    19930     22718     0                   0       20599
18th C    20336     21657     0                   0       20182
19th C    19393     19880     0                   0       18949
20th C    19823     20122     0                   0       21225
21st C    19966     0         50990               44354   19273

The corpus files are in CSV format, one file per text type per century. Table 2 shows which file contains each combination of text type and century.

Table 2: Corpus files and text IDs listed with century and text type.

Text ID label   Century and text type           CSV file
S15             15th Century sermon             TCWE_S15.csv
S16             16th Century sermon             TCWE_S16.csv
S17             17th Century sermon             TCWE_S17.csv
S18             18th Century sermon             TCWE_S18.csv
S19             19th Century sermon             TCWE_S19.csv
S20             20th Century sermon             TCWE_S20.csv
S21             21st Century sermon             TCWE_S21.csv
T15             15th Century statute            TCWE_T15.csv
T16             16th Century statute            TCWE_T16.csv
T17             17th Century statute            TCWE_T17.csv
T18             18th Century statute            TCWE_T18.csv
T19             19th Century statute            TCWE_T19.csv
T20             20th Century statute            TCWE_T20.csv
T21             21st Century statute            TCWE_T21.csv
L15             15th Century letter             TCWE_L15.csv
L16             16th Century letter             TCWE_L16.csv
L17             17th Century letter             TCWE_L17.csv
L18             18th Century letter             TCWE_L18.csv
L19             19th Century letter             TCWE_L19.csv
L20             20th Century letter             TCWE_L20.csv
E21             21st Century email              TCWE_E21.csv
I21             21st Century instant message    TCWE_I21.csv

Note: full text IDs include a text file number for information, e.g. S15_001 is the first text file of the 15th Century sermons.

Additional metadata (properties of the texts) is available for letters, including author name, author gender, recipient name and recipient gender. The TCWE was compiled from a range of sources that have their own approaches to historical spelling and abbreviation, which have been modified for consistency. More information can be found in the TCWE metadata and annotation spreadsheet.


Novels of Charles Dickens

This corpus contains the full text of 16 novels written by Charles Dickens with a total word count of 3.8 million words.

The texts you have imported are listed in the table below. The filename and size in words is displayed for each text.

Choose files to analyse using the Import Files tab.

Choose a word from the Frequent Words tab using the buttons.

Choose a word from the Frequent Words or Collocates tabs using the buttons.

Choose a word from the Frequent Words tab using the button.

The amount of data this tool can process is limited by the browser's capabilities. In testing, we've found it can comfortably process 5 million words in the latest versions of popular browsers, and it is capable of processing more. 5 million words is about the same as:

  • 1,000 novel chapters, or
  • 6,000 news articles, or
  • 500 academic articles, or
  • 200,000 tweets, or
  • 125,000 open-text survey comments

There are also limits on maximum file sizes in all browsers (and these limits tend to change). This means that importing the number of items suggested above might need to be done in multiple stages or using multiple files (a consideration for huge spreadsheets). Due to limits in browser storage, refreshing the page will empty the list of files being analysed.

Using any of the options on this page will extract the text from the files and add the text to the corpus.

If importing a CSV file, the full text of the file will be imported by default; specific columns can then be selected on the Corpus Files tab.

If copying and pasting in text, you can also specify a name. If the name is left blank, one will be generated automatically.

This page shows a list of files you have imported and the progress made in extracting the text. Once the files are ready, you can click the Analyse word frequency button to continue. Texts can be removed by clicking on the remove buttons on the right-hand side of the table.

CSV files have some extra options. Once a CSV file is imported you can click the Choose Columns button to limit the analysis to specific columns. This opens a dialog box containing a sequence of checkboxes. Use the first checkbox to specify whether the first row of the CSV file should be treated as a header row (a row containing the names of the columns). The other checkboxes allow you to choose the columns to include in the analysis, with the default selection being the column containing the most text. A short snippet is presented from each column so you can make your selection (scroll down if not all snippets are visible).
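
As a rough illustration of the default column choice, the sketch below picks the column holding the most text from a small CSV. It assumes "most text" means the largest total character count, which may differ from how this tool measures it. The function name `column_with_most_text` and the sample data are hypothetical.

```python
import csv
import io

def column_with_most_text(csv_text, has_header=True):
    """Return (index, header name or None) of the column with the most characters.

    Assumption: "most text" is measured as total character count per column.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0] if has_header else None
    body = rows[1:] if has_header else rows
    if not body:
        raise ValueError("CSV contains no data rows")
    totals = [0] * max(len(row) for row in body)
    for row in body:
        for i, cell in enumerate(row):
            totals[i] += len(cell)
    best = totals.index(max(totals))
    name = header[best] if header and best < len(header) else None
    return best, name

# Hypothetical survey spreadsheet with a header row.
sample = (
    "id,rating,comment\n"
    "1,5,Great service and friendly staff\n"
    "2,2,Too slow\n"
)
best_index, best_name = column_with_most_text(sample)
```

Here the `comment` column would be selected, since it holds far more text than the `id` or `rating` columns.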

Frequent words are displayed in a wordlist, which shows each word alongside the number of times it occurs in the corpus (i.e. its frequency). The top 1000 words are shown. The list can be filtered using the Wordlist Options and regenerated by clicking the Generate Wordlist button.
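
To make the idea of a wordlist concrete, here is a minimal sketch of counting word frequencies across a few texts. The tokenisation rule (runs of letters, digits and apostrophes), the function name `wordlist` and its parameters are assumptions for illustration, not a description of how this tool works internally.

```python
import re
from collections import Counter

def wordlist(texts, top_n=1000, case_insensitive=True):
    """Count word frequencies across all texts and return the top_n most frequent."""
    counts = Counter()
    for text in texts:
        if case_insensitive:
            text = text.lower()
        # Assumed tokenisation: runs of word characters and apostrophes.
        counts.update(re.findall(r"[\w']+", text))
    return counts.most_common(top_n)

# Hypothetical two-text corpus.
sample = ["The cat sat on the mat.", "The dog sat too."]
top = wordlist(sample, top_n=3)
```

With case-insensitive counting, "The" and "the" are treated as the same word, so "the" heads this small list with a frequency of 3, followed by "sat" with 2.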

Each word in the list can be analysed in more detail by using the three buttons on the right-hand side of the table. One button shows other words which occur near the word, called Collocates. Another shows examples of the word in context, called Concordances. The third shows the frequency distribution of the word across texts.

Collocation is all about finding words that frequently occur together. When you select a word from the Frequent Words tab, a list of all the words in the corpus which occur near it (its 'collocates') will be displayed here. By default, a span of 4 words to the left and right is used, so every word that occurs within that span of the selected word will be counted and presented in the collocate list.

The collocation span can be changed using the Collocation Options. For example, to only count words occurring immediately after the selected word, set Left span to None and Right span to 1 word. Or to only count words occurring 1 or 2 places before the selected word, set Left span to 2 words and Right span to None.
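
The span settings described above can be sketched as a simple window count around each occurrence of the selected word. This is an illustrative sketch only: the function name `collocates`, its parameters and the sample sentence are hypothetical, and the text is assumed to be already split into words.

```python
from collections import Counter

def collocates(tokens, node, left=4, right=4):
    """Count every word within `left` / `right` positions of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        counts.update(tokens[max(0, i - left):i])      # words before the node
        counts.update(tokens[i + 1:i + 1 + right])     # words after the node
    return counts

tokens = "the old man and the old dog saw the young man".split()

# Left span None (0), right span 1: only the word immediately after "the".
after_only = collocates(tokens, "the", left=0, right=1)

# Left span 2, right span None (0): only words 1 or 2 places before "man".
before_only = collocates(tokens, "man", left=2, right=0)
```

In this sample, "old" follows "the" twice and "young" once, so `after_only` contains just those two collocates.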

A button on the right-hand side of the list shows examples of the pair of words in context.

Concordances are a way of presenting examples of word use along with the immediate context. This tool displays concordances in KWIC (Key Word In Context) format, where the selected word is displayed in a column down the centre of the page and the words occurring to the left and right are displayed in adjacent columns. This format makes it easier to find patterns in the text surrounding the search term, which is useful for linguistic study. The name of the text in which the example was found is also displayed, including the column and row number for CSV files.

The Concordance Options allow you to change how the concordance lines are sorted. The default is a random selection of examples from the corpus. Sorting alphabetically by the left or right context can help reveal repeated patterns of word use. The size of the context being displayed can also be changed.
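
The KWIC layout and the effect of sorting can be illustrated with a short sketch. It assumes the text is already split into words; the function names `kwic` and `format_kwic`, the `context` parameter and the sample sentence are hypothetical.

```python
def kwic(tokens, node, context=3):
    """Return (left context, keyword, right context) for each occurrence of `node`."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append((left, token, right))
    return lines

def format_kwic(lines):
    """Right-align the left context so the keyword forms a central column."""
    width = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{width}}  {keyword}  {right}" for left, keyword, right in lines]

tokens = "the old man and the old dog saw the young man".split()
lines = kwic(tokens, "old", context=2)

# Sorting alphabetically by the right context groups similar continuations together.
by_right = sorted(lines, key=lambda line: line[2])
formatted = format_kwic(by_right)
```

Sorting by `line[0]` instead would order the examples by their left context.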

Initially, 1000 concordances will be displayed in the table. Scrolling to the bottom of the page will load the next 1000.

The bar charts and tables below show how often the selected word occurs across texts in the corpus or across rows of a CSV file. This can be used to find out if a word is more frequent in some texts than others. For CSV files, it can be used to see how often a word occurs across the values of each column (i.e. across categories). The summary data for each column is plotted in a bar chart and presented in a table below.

Either the frequency, frequency per million words, number of texts, or percentage of texts per category can be plotted on the bar charts. Frequency per million words, also called relative frequency or normalised frequency, is the default. This is useful when comparing frequency across texts of different lengths, as raw frequencies can be skewed by longer texts. Number/percentage of texts shows, for each value of the column, how many of the texts contain the selected word, but not how many times it occurs within them. This may be useful when working with lots of small texts, such as you might have in a CSV file.
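
As a worked example of these measures, the sketch below computes frequency per million words and the number and percentage of texts containing a word. The function names and all counts are hypothetical, chosen only to show why normalising matters.

```python
def per_million(frequency, total_words):
    """Normalised frequency: occurrences per million words."""
    return frequency * 1_000_000 / total_words

def text_coverage(texts, word):
    """Return (number of texts containing `word`, percentage of texts containing it)."""
    containing = sum(1 for tokens in texts if word in tokens)
    return containing, 100 * containing / len(texts)

# Hypothetical counts: the longer text has the higher raw frequency,
# but the shorter text uses the word more often relative to its length.
short_text = per_million(50, 20_000)     # 50 occurrences in 20,000 words
long_text = per_million(100, 100_000)    # 100 occurrences in 100,000 words

# Hypothetical short survey comments, already split into words.
comments = [["good", "service"], ["slow"], ["good", "good"], ["fine"]]
count, percent = text_coverage(comments, "good")
```

Note that the third comment contains "good" twice but still counts as one text, which is the difference between the number of texts and the raw frequency.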

The orientation and order of the bar charts will change based on whether the values are numbers or not. Note that numbers are still treated as individual categories.