User Guide
What is WebCorp?
However large and up-to-date the electronic text corpora available are, there will always be aspects of the language which are too rare or too new to be evidenced in them. WebCorp is a suite of tools which allows access to the World Wide Web as a corpus - a large collection of texts from which facts about the language can be extracted.
What has changed?
Following the shutdown of the Bing Web Search API in August 2025, WebCorp has moved to using the Brave Search API as its main provider for web search results. Brave does not index as much of the web as Google or Bing, so it may be more difficult to find rare examples. However, Brave does provide the inbody operator to limit matches to the body of a webpage, which should make search hits more relevant. We have also taken this opportunity to update the WebCorp user interface, which is described in the following sections.
Who can use WebCorp?
WebCorp can be used by anyone who has an interest in language and how particular words and phrases are used, especially words and phrases which are too new or too rare to appear in any dictionary or standard corpus. Since its launch, WebCorp has been used by corpus linguists, lexicographers, language teachers and learners, publishers, journalists, advertisers, and researchers in a variety of fields. Although WebCorp is designed for linguistic data search, many users have found its results format (with relevant sections of text from multiple web pages collated on one page) useful for information retrieval of the type for which standard search engines are usually used.
Who is responsible for WebCorp?
WebCorp was created and is operated and maintained by the Research and Development Unit for English Studies (RDUES) at Birmingham City University.
How does it work?
The WebCorp interface is similar to the interfaces provided by standard search engines. You enter a word or phrase, choose options from the menus provided and then press the 'Search' button. WebCorp works 'on top of' the search engine of your choice, taking the list of URLs returned by that search engine and extracting concordance lines from each of those pages - examples of your chosen word or phrase in context. All of the concordance lines are presented on a single results page, with links to the sites from which they came.
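The extraction step described above can be sketched in a few lines of Python. This is an illustration only, not WebCorp's own code: the function name `concordance_lines`, the `width` parameter and the `pages` dictionary (standing in for pages already fetched from the URLs a search engine returned) are all assumptions made for the example.

```python
import re

def concordance_lines(pages, phrase, width=30):
    """Yield (url, left, match, right) for every occurrence of phrase.

    `pages` maps a URL to the plain text of that page; fetching the pages
    is outside the scope of this sketch.
    """
    # \b keeps 'plant' from matching inside 'plants' or 'implant'.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b")
    for url, text in pages.items():
        for m in pattern.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            yield url, left, m.group(0), right

# Hypothetical page texts, used only to demonstrate the function.
pages = {
    "example.org/a": "The nuclear plant was closed. A new plant opened nearby.",
    "example.org/b": "She watered the plant every day.",
}
lines = list(concordance_lines(pages, "plant", width=12))
for url, left, match, right in lines:
    print(f"{left!r} [{match}] {right!r}  <- {url}")
```

Every occurrence on every page is collected, which is the difference from a search engine's results list, where at most a short extract per page is shown.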
How is WebCorp different from search engines?
Search engines, such as Google and Bing, are designed to retrieve information from the World Wide Web. They use complex techniques to index the Web and return the documents from their indices which are most relevant for the user's request. WebCorp is designed to retrieve linguistic data from the Web: concordance lines showing the context in which the user's search term occurs. In response to a user query, standard search engines return a list of URLs (page addresses), along with a description of or some text from each page to help the user decide which pages are most useful. To view the pages, the user must click on each of the links individually.
WebCorp actually visits each one of these pages and extracts concordance lines from them. Although some search engines, such as Google, do give Key Word in Context style output for some of the URLs in the results list, this is not true for all of the URLs, and not all instances of the search term on each page are given in these short extracts. The search term may occur many times on a given page, but a Google user could not know this without clicking on each of the links manually. Google is an excellent search engine but it is not designed as a corpus linguistics tool and is not ideal for this purpose. WebCorp contains options (e.g. sorting concordance lines) specifically designed for linguistic research.
Why is WebCorp slow to return results?
WebCorp is slower than a search engine because, despite its search engine-like interface, its aims and the way it works are very different. To conduct a full linguistic analysis of how a particular word or phrase is used on the Web without WebCorp, you would have to use a search engine to find a list of pages containing the word or phrase, access each of the URLs in this list manually, locate each example of the word or phrase on the page and copy these into a file. WebCorp automates this whole process, which is why it is slower than a standard search engine. It is still a vast time-saver over the equivalent manual process.
Search

Search by entering a word or phrase in the 'Search' box and clicking the 'Search' button. You do not need to put quotation marks around the query terms as WebCorp will search for phrases by default.
Case insensitive
A case insensitive search will match both upper case and lower case variants of the search terms.
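As a rough illustration of the difference, the sketch below matches a phrase with and without case sensitivity using Python's `re` module. The function name `find_matches` and its `case_insensitive` parameter are assumptions for this example and do not describe how WebCorp implements the option.

```python
import re

def find_matches(text, phrase, case_insensitive=True):
    """Return every occurrence of phrase in text, as it appears in the text."""
    flags = re.IGNORECASE if case_insensitive else 0
    return [m.group(0) for m in re.finditer(re.escape(phrase), text, flags)]

sample = "Brexit, BREXIT and brexit"
insensitive = find_matches(sample, "brexit")
sensitive = find_matches(sample, "brexit", case_insensitive=False)
print(insensitive)
print(sensitive)
```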
Search engine
WebCorp works 'on top of' existing web search engines. More specifically, WebCorp connects to search engine APIs (Application Programming Interfaces) to retrieve hits for your search, and then performs its own processing on those hits. This option allows you to specify which search engine API you would like WebCorp to use. Different search engines cover different sub-sets of the Web's content. Currently, the available search engines are Brave (Web or News) and The Guardian Open Platform.
Language
You can choose to perform a search in a specific language. This information is passed on to the web search engine when fetching the initial results. Each search engine handles language differently.
Advanced Search

Site
This option allows you to restrict your search to all of the pages on an individual web site or all sites with a given domain. Note that this only works with the Brave web search. To search all of the pages on an individual site, enter the URL without the 'http://' part. For example, enter bbc.co.uk to search all pages on the BBC web site.
To restrict the search to sites within a given domain, enter part of a URL. For example, entering .ac.uk will restrict the search to UK academic institutions, while entering .fr will restrict the search to web sites in France.
To specify more than one domain, separate them with spaces or new lines, e.g. 1) 'www.cnn.com www.abc.com' 2) '.net .org'.
You can also specify domains that should not be included in the search results by prefacing them with a minus (-) sign.
Below the Site option there is now a list of frequently used domains. Select one from the list to insert it into the domain box. This list was compiled by inspecting WebCorp search logs.
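The syntax of the Site option (entries separated by spaces or new lines, a leading minus sign for exclusions, a leading dot for a whole domain) can be illustrated with the sketch below. It is not WebCorp's code, and it does not show how the option is passed to the Brave search API, which this guide does not describe. The function names `parse_site_option` and `host_allowed` are hypothetical.

```python
def parse_site_option(site_text):
    """Split the Site box contents into included and excluded domains."""
    include, exclude = [], []
    for entry in site_text.split():  # split() handles spaces and new lines
        if entry.startswith("-"):
            exclude.append(entry[1:].lower())
        else:
            include.append(entry.lower())
    return include, exclude

def _matches(host, domain):
    # 'bbc.co.uk' matches 'bbc.co.uk' and 'www.bbc.co.uk';
    # '.ac.uk' matches any host ending in '.ac.uk'.
    bare = domain.lstrip(".")
    return host == bare or host.endswith("." + bare)

def host_allowed(host, include, exclude):
    """Assumed rule: exclusions win; an empty include list allows any host."""
    host = host.lower()
    if any(_matches(host, d) for d in exclude):
        return False
    return not include or any(_matches(host, d) for d in include)

include, exclude = parse_site_option("bbc.co.uk .ac.uk\n-www.cnn.com")
print(include, exclude)
print(host_allowed("www.bcu.ac.uk", include, exclude))
```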
Word Filter
The word filter option allows you to include extra words which must appear or must not appear on the same web pages as your search term. Place a minus sign (-) before words which must not appear.
For example, with the search term 'plant' you may include 'nuclear -flower' as a word filter to restrict the sense of the search term (i.e. only show examples of the word 'plant' from web pages also containing the word 'nuclear' but not the word 'flower').
The words you specify in the word filter option will not necessarily be included in all concordance lines; they must simply appear or not appear on the web page from which the concordance was extracted.
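The page-level nature of the word filter can be shown with a small sketch. The functions `parse_word_filter` and `page_passes` are hypothetical names, and the simple `\w+` word matching is an assumption made for the example rather than a description of WebCorp's behaviour.

```python
import re

def parse_word_filter(filter_text):
    """Split a filter such as 'nuclear -flower' into required and excluded words."""
    required, excluded = [], []
    for term in filter_text.split():
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:].lower())
        else:
            required.append(term.lower())
    return required, excluded

def page_passes(page_text, required, excluded):
    """Check the filter against the whole page, not a single concordance line."""
    words = set(re.findall(r"\w+", page_text.lower()))
    return (all(w in words for w in required)
            and not any(w in words for w in excluded))

required, excluded = parse_word_filter("nuclear -flower")
print(page_passes("The nuclear plant was closed last year.", required, excluded))
print(page_passes("Each plant produces a single flower.", required, excluded))
```

A page that passes may still yield concordance lines for 'plant' in which 'nuclear' is not visible, because the filter word can occur anywhere on the page.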
Results

The search results are displayed as concordance lines in KWIC (Key Word In Context) format. WebCorp does this using a table, where the search phrase is displayed in a column down the centre of the page. The words occurring to the left and right are displayed in adjacent columns. This format makes it easier to find patterns in the text surrounding the search phrase, which is useful for linguistic study. The URL of the page where the match was found is also displayed.
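A plain-text approximation of this layout is sketched below: the left context is right-aligned and the right context left-aligned, so that the search phrase lines up in a single column. The function name `format_kwic`, the `width` parameter (assumed to be greater than zero) and the sample rows are assumptions for illustration; WebCorp itself renders the results as a table on the results page.

```python
def format_kwic(rows, width=20):
    """Format (left, keyword, right, url) tuples as aligned KWIC lines."""
    lines = []
    for left, keyword, right, url in rows:
        # Keep only the context nearest the keyword, then pad to a fixed width.
        lines.append(
            f"{left[-width:]:>{width}}  {keyword}  {right[:width]:<{width}}  {url}"
        )
    return lines

# Hypothetical concordance data.
rows = [
    ("The nuclear", "plant", "was closed.", "example.org/a"),
    ("A new", "plant", "opened nearby.", "example.org/a"),
    ("She watered the", "plant", "every day.", "example.org/b"),
]
kwic = format_kwic(rows, width=16)
for line in kwic:
    print(line)
```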
Sorting
The table columns can be sorted by clicking on the header row. This allows the left and right contexts to be sorted alphabetically, which also helps with the identification of lexical patterns. Sorting is dependent on being able to tokenise the text (split it into words), so may not work with all character sets.
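The dependence of sorting on tokenisation can be seen in the following sketch. It assumes that the right context is compared starting from the first word after the search phrase and the left context starting from the last word before it; this is a common convention for concordance sorting, but whether WebCorp sorts in exactly this way is an assumption. The names `tokens`, `sort_by_right` and `sort_by_left` are hypothetical.

```python
import re

def tokens(text):
    """Split text into lower-cased words. This simple pattern relies on
    spaces or punctuation between words, so text in scripts written
    without word separators comes back as one long token."""
    return re.findall(r"\w+", text.lower())

def sort_by_right(rows):
    """Sort (left, keyword, right) rows by the words following the keyword."""
    return sorted(rows, key=lambda row: tokens(row[2]))

def sort_by_left(rows):
    """Sort rows by the words preceding the keyword, nearest word first."""
    return sorted(rows, key=lambda row: tokens(row[0])[::-1])

rows = [
    ("the nuclear", "plant", "was closed"),
    ("a new", "plant", "opened nearby"),
    ("watered the", "plant", "every day"),
]
by_right = sort_by_right(rows)
by_left = sort_by_left(rows)
print([row[2] for row in by_right])
print([row[0] for row in by_left])
```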