User Guide


About WebCorp LSE

WebCorp Linguist's Search Engine is a specially tailored search engine for the study of language on the web.

Commercial search engines are poorly suited to studying language on the web, so a specialised search engine designed around the needs of linguists was required. The original WebCorp (now WebCorp Live) uses commercial search engines to extract results from the web and organises the information for linguistic study. Because of the limitations of this approach, we developed a fully tailored linguistic search engine: WebCorp LSE.

WebCorp LSE is powered by our own search engine, developed at Birmingham City University. Our specially-designed web crawler, parser, indexer and other components allow us to cache and process large sections of the web. The new architecture has allowed us to enhance the sentence boundary detection, date identification, 'junk' (or 'boilerplate') removal, collocation and other statistical analysis options currently available in WebCorp Live. Additional pre-processing includes grammatical tagging, lemmatisation and language detection. The search interface enables queries for words, lemmas and phrases, including wildcards and pattern matching.

WebCorp LSE is being updated substantially during 2022. The WebCorp LSE corpora are now annotated using the Stanford CoreNLP tools and include lemma annotations and part-of-speech categories based on the Universal Dependencies framework. In addition, WebCorp LSE now pre-compiles a substantial amount of information for words, lemmas and frequent n-grams. This makes exploring collocation and frequency change-over-time information much faster.

How can I reference WebCorp LSE?

WebCorp LSE is being developed and operated by the Research and Development Unit for English Studies (RDUES) in the School of English at Birmingham City University. WebCorp LSE was launched in 2010, and updated to the current version in 2022. It can be referenced as software or a website, for example:

Research and Development Unit for English Studies. 2022. WebCorp Linguist's Search Engine (version 2). Birmingham City University. Accessed on [date]. https://www.webcorp.org.uk/lse


Search Words and Lemmas

WebCorp LSE now pre-compiles a large amount of information to make exploring word frequency, collocation and change-over-time faster. When you choose the option to Search Words and Lemmas from the corpora page, you will be accessing this pre-compiled information.

Search

The search tab shows a word frequency list for the corpus or sub-corpus selected. This includes token and document frequency and normalised frequencies. You can choose the sub-corpus using the dropdown menu at the top right of the page.

The options above the table allow you to choose between words or lemmas (with POS tags), limit by part-of-speech, exclude stopwords or show frequent n-grams, and search the word list for substrings. E.g. searching the word list for 'run' will show all types which contain the string 'run', with exact matches highlighted at the top of the list. Note that frequent n-grams are those which have a frequency greater than 1 per million words in the corpus.
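
The frequent n-gram threshold described above can be illustrated with a minimal sketch. The function names and the corpus size are hypothetical, chosen only for illustration; the guide does not describe how WebCorp LSE computes these values internally.

```python
def per_million(count: int, corpus_tokens: int) -> float:
    """Normalise a raw frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_tokens


def is_frequent_ngram(count: int, corpus_tokens: int, threshold: float = 1.0) -> bool:
    """True when the n-gram occurs more than `threshold` times per million tokens."""
    return per_million(count, corpus_tokens) > threshold


# Hypothetical figures: a corpus of 250 million tokens.
print(per_million(500, 250_000_000))        # 2.0
print(is_frequent_ngram(500, 250_000_000))  # True
print(is_frequent_ngram(250, 250_000_000))  # False: exactly 1 per million is not "greater than 1"
```

Normalising in this way makes frequencies comparable across sub-corpora of different sizes.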

Selecting a row in the frequency table or clicking on an arrow button on the right-hand side will display further options for the selected word/lemma, including concordances, collocates, word history and collocate comparison.

Concordances

This tab will appear after a word/lemma has been selected. It shows concordances for the word/lemma in Keyword in Context (KWIC) format. If necessary, a progress bar will be shown and, once the search is complete, randomly sampled concordances will be displayed in a table. The node word will be displayed down the centre of the page with the left and right textual context on either side. The table can be sorted by clicking on the column names. Hovering over a word will show a tooltip containing grammatical information. Note that the random sample will remain the same each time the page is loaded, but more concordances can be loaded using the button at the bottom of the table. Other columns included in the concordances show the date, sub-corpus, website and a link to the webpage, but note that the webpage may have changed since the data was collected.

Collocates

This tab will appear after a word/lemma has been selected. It shows frequency information for collocates of the selected node word/lemma. Collocates are words which co-occur in close proximity to a node word. WebCorp LSE pre-compiles span-based collocates at spans of 1 (immediately adjacent to the node) and 4 words. The table includes left frequency and right frequency (total number of co-occurrences either side of the node), total co-occurrence frequency and collocation score. You can choose the size of the span, exclude stopwords or filter by part-of-speech and change the measure by which the collocates are ordered.
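
As a rough illustration of span-based collocation, the sketch below counts left and right co-occurrences around a node word in a tokenised text. It is a simplified assumption of how such counts can be gathered, not WebCorp LSE's implementation: the function name is hypothetical, sentence boundaries are ignored, and no collocation score is computed because the guide does not specify the measures used.

```python
from collections import Counter


def span_collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens to the left and right of `node`."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left.update(tokens[max(0, i - span):i])   # up to `span` words before the node
        right.update(tokens[i + 1:i + 1 + span])  # up to `span` words after the node
    return left, right


tokens = "the cat sat on the mat and the cat slept".split()

left, right = span_collocates(tokens, "cat", span=1)
print(left)   # Counter({'the': 2})
print(right)  # 'sat' and 'slept', once each

left4, right4 = span_collocates(tokens, "cat", span=4)
total4 = left4 + right4  # total co-occurrence frequency
print(total4["the"])     # 4
```

A span of 1 captures only immediately adjacent words, while a span of 4 also picks up looser associations such as 'mat'.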

Word History

This tab will appear after a word/lemma has been selected. It shows frequency information concerning the year and month in which the selected word/lemma occurs. A time-series graph shows normalised frequencies per month and a moving average. Mouse over the graph to see more information and drag across the graph to select a date range. Clicking the Search button will show concordances within the selected date range. The graph also includes options to display a trend analysis (using Cox's sequential test for trend), sudden jump analysis (based on changes in local means), and components of a time-series decomposition, including the seasonal component and remaining random/residual component. More details on these tests are forthcoming in the publication below.

Below the time-series graph are two more graphs which show normalised frequencies per month. The first graph shows the monthly means. The second graph shows the difference between the means and the smoothed frequency (i.e. the seasonal decomposition component).
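
The smoothing and monthly means described above can be sketched as follows. This is only an illustration under stated assumptions: the window size, the centring of the moving average and the reading of "monthly means" as a mean per calendar month are all guesses, since the guide does not give these details, and both function names are hypothetical.

```python
from collections import defaultdict


def moving_average(values, window=3):
    """Centred moving average; positions near the edges use the neighbours that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def calendar_month_means(series):
    """series: list of ("YYYY-MM", frequency) pairs -> mean frequency per calendar month."""
    buckets = defaultdict(list)
    for year_month, freq in series:
        buckets[int(year_month[5:7])].append(freq)
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}


# Hypothetical normalised frequencies per month.
series = [("2020-01", 2.0), ("2020-02", 4.0), ("2021-01", 6.0), ("2021-02", 8.0)]

print(moving_average([freq for _, freq in series]))  # [3.0, 4.0, 6.0, 7.0]
print(calendar_month_means(series))                  # {1: 4.0, 2: 6.0}
```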

The change-over-time analysis is based on our research during and following the APRIL project. For more information please see:

Renouf, A. 2012. A Finer Definition of Neology in English: the life-cycle of a word. In Corpus Perspectives on Patterns of Lexis, edited by H. Hasselgård, O.E. Signe and E. Jarle, 177-208. Amsterdam/Philadelphia: John Benjamins.

Kehoe, A., M. Gee and A. Renouf. (forthcoming). A data-driven approach to finding significant changes in language use through time series analysis. In Language in time, time in language, edited by S. Flach and M. Hilpert. Amsterdam/Philadelphia: John Benjamins.

Compare

This tab enables the collocates of two words/lemmas to be compared. You can select the words for comparison using the arrow buttons on the right-hand side of the frequency tables mentioned above. The arrow buttons will open a dropdown including options for Compare (word 1) and Compare (word 2).

The comparison results include three lists of words. The left-hand list shows the words that collocate with word 1 but not with word 2, the middle list shows the words which collocate with both word 1 and word 2, and the right-hand list shows words which do not collocate with word 1 but do with word 2. You can choose the span and whether to filter by stopwords or part-of-speech using the options above the lists.
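
The three lists correspond to simple set operations on the two collocate sets. The sketch below uses made-up collocates and a hypothetical function name; it does not reflect how WebCorp LSE decides which words count as collocates.

```python
def compare_collocates(collocates_1, collocates_2):
    """Split two collocate sets into (only word 1, shared, only word 2)."""
    only_1 = sorted(collocates_1 - collocates_2)
    shared = sorted(collocates_1 & collocates_2)
    only_2 = sorted(collocates_2 - collocates_1)
    return only_1, shared, only_2


# Hypothetical collocate sets for 'strong' (word 1) and 'powerful' (word 2).
strong = {"tea", "coffee", "wind", "evidence"}
powerful = {"engine", "computer", "evidence", "wind"}

left_list, middle_list, right_list = compare_collocates(strong, powerful)
print(left_list)    # ['coffee', 'tea']
print(middle_list)  # ['evidence', 'wind']
print(right_list)   # ['computer', 'engine']
```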

The compare option is based on our research during the Repulsion project. For more information please see:

Renouf, A. and J. Banerjee. 2007. The search for repulsion: a new corpus analytical approach. In Towards Multimedia in Corpus Studies, edited by P. Pahta, I. Taavitsainen, T. Nevalainen and J. Tyrkkö. Electronic publication, University of Helsinki.

Renouf, A. and J. Banerjee. 2007. Lexical repulsion between sense related pairs. International Journal of Corpus Linguistics, 12(3), 415-443.

Renouf, A. and J. Banerjee. 2008. The phenomenon of lexical repulsion in text. Lingvisticae Investigationes, 31(2), 213-225.


Advanced Search

Perform a search by entering your search string in the Query input box and then clicking the Search button. You will first be shown a progress bar. Once the search is finished, concordances and other summary data will be available to view. A number of supporting search options can also be chosen, as described below.

Query Terms

Any of the following can be entered as queries.

run
    Word search.
RUN
    Lemma search. Use full caps to search for lemmas.
run/ran
    Alternative words. Either 'run' or 'ran' will be matched.
r?n
    Any one character in the ? position. E.g. possible matches are 'run' or 'ran'.
run*
+ing
anti*tion
    Character wildcards. A * matches any number of characters in that position, including none. E.g. possible matches of run* include 'run', 'runs', 'running' and 'rune'. A + matches one or more characters.
sign~up
    Compounds with or without hyphens. E.g. possible matches include 'signup', 'sign-up' and 'sign up'.
too many cooks spoil
    Phrase search.
too many * spoil
    Phrase containing a wildcard. In this case any word, or no word, can appear in the * position.
too many + spoil
    Phrase containing a wildcard. In this case a word (any word) must appear in the + position.
too many !cooks spoil
    Not operator. In this case any word other than 'cooks' can appear in the ! position.
too many *3 spoil
too many +3 spoil
    Phrase containing a multiple wildcard. In this case up to three words can appear in the *3 or +3 position. The maximum length of this wildcard is 9 words.
carpe diem || seize the day
    Alternative query operator. Find different phrases with one search.
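
To make the single-word operators concrete, the sketch below translates a word pattern into a Python regular expression. It is an approximation for illustration only: the function names are hypothetical, `\w` is an assumption about what counts as a character, and the phrase-level operators (word wildcards, !, ||) and lemma search are not covered.

```python
import re


def word_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate ?, *, +, / and ~ in a single-word query into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(r"\w")       # exactly one character
        elif ch == "*":
            parts.append(r"\w*")      # any number of characters, including none
        elif ch == "+":
            parts.append(r"\w+")      # one or more characters
        elif ch == "/":
            parts.append("|")         # alternative words
        elif ch == "~":
            parts.append(r"[- ]?")    # compound written solid, hyphenated or spaced
        else:
            parts.append(re.escape(ch))
    return re.compile("(?:" + "".join(parts) + ")")


def matches(pattern: str, text: str) -> bool:
    """True when the whole of `text` is matched by the query pattern."""
    return word_pattern_to_regex(pattern).fullmatch(text) is not None


print(matches("run*", "rune"))        # True
print(matches("run*", "rerun"))       # False
print(matches("+ing", "running"))     # True
print(matches("+ing", "ing"))         # False: + needs at least one character
print(matches("sign~up", "sign-up"))  # True
```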

Part-of-Speech Search

Part-of-Speech (POS) tags can be added to queries by placing them inside curly brackets, e.g. noun is {NOUN} (see tagsets below).

record{VERB}
    POS tag attached to a word. Only instances of 'record' as the specified POS type (in this case a verb) will be matched.
record/tape{VERB}
run*{VERB}
    POS tag attached to a pattern. Each word matched by the pattern must also match the specified POS type.
{VERB}
they {VERB} it
    POS tag only. Any verb will be matched in the {VERB} position.

The same searches can be performed using Penn Treebank POS tags by specifying Penn= inside double curly brackets. E.g. a past tense verb would be specified with {{Penn=VBD}}.
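
The two bracket notations can be told apart mechanically. The sketch below splits a single query token into its word pattern and an optional tag; the regular expression and the function name are assumptions made for illustration and say nothing about how WebCorp LSE parses queries.

```python
import re

# A word pattern, optionally followed by {TAG} or {{Penn=TAG}}.
TOKEN_RE = re.compile(
    r"^(?P<pattern>[^{}]*)"
    r"(?:\{\{Penn=(?P<penn>[^{}]+)\}\}|\{(?P<ud>[^{}]+)\})?$"
)


def split_pos(token: str):
    """Return (word pattern, Universal Dependencies tag, Penn Treebank tag)."""
    m = TOKEN_RE.match(token)
    if m is None:
        raise ValueError(f"unrecognised query token: {token!r}")
    return m.group("pattern"), m.group("ud"), m.group("penn")


print(split_pos("record{VERB}"))   # ('record', 'VERB', None)
print(split_pos("run*{VERB}"))     # ('run*', 'VERB', None)
print(split_pos("{VERB}"))         # ('', 'VERB', None)
print(split_pos("{{Penn=VBD}}"))   # ('', None, 'VBD')
```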

The dropdown menus below the Query box list the available tags. Selecting a tag from these lists will add it to the query.

Search Options

These options are available on the Advanced Search page and can be used to change the scope of your search:

Sub-corpus
    Choose a sub-corpus from the dropdown list to limit the search to that subsection of the corpus. See the details for each corpus to learn more about the sub-corpora.
Case insensitive
    Tick this to match case variants of your search words.
Within a single sentence
    Tick this to prevent phrasal queries from spanning sentence boundaries.
Date from and Date to
    Enter a date in the form YYYY-MM to limit the range of dates searched. E.g. 2008-08 specifies August 2008.
Sites filter
    Enter website domain names that the search must match. Add a minus (-) in front of a domain to exclude that domain. Substrings of domain names can be used. E.g. .co.uk -.com
Word Filter
    This enables you to specify words which must or must not appear within the same document as your search term. Place a minus (-) before words which must not appear. E.g. if you search for the word plant you can limit the documents matched by entering nuclear -flower. Only results from documents containing the word 'nuclear' but not the word 'flower' will be returned.
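
The minus convention shared by the Sites filter and the Word Filter can be sketched as below. The function names are hypothetical, and treating a document as a plain collection of words is a simplifying assumption.

```python
def parse_filter(text: str):
    """Split a filter string into (required terms, excluded terms)."""
    required, excluded = [], []
    for term in text.split():
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])  # a leading minus marks an exclusion
        else:
            required.append(term)
    return required, excluded


def document_passes(doc_words, word_filter: str) -> bool:
    """Word Filter: every required word is present and no excluded word is."""
    required, excluded = parse_filter(word_filter)
    words = set(doc_words)
    return all(w in words for w in required) and not any(w in words for w in excluded)


print(parse_filter("nuclear -flower"))  # (['nuclear'], ['flower'])
print(parse_filter(".co.uk -.com"))     # (['.co.uk'], ['.com'])
print(document_passes(["nuclear", "power", "plant"], "nuclear -flower"))   # True
print(document_passes(["nuclear", "flower", "plant"], "nuclear -flower"))  # False
```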

Advanced Search Results

After starting your search you will be shown a progress bar. Once the search is finished concordances and other summary data will be available in the following tabs:

Concordances

This tab displays concordances in Keyword in Context (KWIC) format. The matches of your query will be displayed down the centre of the page with the left and right textual context on either side. Hovering over a word will show a tooltip containing grammatical information. Clicking on the match will open an extended concordance view in a popup window. Above the concordance table you will find an option to either view randomly sampled concordances (which can be helpful when there are many results) or page through the full set of concordances. The random concordances table can be sorted by clicking on the column names. Options to extend the list of concordances or move between pages are shown at the bottom of the page. Note that the random sample will remain the same each time the page is loaded. Other columns included in the concordances show the date, sub-corpus, website and a link to the webpage, but note that the webpage may have changed since the data was collected.

Sub-corpora

This tab shows how many matches occurred in each of the sub-corpora in a table, including token and document frequencies, normalised frequencies (relative to the size of each sub-corpus) and proportions (relative to the number of search results).
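
The difference between the normalised frequency and the proportion columns can be seen in a small worked sketch. The sub-corpus names and sizes are invented, the function name is hypothetical, and normalising per million tokens is an assumption.

```python
def subcorpus_rows(matches, sizes):
    """Build per-sub-corpus figures from match counts and sub-corpus sizes (in tokens)."""
    total_matches = sum(matches.values())
    rows = {}
    for name, count in matches.items():
        rows[name] = {
            "per_million": count * 1_000_000 / sizes[name],  # relative to sub-corpus size
            "proportion": count / total_matches,             # relative to all search results
        }
    return rows


# Hypothetical figures.
matches = {"news": 30, "blogs": 10}
sizes = {"news": 10_000_000, "blogs": 2_000_000}

rows = subcorpus_rows(matches, sizes)
print(rows["news"])   # {'per_million': 3.0, 'proportion': 0.75}
print(rows["blogs"])  # {'per_million': 5.0, 'proportion': 0.25}
```

Here 'news' accounts for most of the results, but the query is relatively more frequent in the smaller 'blogs' sub-corpus.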

Clicking one of the arrow buttons on the right-hand side of the table creates a copy of the search limited to that sub-corpus.

Dates

This tab provides frequency information concerning the year and month in which the results occur. A time-series graph shows normalised frequencies per month and a moving average. Mouse over the graph to see more information and drag across the graph to select a date range from which a new search can be constructed.

Below the graph, a table provides information about how many matches occurred in each month, including token and document frequencies, normalised frequencies (relative to the (sub-)corpus size for each month) and proportions (relative to the number of search results).

Sites

This tab shows how many times a website occurred in the results in a table, including token and document frequencies, normalised frequencies (relative to the total frequency of the site in the (sub-)corpus) and proportions (relative to the number of search results).

Selecting sites using the tick boxes on the right-hand side then clicking the arrow button in the table header creates a copy of the search to match only the selected sites.

Match Summary

This provides frequency information for the words and lemmas which matched your query. The matches are shown in a table including token and document frequencies, normalised frequencies (relative to the size of the (sub-)corpus searched) and proportions (relative to the number of search results). Note that it takes time to compile a summary of the matches, so the first time you use this tab you will need to wait for the summary information to be generated.

You can choose between displaying matching words or matching lemmas (with POS tags). Selecting matches using the tick boxes on the right-hand side then clicking the arrow button in the table header creates a new search using only the selected matches.

Collocates

This tab shows a table of collocates. Collocates are words which co-occur in close proximity to a node word (in this case the node being the query matches). The table includes frequency information for each position left and right up to a span of 5 words. It also includes the document frequency (the number of documents in which the co-occurrence within the selected span occurred), left frequency and right frequency (total number of co-occurrences either side of the node), total co-occurrence frequency and collocation score. Note that it takes time to compile a summary of the collocates, so the first time you use this tab you will need to wait for the summary information to be generated.

You can choose between displaying matching words or matching lemmas (with POS tags), set the size of the span, show the left side, the right side or both, exclude stopwords, filter by POS and change the measure by which the collocates are ordered. Selecting collocates using the tick boxes on the right-hand side, then clicking the arrow button in the table header, will create a new search containing the selected collocates.


See also

The main part-of-speech (POS) categories are based on the Universal Dependencies framework:

ADJ Adjective
ADP Preposition/Postposition
ADV Adverb
CCONJ Coordinating Conjunction
DET Determiner
INTJ Interjection
NOUN Noun
NUM Number
PART Particle
PRON Pronoun
PROPN Proper Noun
VERB Verb

It is also possible to search for Penn Treebank POS tags through the advanced search system.

CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb

The stopword list used in WebCorp LSE filters out numbers, single characters and the following 'grammatical' words:

'd, 'll, 'm, 're, 's, 've, a, about, above, across, after, against, all, along, alongside, also, although, always, am, amid, amidst, among, amongst, an, and, any, anybody, anyone, anything, anywhere, apropos, are, aren't, arent, around, as, at, atop, be, because, been, before, behind, being, below, beneath, beside, besides, between, beyond, both, but, by, can, can't, cannot, cant, cos, could, couldn't, couldnt, coz, d, dare, daren't, darent, despite, did, didn't, didnt, do, does, doesn't, doesnt, doing, don't, done, dont, dr, during, each, either, else, every, everybody, everyone, everything, everywhere, except, few, for, from, go, going, had, hadn't, hadnt, has, hasn't, hasnt, have, haven't, havent, having, he, he'd, he'll, he's, hed, hell, her, here, hers, herself, hes, him, himself, his, how, however, i, i'd, i'll, i'm, i've, id, if, ill, im, in, inside, into, is, isn't, isnt, it, it'd, it'll, it's, itd, itll, its, itself, ive, less, like, ll, m, make, many, may, mayn't, maynt, me, might, mine, minus, more, most, mr, mrs, much, must, mustn't, mustnt, my, myself, n't, needn't, neednt, neither, never, nevertheless, no, no-one, nobody, none, nonetheless, noone, nor, not, nothing, notwithstanding, nt, of, off, often, on, one, only, or, other, ought, oughtn't, oughtnt, our, ours, ourselves, out, outside, over, part, per, plus, rather, re, s, shall, shan't, shant, she, she'd, she'll, she's, shed, shell, shes, should, shouldn't, shouldnt, since, so, some, somebody, someone, someplace, something, sometime, sometimes, somewhere, t, than, that, that'd, that'll, that's, thatd, thatll, thats, the, thee, their, theirs, them, themselves, then, there, there'd, there'll, there's, there've, thered, therefore, therell, theres, thereve, therewith, these, they, they'd, they'll, they're, they've, theyd, theyll, theyre, theyve, thine, this, those, thou, though, through, throughout, thus, thy, till, to, too, toward, towards, under, underneath, until, up, upon, us, ve, very, via, was, 
wasn't, wasnt, we, we'd, we'll, we're, wed, well, well, were, were, what, what'd, what'll, what's, what've, whatd, whatever, whatll, whats, whatsoever, whatve, when, whenever, where, wherever, whether, which, whichever, while, whilst, who, whom, whose, why, will, with, within, without, won't, wont, would, wouldn't, wouldnt, ye, yeah, yes, you, you'd, you'll, you're, you've, youd, youll, your, youre, yours, yourself, yourselves, youve
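
A stopword filter of this kind can be sketched as follows. The set below is only a short excerpt of the list above, the function name is hypothetical, and both the lowercasing step and the test for numbers (digits only) are assumptions rather than documented behaviour.

```python
# A short excerpt from the stopword list above, for illustration only.
STOPWORDS_EXCERPT = {"a", "and", "isn't", "isnt", "of", "the", "with"}


def is_filtered(token: str) -> bool:
    """True when a token would be removed: a number, a single character or a stopword."""
    t = token.lower()
    if len(t) == 1:
        return True
    if t.isdigit():
        return True
    return t in STOPWORDS_EXCERPT


tokens = ["The", "cat", "ate", "3", "x", "fish", "and", "12", "chips"]
print([t for t in tokens if not is_filtered(t)])  # ['cat', 'ate', 'fish', 'chips']
```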