User Guide
About WebCorp LSE
WebCorp Linguist's Search Engine is a specially tailored search engine for the study of language on the web.
Commercial search engines are poorly suited to studying language on the web, so a specialised search engine designed around the needs of linguists was required. The original WebCorp (now WebCorp Live) uses commercial search engines to extract results from the web and organises the information for linguistic study. Because of the limitations of that approach, we developed a fully tailored linguistic search engine: WebCorp LSE.
WebCorp LSE is powered by our own search engine, developed at Birmingham City University. Our specially-designed web crawler, parser, indexer and other components allow us to cache and process large sections of the web. The new architecture has allowed us to enhance the sentence boundary detection, date identification, 'junk' (or 'boilerplate') removal, collocation and other statistical analysis options currently available in WebCorp Live. Additional pre-processing includes grammatical tagging, lemmatisation and language detection. The search interface enables queries for words, lemmas and phrases, including wildcards and pattern matching.
WebCorp LSE is being updated substantially during 2022. The WebCorp LSE corpora are now annotated using the Stanford Core NLP tools and now include lemma annotations and part-of-speech categories based on the Universal Dependencies framework. In addition, WebCorp LSE now pre-compiles a substantial amount of information for words, lemmas and frequent n-grams. This makes exploring collocation and frequency change-over-time information much faster.
How can I reference WebCorp LSE?
WebCorp LSE is being developed and operated by the Research and Development Unit for English Studies (RDUES) in the School of English at Birmingham City University. WebCorp LSE was launched in 2010, and updated to the current version in 2022. It can be referenced as software or a website, for example:
Research and Development Unit for English Studies. 2022. WebCorp Linguist's Search Engine (version 2). Birmingham City University. Accessed on [date]. https://www.webcorp.org.uk/lse
Search Words and Lemmas
WebCorp LSE now pre-compiles a large amount of information to make exploring word frequency, collocation and change-over-time faster. When you choose the option to Search Words and Lemmas from the corpora page, you will be accessing this pre-compiled information.
Search
The search tab shows a word frequency list for the corpus or sub-corpus selected. This includes token and document frequency and normalised frequencies. You can choose the sub-corpus using the dropdown menu at the top right of the page.
The options above the table allow you to choose between words or lemmas (with POS tags), limit by part-of-speech, exclude stopwords or show frequent n-grams, and search the word list for substrings. E.g. searching the word list for 'run' will show all types which contain the string 'run', with exact matches highlighted at the top of the list. Note that frequent n-grams are those which have a frequency greater than 1 per million words in the corpus.
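The word-list behaviour described above can be illustrated with a minimal Python sketch. It assumes a plain dictionary of raw counts and a known corpus size; the function names and toy counts are hypothetical and do not reflect how WebCorp LSE stores or queries its pre-compiled data.

```python
from typing import Dict, List, Tuple


def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def search_word_list(freqs: Dict[str, int], query: str) -> List[Tuple[str, int]]:
    """Return all types containing `query`, exact matches first, then by frequency."""
    hits = [(word, count) for word, count in freqs.items() if query in word]
    # Exact matches sort first (False < True), then descending count, then A-Z.
    return sorted(hits, key=lambda wc: (wc[0] != query, -wc[1], wc[0]))


def frequent_ngrams(ngram_freqs: Dict[str, int], corpus_size: int) -> List[str]:
    """Keep n-grams whose normalised frequency exceeds 1 per million words."""
    return [g for g, c in ngram_freqs.items() if per_million(c, corpus_size) > 1.0]


# Toy counts, invented purely for illustration.
word_freqs = {"run": 120, "running": 300, "rune": 4, "walk": 90}
print(search_word_list(word_freqs, "run"))
```

Here 'run' is listed first as the exact match, followed by the other types containing the string in descending order of frequency.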
Selecting a row in the frequency table or clicking on an arrow button on the right-hand side will display further options for the selected word/lemma, including concordances, collocates, word history and collocate comparison.
Concordances
This tab will appear after a word/lemma has been selected. It shows concordances for the word/lemma in Keyword in Context (KWIC) format. If necessary, a progress bar will be shown and, once complete, randomly sampled concordances will be displayed in a table. The node word will be displayed down the centre of the page with the left and right textual context on either side. The table can be sorted by clicking on the column names. Hovering over a word will show a tooltip containing grammatical information. Note that the random sample will remain the same each time the page is loaded, but more concordances can be loaded using the button at the bottom of the table. Other columns included in the concordances show the date, sub-corpus, website and a link to the webpage, but note that the webpage may have changed since the data was collected.
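As a rough illustration of the KWIC layout and of a random sample that stays the same between page loads, the sketch below builds concordance rows from a token list and samples them with a fixed seed. The fixed seed is only one plausible way to obtain a repeatable sample and is an assumption; the guide does not say how WebCorp LSE achieves this, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

Row = Tuple[str, str, str]  # (left context, node, right context)


def kwic(tokens: List[str], node: str, width: int = 4) -> List[Row]:
    """Build KWIC rows with up to `width` tokens of context on each side."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def stable_sample(rows: List[Row], k: int, seed: int = 0) -> List[Row]:
    """Draw the same random sample on every call by seeding a private generator."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


tokens = "the cat sat on the mat".split()
for left, node, right in kwic(tokens, "the", width=2):
    # Right-align the left context so the node lines up down the centre.
    print(f"{left:>10} | {node} | {right}")
```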
Collocates
This tab will appear after a word/lemma has been selected. It shows frequency information for collocates of the selected node word/lemma. Collocates are words which co-occur in close proximity to a node word. WebCorp LSE pre-compiles span-based collocates at spans of 1 (immediately adjacent to the node) and 4 words. The table includes left frequency and right frequency (total number of co-occurrences either side of the node), total co-occurrence frequency and collocation score. You can choose the size of the span, exclude stopwords or filter by part-of-speech and change the measure by which the collocates are ordered.
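The left, right and total co-occurrence frequencies described above can be sketched as follows. This is a simplified illustration over a single token list, with hypothetical names; it ignores sentence and document boundaries and omits the collocation score, since the guide does not specify which measures are used.

```python
from collections import Counter
from typing import Dict, List, Tuple


def span_collocates(tokens: List[str], node: str, span: int = 4) -> Dict[str, Tuple[int, int, int]]:
    """Map each collocate to (left frequency, right frequency, total) within `span` words."""
    left: Counter = Counter()
    right: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left.update(tokens[max(0, i - span):i])       # words before the node
        right.update(tokens[i + 1:i + 1 + span])      # words after the node
    return {w: (left[w], right[w], left[w] + right[w]) for w in set(left) | set(right)}


tokens = "strong tea and strong coffee with tea".split()
print(span_collocates(tokens, "tea", span=1))
```

With a span of 1 only immediately adjacent words are counted, so 'strong' and 'with' appear as left collocates of 'tea' and 'and' as a right collocate.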
Word History
This tab will appear after a word/lemma has been selected. It shows frequency information concerning the year and month in which the selected word/lemma occurs. A time-series graph shows normalised frequencies per month and a moving average. Mouse over the graph to see more information and drag across the graph to select a date range. Clicking the Search button will show concordances within the selected date range. The graph also includes options to display a trend analysis (using Cox's sequential test for trend), sudden jump analysis (based on changes in local means), and components of a time-series decomposition, including the seasonal component and remaining random/residual component. More details on these tests are forthcoming in the publication below.
Below the time-series graph are two more graphs which show normalised frequencies per month. The first graph shows the monthly means. The second graph shows the difference between the means and the smoothed frequency (i.e. the seasonal decomposition component).
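The normalised monthly frequencies and the moving average can be illustrated with a short sketch. The per-million scale, the window of three months and the shrinking window at the edges of the series are all assumptions made for the example; the guide does not state the parameters WebCorp LSE uses, and the trend and jump tests are not reproduced here.

```python
from typing import Dict, List


def normalise_monthly(hits: Dict[str, int], sizes: Dict[str, int]) -> Dict[str, float]:
    """Per-million frequency for each YYYY-MM key, given monthly corpus sizes."""
    return {m: hits.get(m, 0) * 1_000_000 / sizes[m] for m in sorted(sizes)}


def moving_average(values: List[float], window: int = 3) -> List[float]:
    """Centred moving average; the window shrinks at the edges of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy figures, invented purely for illustration.
hits = {"2008-08": 5}
sizes = {"2008-08": 1_000_000, "2008-09": 2_000_000}
print(normalise_monthly(hits, sizes))
print(moving_average([1.0, 2.0, 3.0, 4.0]))
```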
The change-over-time analysis is based on our research during and following the APRIL project. For more information please see:
Renouf, A. 2012. A Finer Definition of Neology in English: the life-cycle of a word. In Corpus Perspectives on Patterns of Lexis, edited by H. Hasselgård, O.E. Signe and E. Jarle, 177-208. Amsterdam/Philadelphia: John Benjamins.
Kehoe, A., M. Gee and A. Renouf. (forthcoming). A data-driven approach to finding significant changes in language use through time series analysis. In Language in time, time in language, edited by S. Flach and M. Hilpert. Amsterdam/Philadelphia: John Benjamins.
Compare
This tab enables the collocates of two words/lemmas to be compared. You can select the words for comparison using the arrow buttons on the right-hand side of the frequency tables mentioned above. The arrow buttons will open a dropdown including options for Compare (word 1) and Compare (word 2).
The comparison results include three lists of words. The left-hand list shows the words that collocate with word 1 but not with word 2, the middle list shows the words which collocate with both word 1 and word 2, and the right-hand list shows words which do not collocate with word 1 but do with word 2. You can choose the span and whether to filter by stopwords or part-of-speech using the options above the lists.
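The three lists correspond to simple set operations on the two collocate sets. A minimal sketch, assuming the collocates of each word have already been extracted (the function name and example words are hypothetical):

```python
from typing import List, Set, Tuple


def compare_collocates(word1: Set[str], word2: Set[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split two collocate sets into (only word 1, shared, only word 2)."""
    only_1 = sorted(word1 - word2)   # left-hand list
    shared = sorted(word1 & word2)   # middle list
    only_2 = sorted(word2 - word1)   # right-hand list
    return only_1, shared, only_2


# Invented collocate sets for illustration.
left, middle, right = compare_collocates({"tea", "wind", "coffee"}, {"tea", "engine"})
print(left, middle, right)
```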
The compare option is based on our research during the Repulsion project. For more information please see:
Renouf, A. and J. Banerjee. 2007. The search for repulsion: a new corpus analytical approach. In Towards Multimedia in Corpus Studies, edited by P. Pahta, I. Taavitsainen, T. Nevalainen and J. Tyrkkö. Electronic publication, University of Helsinki.
Renouf, A. and J. Banerjee. 2007. Lexical repulsion between sense related pairs. International Journal of Corpus Linguistics, 12(3), 415-443.
Renouf, A. and J. Banerjee. 2008. The phenomenon of lexical repulsion in text. Lingvisticare Investigationes, 31(2), 213-225.
Advanced Search
Perform a search by entering your search string in the Query input box and then clicking the Search button. You will first be shown a progress bar. Once the search is finished, concordances and other summary data will be available to view. A number of supporting search options can also be chosen (see below).
Query Terms
Any of the following can be entered as queries.
run | Word search. |
RUN | Lemma search. Use full caps to search for lemmas. |
run/ran | Alternative words. Either 'run' or 'ran' will be matched. |
r?n | Any one character in the ? position. E.g. possible matches are 'run' or 'ran'. |
run* | Character wildcard. The specified string with any number of characters in the * position will be matched. E.g. possible matches of run* include 'run', 'runs', 'running' and 'rune'. If + is used in place of *, one or more characters must appear in that position. |
sign~up | Compounds with or without hyphens. E.g. possible matches include 'signup', 'sign-up' and 'sign up'. |
too many cooks spoil | Phrase search. |
too many * spoil | Phrase containing a wildcard. In this case any or no word can appear in the * position. |
too many + spoil | Phrase containing a wildcard. In this case any word can appear in the + position. |
too many !cooks spoil | Not operator. In this case any word other than 'cooks' can appear in the ! position. |
too many *3 spoil | Phrase containing a multiple wildcard. In this case up to three words can appear in the *3 or +3 position. The maximum length of this wildcard is 9 words. |
carpe diem || seize the day | Alternative query operator. Find different phrases with one search. |
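To make the single-word operators in the table more concrete, the sketch below translates them into Python regular expressions. It is only an approximation for illustration, not WebCorp LSE's query parser: the function name is hypothetical, 'character' is assumed to mean a word character, and lemma search, phrase wildcards, the ! operator and || are not handled.

```python
import re


def word_pattern_to_regex(pattern: str) -> re.Pattern:
    """Approximate a single-word query pattern as a compiled regular expression."""
    alternatives = []
    for alt in pattern.split("/"):            # run/ran: alternative words
        parts = []
        for ch in alt:
            if ch == "?":
                parts.append(r"\w")           # exactly one character
            elif ch == "*":
                parts.append(r"\w*")          # any number of characters
            elif ch == "+":
                parts.append(r"\w+")          # one or more characters
            elif ch == "~":
                parts.append(r"[- ]?")        # hyphen, space or nothing
            else:
                parts.append(re.escape(ch))   # literal character
        alternatives.append("".join(parts))
    return re.compile("(?:" + "|".join(alternatives) + ")")


def matches(pattern: str, text: str) -> bool:
    """True if the whole of `text` is matched by the query pattern."""
    return word_pattern_to_regex(pattern).fullmatch(text) is not None


for query, candidate in [("r?n", "ran"), ("run*", "running"), ("sign~up", "sign-up")]:
    print(query, candidate, matches(query, candidate))
```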
Part-of-Speech Search
Part-of-Speech (POS) tags can be added to queries by placing them inside curly brackets, e.g. noun is {NOUN} (see tagsets below).
record{VERB} | POS tag attached to a word. Only instances of 'record' as the specified POS type (in this case a verb) will be matched. |
record/tape{VERB} | POS tag attached to a pattern. Each word matched by the pattern must also match the specified POS type. |
{VERB} | POS tag only. Any verb will be matched in the {VERB} position. |
The same searches can be performed using Penn Treebank POS tags by specifying Penn= inside double curly brackets. E.g. a past tense verb would be specified with {{Penn=VBD}}.
The dropdown menus below the Query box list the available tags. Selecting a tag from these lists will add it to the query.
Search Options
These options are available on the Advanced Search page and can be used to change the scope of your search:
| Sub-corpus | Choose a sub-corpus from the dropdown list to limit the search to that subsection of the corpus. See the details for each corpus to learn more about the sub-corpora. |
| Case insensitive | Tick this to match case variants of your search words. |
| Within a single sentence | Tick this to prevent phrasal queries from spanning sentence boundaries. |
| Date from and Date to | Enter a date in the form YYYY-MM to limit the range of dates searched. E.g. 2008-08 specifies the year 2008 and month August. |
| Sites filter | Enter website domain names that the search must match. Add a minus (-) in front of a domain to exclude that domain. Substrings of domain names can be used. E.g. .co.uk -.com |
| Word Filter | This enables you to specify words which must or must not appear within the same document as your search term. Place a minus (-) before words which must not appear. E.g. if you search for the word plant you can limit the documents matched by entering nuclear -flower. Only results from documents containing the word 'nuclear' but not the word 'flower' will be returned. |
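The Word Filter syntax can be illustrated with a small sketch that splits a filter string into required and excluded words and tests a document against it. The function names are hypothetical, and case-insensitive matching on whole words is an assumption; the guide does not specify how the filter treats case or word forms.

```python
from typing import Iterable, List, Tuple


def parse_word_filter(text: str) -> Tuple[List[str], List[str]]:
    """Split a filter such as 'nuclear -flower' into (required, excluded) words."""
    required, excluded = [], []
    for term in text.split():
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:])
        else:
            required.append(term)
    return required, excluded


def document_passes(doc_words: Iterable[str], filter_text: str) -> bool:
    """True if the document contains every required word and no excluded word."""
    required, excluded = parse_word_filter(filter_text)
    words = {w.lower() for w in doc_words}
    has_required = all(r.lower() in words for r in required)
    has_excluded = any(e.lower() in words for e in excluded)
    return has_required and not has_excluded


print(parse_word_filter("nuclear -flower"))
print(document_passes(["the", "nuclear", "plant"], "nuclear -flower"))
```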
Advanced Search Results
After starting your search you will be shown a progress bar. Once the search is finished concordances and other summary data will be available in the following tabs:
Concordances
This tab displays concordances in Keyword in Context (KWIC) format. The matches of your query will be displayed down the centre of the page with the left and right textual context on either side. Hovering over a word will show a tooltip containing grammatical information. Clicking on the match will open an extended concordance view in a popup window. Above the concordance table you will find an option to either view randomly sampled concordances (which can be helpful when there are many results) or page through the full set of concordances. The random concordances table can be sorted by clicking on the column names. Options to extend the list of concordances or move between pages are shown at the bottom of the page. Note that the random sample will remain the same each time the page is loaded. Other columns included in the concordances show the date, sub-corpus, website and a link to the webpage, but note that the webpage may have changed since the data was collected.
Sub-corpora
This tab shows how many matches occurred in each of the sub-corpora in a table, including token and document frequencies, normalised frequencies (relative to the size of each sub-corpus) and proportions (relative to the number of search results).
Clicking one of the arrow buttons on the right-hand side of the table creates a copy of the search limited to that sub-corpus.
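The difference between a normalised frequency (relative to the size of each sub-corpus) and a proportion (relative to the number of search results) can be shown with a short worked example. The per-million scale, the sub-corpus names and the figures are assumptions made for illustration only.

```python
from typing import Dict


def subcorpus_summary(matches: Dict[str, int], sizes: Dict[str, int]) -> Dict[str, Dict[str, float]]:
    """Compute a per-million frequency and a share of all results for each sub-corpus."""
    total = sum(matches.values())
    summary = {}
    for name, count in matches.items():
        summary[name] = {
            # Relative to the size of this sub-corpus.
            "per_million": count * 1_000_000 / sizes[name],
            # Relative to the total number of search results.
            "proportion": count / total if total else 0.0,
        }
    return summary


# Invented figures: 'news' has more matches, but 'blogs' has the higher rate.
print(subcorpus_summary({"news": 30, "blogs": 10}, {"news": 3_000_000, "blogs": 500_000}))
```

In this toy case 'news' accounts for three quarters of the results, yet the query is twice as frequent per million words in 'blogs', which is why the two columns are worth reading together.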
Dates
This tab provides frequency information concerning the year and month in which the results occur. A time-series graph shows normalised frequencies per month and a moving average. Mouse over the graph to see more information and drag across the graph to select a date range from which a new search can be constructed.
Below the graph, a table provides information about how many matches occurred in each month, including token and document frequencies, normalised frequencies (relative to the (sub-)corpus size for each month) and proportions (relative to the number of search results).
Sites
This tab shows how many times a website occurred in the results in a table, including token and document frequencies, normalised frequencies (relative to the total frequency of the site in the (sub-)corpus) and proportions (relative to the number of search results).
Selecting sites using the tick boxes on the right-hand side then clicking the arrow button in the table header creates a copy of the search to match only the selected sites.
Match Summary
This provides frequency information for the words and lemmas which matched your query. The matches are shown in a table including token and document frequencies, normalised frequencies (relative to the size of the (sub-)corpus searched) and proportions (relative to the number of search results). Note that it takes time to compile a summary of the matches, so the first time you use this tab you will need to wait for the summary information to be generated.
You can choose between displaying matching words or matching lemmas (with POS tags). Selecting matches using the tick boxes on the right-hand side then clicking the arrow button in the table header creates a new search using only the selected matches.
Collocates
This tab shows a table of collocates. Collocates are words which co-occur in close proximity to a node word (in this case the node being the query matches). The table includes frequency information for each position left and right up to a span of 5 words. It also includes the document frequency (the number of documents in which the co-occurrence within the selected span occurred), left frequency and right frequency (total number of co-occurrences either side of the node), total co-occurrence frequency and collocation score. Note that it takes time to compile a summary of the collocates, so the first time you use this tab you will need to wait for the summary information to be generated.
You can choose between displaying matching words or matching lemmas (with POS), the size of the span, whether left, right or both sides are displayed, exclude stopwords or specify a POS and change the measure by which the collocates are ordered. Selecting collocates using the tick boxes on the right-hand side, then clicking the arrow button in the table header will create a new search containing the selected collocates.
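The per-position frequencies in the collocates table can be sketched by recording the offset of each collocate from the node. This is a simplified illustration with hypothetical names, using negative offsets for positions to the left and positive offsets for positions to the right; it works on a single token list and leaves out document frequency and the collocation score.

```python
from collections import defaultdict
from typing import Dict, List


def positional_collocates(tokens: List[str], node: str, span: int = 5) -> Dict[str, Dict[int, int]]:
    """Map each collocate to a count per offset from the node, up to `span` words away."""
    table: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            # Skip the node itself and positions outside the text.
            if offset != 0 and 0 <= j < len(tokens):
                table[tokens[j]][offset] += 1
    return {word: dict(positions) for word, positions in table.items()}


print(positional_collocates("a b node c b".split(), "node", span=2))
```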
See also
- The corpora available in WebCorp LSE and how they were built
- Research and Development Unit for English Studies
The main part-of-speech (POS) categories are based on the Universal Dependencies framework:
| ADJ | Adjective |
| ADP | Preposition/Postposition |
| ADV | Adverb |
| CCONJ | Coordinating Conjunction |
| DET | Determiner |
| INTJ | Interjection |
| NOUN | Noun |
| NUM | Number |
| PART | Particle |
| PRON | Pronoun |
| PROPN | Proper Noun |
| VERB | Verb |
It is also possible to search for Penn Treebank POS tags through the advanced search system.
| CC | Coordinating conjunction |
| CD | Cardinal number |
| DT | Determiner |
| EX | Existential there |
| FW | Foreign word |
| IN | Preposition or subordinating conjunction |
| JJ | Adjective |
| JJR | Adjective, comparative |
| JJS | Adjective, superlative |
| LS | List item marker |
| MD | Modal |
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NNP | Proper noun, singular |
| NNPS | Proper noun, plural |
| PDT | Predeterminer |
| POS | Possessive ending |
| PRP | Personal pronoun |
| PRP$ | Possessive pronoun |
| RB | Adverb |
| RBR | Adverb, comparative |
| RBS | Adverb, superlative |
| RP | Particle |
| SYM | Symbol |
| TO | to |
| UH | Interjection |
| VB | Verb, base form |
| VBD | Verb, past tense |
| VBG | Verb, gerund or present participle |
| VBN | Verb, past participle |
| VBP | Verb, non-3rd person singular present |
| VBZ | Verb, 3rd person singular present |
| WDT | Wh-determiner |
| WP | Wh-pronoun |
| WP$ | Possessive wh-pronoun |
| WRB | Wh-adverb |
The stopword list used in WebCorp LSE filters out numbers, single characters and the following 'grammatical' words:
'd, 'll, 'm, 're, 's, 've, a, about, above, across, after, against, all, along, alongside, also, although, always, am, amid, amidst, among, amongst, an, and, any, anybody, anyone, anything, anywhere, apropos, are, aren't, arent, around, as, at, atop, be, because, been, before, behind, being, below, beneath, beside, besides, between, beyond, both, but, by, can, can't, cannot, cant, cos, could, couldn't, couldnt, coz, d, dare, daren't, darent, despite, did, didn't, didnt, do, does, doesn't, doesnt, doing, don't, done, dont, dr, during, each, either, else, every, everybody, everyone, everything, everywhere, except, few, for, from, go, going, had, hadn't, hadnt, has, hasn't, hasnt, have, haven't, havent, having, he, he'd, he'll, he's, hed, hell, her, here, hers, herself, hes, him, himself, his, how, however, i, i'd, i'll, i'm, i've, id, if, ill, im, in, inside, into, is, isn't, isnt, it, it'd, it'll, it's, itd, itll, its, itself, ive, less, like, ll, m, make, many, may, mayn't, maynt, me, might, mine, minus, more, most, mr, mrs, much, must, mustn't, mustnt, my, myself, n't, needn't, neednt, neither, never, nevertheless, no, no-one, nobody, none, nonetheless, noone, nor, not, nothing, notwithstanding, nt, of, off, often, on, one, only, or, other, ought, oughtn't, oughtnt, our, ours, ourselves, out, outside, over, part, per, plus, rather, re, s, shall, shan't, shant, she, she'd, she'll, she's, shed, shell, shes, should, shouldn't, shouldnt, since, so, some, somebody, someone, someplace, something, sometime, sometimes, somewhere, t, than, that, that'd, that'll, that's, thatd, thatll, thats, the, thee, their, theirs, them, themselves, then, there, there'd, there'll, there's, there've, thered, therefore, therell, theres, thereve, therewith, these, they, they'd, they'll, they're, they've, theyd, theyll, theyre, theyve, thine, this, those, thou, though, through, throughout, thus, thy, till, to, too, toward, towards, under, underneath, until, up, upon, us, ve, very, via, was, 
wasn't, wasnt, we, we'd, we'll, we're, wed, well, well, were, were, what, what'd, what'll, what's, what've, whatd, whatever, whatll, whats, whatsoever, whatve, when, whenever, where, wherever, whether, which, whichever, while, whilst, who, whom, whose, why, will, with, within, without, won't, wont, would, wouldn't, wouldnt, ye, yeah, yes, you, you'd, you'll, you're, you've, youd, youll, your, youre, yours, yourself, yourselves, youve