WebCorp - the web as corpus
Improvements to WebCorp: A tailored linguistic search engine for accessing the web as corpus

We are currently developing a fully tailored linguistic search engine that will significantly improve speed and widen the range of processing options available.

The current version of WebCorp relies on standard web search engines such as Google and AltaVista, adding layers of refinement specifically for linguistic analysis. This process is inherently slower than a search of a fully indexed and pre-processed database or corpus and, as WebCorp has grown to serve hundreds of thousands of users worldwide, speed has become more of an issue.

The new version of WebCorp is powered by our own search engine, developed at Birmingham City University. Our specially designed web crawler, parser, tokeniser, indexer and other components allow us to cache and process large sections of the web, updating this corpus at regular intervals.
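
To illustrate why a pre-built index is faster than querying pages live, the sketch below shows the tokeniser and indexer stages in miniature. It is not WebCorp's implementation: the function names (tokenise, build_index, lookup), the token pattern and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical token pattern: word characters, with internal
# apostrophes or hyphens kept inside the token.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*")


def tokenise(text):
    """Return lower-cased word tokens in document order."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def build_index(docs):
    """Map each token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenise(text)):
            index[tok].append((doc_id, pos))
    return index


def lookup(index, word):
    """Return the postings for a word, or an empty list if unseen."""
    return index.get(word.lower(), [])


# Illustrative cached pages (invented for this example).
docs = {
    "a": "The web as corpus.",
    "b": "A corpus of web pages, cached.",
}
index = build_index(docs)
hits = lookup(index, "Corpus")
```

Once the index exists, a query is a single dictionary lookup rather than a fresh download and scan of each page, which is the speed gain the paragraph above describes.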

The new architecture has allowed us to enhance the sentence boundary detection, date identification, 'junk' (or 'boilerplate') removal, collocation and other statistical analysis options currently available in WebCorp. Additional pre-processing includes grammatical tagging and language detection, and the new engine also supports full pattern matching and wildcard search.
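
As a rough picture of how wildcard search and collocation analysis fit together, the sketch below matches a shell-style pattern against a token list and counts the words found near each match. It is an assumption-laden illustration, not WebCorp's method: the function names, the default span of four words and the sample sentence are invented here, and raw counts stand in for the statistical measures a real tool would apply.

```python
import fnmatch
from collections import Counter


def wildcard_matches(tokens, pattern):
    """Return positions of tokens matching a pattern such as 'corp*'."""
    return [i for i, tok in enumerate(tokens)
            if fnmatch.fnmatchcase(tok, pattern)]


def collocates(tokens, pattern, span=4):
    """Count words occurring within `span` tokens of each match."""
    counts = Counter()
    for i in wildcard_matches(tokens, pattern):
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


# Illustrative token list (invented for this example).
tokens = "the web as corpus and the corpora of the web".split()
positions = wildcard_matches(tokens, "corp*")
neighbours = collocates(tokens, "corp*", span=2)
```

Ranking the resulting counts, for instance with neighbours.most_common(), gives a simple list of candidate collocates for the search pattern.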

The new WebCorp Linguist's Search Engine is currently being tested by members of the corpus linguistics community.


© 1999-2007 Research and Development Unit for English Studies