How does it work?


The WebCorp interface is similar to the interfaces provided by standard search engines. You enter a word or phrase, choose options from the menus provided and then press the 'Submit' button. WebCorp works 'on top of' the search engine of your choice: it takes the list of URLs returned by that search engine and extracts concordance lines (examples of your chosen word or phrase in context) from each of those pages. All of the concordance lines are presented on a single results page, with links to the sites from which they came.
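The process described above can be illustrated with a minimal Python sketch. This is not WebCorp's actual implementation: the function names (`page_text`, `fetch_page`, `concordance`) and the `span` parameter are hypothetical, and the search engine step is replaced by a list of URLs that the caller supplies.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects the text of an HTML page, skipping scripts and styles."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def page_text(html):
    """Strip markup and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def fetch_page(url):
    """Download one page (assumption: UTF-8; undecodable bytes are replaced)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def concordance(urls, term, fetch=fetch_page, span=30):
    """Return (url, context) pairs for every occurrence of term on every page."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    results = []
    for url in urls:
        try:
            text = page_text(fetch(url))
        except OSError:
            continue  # skip pages that cannot be retrieved
        for match in pattern.finditer(text):
            start = max(0, match.start() - span)
            end = min(len(text), match.end() + span)
            results.append((url, text[start:end]))
    return results
```

Each result keeps the URL of its source page, mirroring the way the results page links each concordance line back to the site it came from. Passing the download function in as `fetch` is a design choice of this sketch so that the extraction logic can be tried on stored pages without any network access.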

How is WebCorp different from search engines?
Search engines, such as Google and AltaVista, are designed to retrieve information from the World Wide Web. They use complex techniques to index the Web and return the documents in their indices that are most relevant to the user's request. WebCorp is designed to retrieve linguistic data from the Web: concordance lines showing the contexts in which the user's search term occurs. In response to a user query, standard search engines return a list of URLs (page addresses), each with a description of the page or some text from it, to help the user decide which pages are most useful. To view the pages, the user must click on each of the links individually.

WebCorp visits each of these pages itself and extracts concordance lines from them. Although some search engines, such as Google, give Key Word in Context (KWIC) style output for some of the URLs in the results list, they do not do so for every URL, and the short extracts do not show every instance of the search term on each page. The search term may occur many times on a given page, but a Google user could not know this without clicking on each of the links manually. Google is an excellent search engine, but it is not designed as a corpus linguistics tool and is not ideal for this purpose. WebCorp offers options specifically designed for linguistic research, such as a customisable concordance span and a choice of output format.
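To make the idea of Key Word in Context output with a customisable span concrete, here is a small Python sketch. It is an illustration only, not WebCorp's own code: the function name `kwic_lines`, the character-based `span` and the bracketed keyword format are all assumptions made for this example.

```python
import re


def kwic_lines(text, term, span=30):
    """Return one aligned KWIC line for every occurrence of term in text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - span):match.start()]
        right = text[match.end():match.end() + span]
        # Right-justify the left context so every keyword starts in the same column.
        lines.append(f"{left:>{span}} [{match.group()}] {right}")
    return lines


sample = ("The corpus was compiled in 1999. A second corpus followed, "
          "and each corpus grew over time.")
for line in kwic_lines(sample, "corpus", span=20):
    print(line)
```

Unlike a search engine snippet, the loop reports every occurrence in the text, and changing `span` widens or narrows the context shown on either side of the keyword.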

Why is WebCorp slow to return results?
The current version of WebCorp is for demonstration purposes, and the speed at which results are returned will increase as the tool is developed further. WebCorp is slower than search engines because, although its interface resembles that of a search engine, its aims and the way it works are very different.

To conduct a full linguistic analysis of how a particular word or phrase is used on the Web without WebCorp, you would have to use a search engine to find a list of pages containing the word or phrase, visit each URL in that list manually, locate every example of the word or phrase on the page and copy the examples into a file. WebCorp automates this whole process, which is why it is slower than a standard search engine, but it still saves a vast amount of time over the equivalent manual process.

© 1999-2008 Research and Development Unit for English Studies