How To Use Web Search Engines

How to get the most from *search engines like *AltaVista, *Infoseek, *Excite, *Webcrawler, *Lycos, HotBot, *Open Text and the *Yahoo Directory.

(starred links are Hyper-Live connections to our Zoom-Inform demo, which provides in-depth information about companies, products and technical terms)

Page 4--How Search Engines Work

Search engines use software robots to survey the Web and build their databases. Web documents are retrieved and indexed.  When you enter a query at a search engine website, your input is checked against the search engine's keyword indices.  The best matches are then returned to you as hits.
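To make the idea of a keyword index concrete, here is a minimal sketch in Python. It is not how any particular search engine is built; the three "pages" and the function names build_index and lookup are invented for illustration, and a real robot would fetch the documents from the Web rather than from a dictionary.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical three-page "Web", standing in for what a robot retrieved.
pages = {
    "page1": "heart disease and diet",
    "page2": "valentine heart candy",
    "page3": "hard disk drive repair",
}
index = build_index(pages)
print(sorted(lookup(index, "heart")))          # ['page1', 'page2']
print(sorted(lookup(index, "heart disease")))  # ['page1']
```

The query is never compared against the pages themselves, only against the index built ahead of time, which is why a search can come back quickly.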

There are two primary methods of text indexing--keyword and concept.

Keyword Indexing

This is the most common form of text indexing on the Web.  Most search engines index by keyword. 

Unless the author of the Web document specifies the keywords for her document (this is possible by using meta tags in the latest version of HTML), it's up to the search engine to determine them.  Essentially, this means that search engines pull out words that are believed to be significant. Words that are mentioned towards the top of a document and words that are repeated several times throughout the document are more likely to be deemed important.

Some sites index every word on every page. Others index only part of the document.  For example, Lycos indexes the title, headings, subheadings and the hyperlinks to other sites, along with the first 20 lines of text and the 100 words that occur most often.
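A rough sketch of partial indexing, loosely modeled on the description above, might look like the following. It is an assumption-laden toy: it covers only the title, the first lines of text and the most frequent words, it leaves out headings and hyperlinks, and the function name partial_index_terms is our own invention, not anything taken from Lycos.

```python
import re
from collections import Counter

def partial_index_terms(title, body, max_lines=20, top_n=100):
    """Collect terms from the title, the first lines, and the most frequent words."""
    def words(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    terms = set(words(title))
    # Every word from the first max_lines lines of the body.
    for line in body.splitlines()[:max_lines]:
        terms.update(words(line))
    # Plus the top_n most frequently occurring words anywhere in the body.
    counts = Counter(words(body))
    terms.update(word for word, _ in counts.most_common(top_n))
    return terms

# Tiny limits so the effect is visible on a three-line document.
body = "cats purr\ndogs bark\ndogs run"
print(sorted(partial_index_terms("Pets", body, max_lines=1, top_n=1)))
# ['cats', 'dogs', 'pets', 'purr']
```

Note that "bark" and "run" never make it into the index, so a query on either word would miss this document entirely.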

Infoseek uses a full-text indexing system, picking up every word in the text except commonly occurring stop words such as "a," "an," "the," "is," "and," "or," and "www."  HotBot also ignores stop words.  Open Text and AltaVista index all words, even the articles, "a," "an," and "the."  Some of the search engines discriminate upper case from lower case; others store all words without reference to capitalization.
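The two choices described here, dropping stop words and ignoring capitalization, can be sketched as switches on a simple word extractor. The stop-word list below is just the one quoted above, and the function name index_terms and its two flags are hypothetical, not settings of any real engine.

```python
import re

# The stop words quoted in the text; real engines use their own lists.
STOP_WORDS = {"a", "an", "the", "is", "and", "or", "www"}

def index_terms(text, drop_stop_words=True, fold_case=True):
    """Return the words of text that would be kept for indexing."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if fold_case:
        # Store every word in lower case, so "Oracle" and "oracle" merge.
        words = [w.lower() for w in words]
    if drop_stop_words:
        words = [w for w in words if w.lower() not in STOP_WORDS]
    return words

print(index_terms("The Oracle is an oracle"))
# ['oracle', 'oracle']
print(index_terms("The Oracle is an oracle", drop_stop_words=False, fold_case=False))
# ['The', 'Oracle', 'is', 'an', 'oracle']
```

With both switches on, the index is smaller, but a search for the company Oracle can no longer be told apart from a search for a fortune-teller.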

The Problem With Keyword Indexing

Keyword searches have a tough time distinguishing between words that are spelled the same way but mean something different (e.g., hard cider, a hard stone, a hard exam, and the hard drive on your computer). This often results in hits that are completely irrelevant to your query. Some search engines also have trouble with so-called stemming--i.e., if you enter the word "big," should they return a hit on the word "bigger"? What about singular and plural words? What about verb tenses that differ from the word you entered by only an "s" or an "ed"?
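To show what stemming involves, here is a deliberately crude suffix-stripping sketch. It is our own toy, not the stemmer of any search engine, and it mangles plenty of ordinary English words; serious stemmers use far more careful rules.

```python
def naive_stem(word):
    """Strip one common suffix so related word forms share a stem (very crude)."""
    word = word.lower()
    for suffix in ("ing", "ed", "er", "es", "s"):
        # Only strip if at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "bigg" -> "big".
            # Doubled l and s are left alone so "telling" keeps the stem "tell".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for w in ("big", "bigger", "search", "searches", "searched", "searching"):
    print(w, "->", naive_stem(w))
```

If the engine applies the same function both to the words it indexes and to the words in your query, "big" and "bigger" land on the same index entry. Whether that is what you wanted is another matter.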

Keyword search engines also cannot return hits on words that mean the same thing as your search terms but do not actually appear in your query. A query on heart disease would not return a document that used the word "cardiac" instead of "heart."

Concept-Based Indexing

Unlike keyword-based indexing systems, concept-based indexing systems try to determine what you mean, not just what you say.  Essentially, they do this by checking documents for the dominant themes or concepts, which are then indexed. A concept-based search returns hits on documents that are about the subject/theme you're exploring, even if the words in the document don't precisely match the words you enter into the query.  

Excite is currently the best-known general-purpose search engine site on the Web that relies on concept-based searching.

How does it work?  There are various methods of building concept-based indices, some of which are highly complex, relying on sophisticated linguistic and artificial intelligence theory that we won't even attempt to go into here.  Excite sticks to a numerical approach.  Excite's software determines meaning by calculating the frequency with which certain important words appear.  When several words or phrases that are tagged to signal a particular concept appear close to each other in a text, the search engine concludes, by statistical analysis, that the piece is "about" a certain subject.

For example, the word heart, when used in the medical/health context, would be likely to appear with such words as coronary, artery, lung, stroke, cholesterol, pump, blood, attack, and arteriosclerosis.  If the word heart appears in a document with other words such as flowers, candy, love, passion, and valentine, a very different context is established, and the search engine returns hits on the subject of romance.
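The heart example can be turned into a toy program. This is only a sketch of the general idea of counting companion words; it is not Excite's actual method, and the concept labels "medicine" and "romance", the name guess_concept and the hand-made word lists are all our own assumptions.

```python
import re

# Hand-made cue words taken from the example above; a real system would
# derive such associations statistically from a large body of text.
CONCEPT_WORDS = {
    "medicine": {"coronary", "artery", "lung", "stroke", "cholesterol",
                 "pump", "blood", "attack", "arteriosclerosis"},
    "romance": {"flowers", "candy", "love", "passion", "valentine"},
}

def guess_concept(text):
    """Return the concept whose cue words appear most often in text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {name: len(words & cues) for name, cues in CONCEPT_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_concept("A heart attack can follow when cholesterol blocks a coronary artery"))
# medicine
print(guess_concept("Give your heart, flowers and candy to your valentine"))
# romance
```

The word "heart" itself plays no part in the decision; it is the company it keeps that settles the subject.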

Warning: This often works better in theory than in practice. Concept-based indexing is a good idea, but it's far from perfect.  The results are best when you enter a lot of words, all of which roughly refer to the concept you're seeking information about.

Try it!

Here's an example of a concept-based query.  Jump to Excite and enter the phrase "cyber love sex and relationships" (don't use the quotation marks). You will get back a lot of documents about love and romance online, even if they don't contain the precise words in your query. On the keyword search engines, you will also get hits, but they will be limited to those that do contain the precise words of your query.

Refining Your Search

Most sites offer two different types of searches--"basic" and "refined."   In a "basic" search, you just enter a keyword without sifting through any pulldown menus of additional options.  Depending on the engine, though, "basic" searches can be quite complex.

Search refining options differ from one search engine to another, but some of the possibilities include the ability to search on more than one word, to give more weight to one search term than you give to another, and to exclude words that might be likely to muddy the results.  You might also be able to search on proper names, on phrases, and on words that are found within a certain proximity to other search terms.  

Some search engines also allow you to specify what form you'd like your results to appear in, and whether you wish to restrict your search to certain fields on the Internet (e.g., Usenet or the Web) or to specific parts of Web documents (e.g., the title or URL).

Many, but not all, search engines allow you to use so-called Boolean operators to refine your search. These are the logical terms AND, OR, NOT, and the so-called proximal locators, NEAR and FOLLOWED BY.

Boolean AND means that all the terms you specify must appear in the document, e.g., "heart" AND "attack."  You might use this if you wanted to exclude common hits that would be irrelevant to your query. 

Boolean OR means that at least one of the terms you specify must appear in the document, e.g., bronchitis, acute OR chronic.  You might use this if you didn't want to rule out too much.

Boolean NOT means that the term that follows it must not appear in the document. You might use this if you anticipated results that would be totally off-base, e.g., nirvana AND Buddhism, NOT rock, NOT music.

Not quite Boolean + and -: Some search engines use the characters + and - instead of Boolean operators to include and exclude terms.
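Underneath, AND, OR and NOT (or + and -) come down to set operations on the lists of documents stored for each keyword. Here is a minimal sketch; the tiny index, the document ids d1 to d4 and the function boolean_search are all made up for illustration.

```python
def boolean_search(index, all_docs, must=(), any_of=(), must_not=()):
    """Combine keyword hit sets with AND (must), OR (any_of) and NOT (must_not)."""
    hits = set(all_docs)
    for term in must:                 # AND: keep only documents with every term
        hits &= index.get(term, set())
    if any_of:                        # OR: at least one of these terms
        either = set()
        for term in any_of:
            either |= index.get(term, set())
        hits &= either
    for term in must_not:             # NOT: throw out documents with these terms
        hits -= index.get(term, set())
    return hits

# Hypothetical keyword index: word -> ids of the documents containing it.
index = {
    "nirvana": {"d1", "d2"}, "buddhism": {"d1"},
    "rock": {"d2"}, "music": {"d2"},
    "bronchitis": {"d3", "d4"}, "acute": {"d3"}, "chronic": {"d4"},
}
all_docs = {"d1", "d2", "d3", "d4"}

# nirvana AND buddhism, NOT rock, NOT music
print(sorted(boolean_search(index, all_docs,
                            must=["nirvana", "buddhism"],
                            must_not=["rock", "music"])))   # ['d1']
# bronchitis, acute OR chronic
print(sorted(boolean_search(index, all_docs,
                            must=["bronchitis"],
                            any_of=["acute", "chronic"])))  # ['d3', 'd4']
```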

NEAR means that the terms you enter should be within a certain number of words of each other.  FOLLOWED BY means that one term must directly follow the other. ADJ, for adjacent, serves the same function. A search engine that will allow you to search on phrases usually uses, essentially, the same method.
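Proximity operators need to know where each word sits in the document, not just whether it is there. The sketch below works that out from word positions. The default distance of 10 words is purely our assumption (each engine picks its own number), and the helper names are invented.

```python
def positions(words, term):
    """Return every index at which term occurs in the list of words."""
    return [i for i, w in enumerate(words) if w == term]

def near(text, a, b, max_distance=10):
    """True if a and b occur within max_distance words of each other."""
    words = text.lower().split()
    return any(i != j and abs(i - j) <= max_distance
               for i in positions(words, a)
               for j in positions(words, b))

def followed_by(text, a, b):
    """True if b occurs immediately after a somewhere in the text."""
    words = text.lower().split()
    b_positions = set(positions(words, b))
    return any(i + 1 in b_positions for i in positions(words, a))

text = "the hard drive failed after a hard exam"
print(followed_by(text, "hard", "drive"))            # True
print(followed_by(text, "drive", "hard"))            # False
print(near(text, "drive", "exam", max_distance=3))   # False
print(near(text, "drive", "exam", max_distance=5))   # True
```

A phrase search can be built the same way, by requiring each word of the phrase to be followed directly by the next one.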

Phrases: The ability to query on phrases is very important in a search engine. Those that allow it usually require that you enclose the phrase in quotation marks, e.g., "space, the final frontier."

Capitalization:  This is essential for searching on proper names of people, companies or products. Unfortunately, many words in English are used both as proper and common nouns--Bill, bill, Gates, gates, Oracle, oracle, Lotus, lotus, Digital, digital--the list is endless.

All the search engines have different methods of refining queries.  The best way to learn them is to read the help files on the search engine sites and practice!

Here are some links to the help files that Spidap finds most useful:

  • Infoseek Search Tips
  • AltaVista Simple Search Help
  • AltaVista Advanced Search Help
  • Excite's Files on Search Refining

Relevancy Rankings

    Most of the search engines return results with confidence or relevancy rankings.  In other words, they list the hits according to how closely they think the results match the query.  However, these lists often leave users shaking their heads in confusion, since, to the user, the results often seem completely irrelevant.

    Why does this happen?  Basically it's because search engine technology has not yet reached the point where humans and computers understand each other well enough to communicate clearly.

    Most search engines use search term frequency as a primary way of determining whether a document is relevant.  If you're researching diabetes and the word "diabetes" appears multiple times in a Web document, it's reasonable to assume that the document will contain useful information.  Therefore, a document that repeats the word "diabetes" over and over is likely to turn up near the top of your list.

    If your keyword is a common one, or if it has multiple other meanings, you could end up with a lot of irrelevant hits.  And if your keyword is a subject about which you desire information, you don't need to see it repeated over and over--it's the information about that word that you're interested in, not the word itself.

    Some search engines consider both the frequency and the positioning of keywords to determine relevancy, reasoning that if the keywords appear early in the document, or in the headers, this increases the likelihood that the document is on target.  For example, Lycos ranks hits according to how many times your keywords appear in their indices of the document and in which fields they appear (i.e., in headers, titles or text).  It also takes into consideration whether the documents that emerge as hits are frequently linked to by other documents on the Web, reasoning that if other folks consider them important, you should, too.
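Here is a small sketch of ranking by frequency and position together. The bonus values and the 50-word cutoff for "early in the document" are arbitrary numbers we picked for illustration; no search engine publishes its formula, and this one is certainly not Lycos's.

```python
import re

def relevance(title, body, keyword, title_bonus=5, early_bonus=2, early_words=50):
    """Score a document: keyword count plus bonuses for title and early position."""
    keyword = keyword.lower()
    title_words = re.findall(r"[a-z0-9]+", title.lower())
    body_words = re.findall(r"[a-z0-9]+", body.lower())
    score = body_words.count(keyword)            # frequency in the text
    if keyword in title_words:                   # field: appears in the title
        score += title_bonus
    if keyword in body_words[:early_words]:      # position: appears near the top
        score += early_bonus
    return score

# Two hypothetical documents.
docs = {
    "a": ("Diabetes Guide", "Diabetes is managed with diet. Diabetes care matters."),
    "b": ("Health News", "Many topics today. One is diabetes."),
}
ranked = sorted(docs, key=lambda d: relevance(*docs[d], "diabetes"), reverse=True)
print(ranked)  # ['a', 'b']
```

Document "a" scores 9 and document "b" scores 3, so "a" is listed first. Notice that a page could push itself up the list just by repeating the keyword, which is one reason rankings can look odd to the user.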

    If you use the advanced query form on AltaVista, you can assign relevance weights to your query terms before conducting a search.  Although this takes some practice, it essentially allows you to have a stronger say in what results you will get back.
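The effect of weighting query terms can be pictured with a simple weighted count. This is only an illustration of the principle; the function weighted_score and its formula are our own assumptions and do not reproduce what AltaVista actually computes.

```python
def weighted_score(text, term_weights):
    """Sum each term's occurrence count multiplied by the weight given to it."""
    words = text.lower().split()
    return sum(weight * words.count(term) for term, weight in term_weights.items())

text = "heart attack heart surgery"
# The searcher cares three times as much about "surgery" as about "heart".
print(weighted_score(text, {"heart": 1, "surgery": 3}))  # 5
```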

    As far as the user is concerned, relevancy ranking is critical, and becomes more so as the sheer volume of information on the Web grows.  Most of us don't have the time to sift through scores of hits to determine which hyperlinks we should actually explore. The more clearly relevant the results are, the more we're likely to value the search engine.

    New Information On Meta Tags

    Some search engines are now indexing Web documents by the meta tags in the documents' HTML (at the beginning of the document in the so-called "head" tag). What this means is that the Web page author can have some influence over which keywords are used to index the document, and even over the description of the document that appears when it comes up as a search engine hit. Spidap recommends that Web page authors make use of the meta tags.
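For the curious, here is a sketch of how a robot might read the meta tags out of a page, using the HTMLParser class from Python's standard library. The sample page and the class name MetaTagReader are invented for illustration; how a given search engine actually uses the keywords and description is up to that engine.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect the name/content pairs of <meta> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

# A made-up page whose author has supplied keywords and a description.
page = """<html><head>
<title>Search Tips</title>
<meta name="keywords" content="search engines, indexing, Boolean">
<meta name="description" content="How search engines index and rank pages.">
</head><body>...</body></html>"""

reader = MetaTagReader()
reader.feed(page)
keywords = [k.strip() for k in reader.meta.get("keywords", "").split(",")]
print(keywords)                    # ['search engines', 'indexing', 'Boolean']
print(reader.meta["description"])  # How search engines index and rank pages.
```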

    What does it all mean?

    You now know more than you probably ever wanted to know about indexing, query refining and relevancy ranking.  How do we put it all together to make Web searching easier and more efficient than it currently is?

    Let's try some practical applications.  It's time for:

    The Web-Searching Wizard

    If you're interested in software companies and products, try our new technology demo, SmartScape (a.k.a. Zoom-Inform).


    The Spider's Apprentice was conceived and written by Linda Barlow, who maintains this site for Monash Information Services. Copyright, 1996-7. All rights reserved.
    Updated: 2/12/97

    E-mail: lindabarlow@monash.com