What are Search Engines?
"Search engine" has become a generic term used to
refer to both crawler-based (Google) and directory-based (Yahoo!) site-search
companies. The two types produce their listings in significantly different
ways and can therefore produce widely varying results. Crawler-based search engines
create listings automatically by crawling websites and compiling information about each
site into an index. When one of these indices is searched, results
are presented with emphasis on specific words. The methodology that
determines which results a search returns, and in what order, is known as an
algorithm, and these algorithms are proprietary and confidential to each search
engine company.
Crawling a Site
Search engines that use bots or spiders to crawl a site create
references automatically. People typing in "key" words find those words
in the site's list of referenced words, and the crawler-based engines serve the
pages where those words are referenced.
These engines like to see site content change on a
frequent basis, since changes are taken as an indication of new and revised
information. An engine that uses bots or spiders to crawl a site eventually finds
these changes, and that can affect your ranking in its listings. How you lay
out your pages matters: page titles, your page content, links to your site, and
other items are all important in determining how relevant your pages rank for the words
being referenced.
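To make this concrete, here is a minimal sketch of a same-site crawler in Python, using only the standard library. The start URL is a placeholder, and real crawlers add politeness rules (robots.txt, rate limits, revisit schedules) that are omitted here.

# A minimal same-site crawler sketch: fetch a page, extract its links,
# and queue pages on the same host for a later visit.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    site = urlparse(start_url).netloc
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="replace")
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == site:  # stay within the site
                queue.append(absolute)
    return seen

# crawl("https://example.com/")  # placeholder start URL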
Crawler-Based Search Engines Dissected
The first of three major components of a crawler-based search engine
is the spider (or bot), known as the crawler. A bot visits a site, reads the
main words while eliminating connector words, and follows all the links it can
find to other pages within the site. If you change one of your web pages,
search engines eventually find those changes, which can affect how you are
listed. Bots return to sites on a schedule known only to the search engine
company, but more frequently to sites that continually update their content.
The words found by the bots are placed in a database called an
index or catalog, the second component of a crawler-based search
engine. This is where the keywords for each indexed page
are searched to produce a results page and determine the order in which results
are displayed. If a page is changed, the index is updated with the
new information the next time a bot visits the site.
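One way to picture the index is as an inverted index: a map from each word to the pages containing it. The Python sketch below is a toy version; the update logic mirrors the re-indexing described above, and every name in it is invented for illustration.

# A toy inverted index: word -> set of page URLs. Re-indexing a changed
# page first removes its stale entries, as happens when a bot revisits.
from collections import defaultdict

index = defaultdict(set)   # word -> pages containing it
page_words = {}            # page -> words currently indexed for it

def index_page(url, text):
    for word in page_words.get(url, set()):
        index[word].discard(url)          # drop stale entries
    words = set(text.lower().split())
    page_words[url] = words
    for word in words:
        index[word].add(url)

index_page("/a", "search engines crawl the web")
index_page("/a", "search engines index the web")  # the page changed
print(sorted(index["crawl"]))  # [] -- stale entry removed
print(sorted(index["index"]))  # ['/a']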
New sites and new pages can take some time before a spider finds
and adds them to the index, and depending on how the site is constructed
and whether or not a spider can "read" a page or the links to it, a
page may never be found, read, or indexed. Pages are also not added
to the index the moment they have been crawled; it may take some time to get
from the read stage to the index, and until content is added to the index it
is not available to searchers.
Search software may be the component most important to the user
in producing relevant results from the indexed content. The software is a
program that searches through billions of pages of indexed content, finding
matches and determining their relevance to the terms searched. Each search engine
handles this task differently, which is why each may produce varying results:
you may rank near the top on one engine and be nearly invisible on another.
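Because the actual algorithms are confidential, the toy Python comparison below only illustrates the principle: two invented scoring functions rank the same pages differently, just as two engines with differently tuned algorithms would.

# Two invented scorers over the same pages produce different rankings.
def score_tf(page, terms):
    words = page["text"].lower().split()
    return sum(words.count(t) for t in terms)

def score_title_boost(page, terms):
    # Same counts, but a match in the title is worth five body matches.
    title_hits = sum(page["title"].lower().split().count(t) for t in terms)
    return score_tf(page, terms) + 5 * title_hits

pages = [
    {"title": "Gardening tips", "text": "soil soil soil plants"},
    {"title": "Soil basics", "text": "plants and compost"},
]
for scorer in (score_tf, score_title_boost):
    ranked = sorted(pages, key=lambda p: scorer(p, ["soil"]), reverse=True)
    print(scorer.__name__, "->", [p["title"] for p in ranked])
# score_tf ranks "Gardening tips" first; score_title_boost flips the order.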
Internet
search engines are special sites on the Web that are designed to help people
find information stored on other sites. They accept queries supplied by web
users and return a list of resources that best match each query. There
are differences in the ways various search engines work, but they all perform
three basic tasks:
- They search the Internet -- or
select pieces of the Internet -- based on important words.
- They keep an index of the words
they find, and where they find them.
- They allow users to look for
words or combinations of words found in that index.
Early
search engines held an index of a few hundred thousand pages and documents, and
received maybe one or two thousand inquiries each day. Today, a top search
engine will index hundreds of millions of pages and respond to tens of
millions of queries per day.
A search engine is a coordinated set of programs
that includes:
- A spider (also called a "crawler" or a "bot") that goes to every page, or representative pages, on every Web site that wants to be searchable and reads it, using the hypertext links on each page to discover and read a site's other pages
- A program that creates a huge index (sometimes called a "catalog") from the pages that have been read
- A program that receives your search request, compares it to the entries in the index, and returns results to you
Major
components of crawler-based search engines
Crawler-based search
engines have three major components.
1) The crawler: Also
called the spider. The spider visits a web page, reads it, and then follows
links to other pages within the site. The spider will return to the site on a
regular basis, such as every month or every fifteen days, to look for changes.
2) The index: Everything
the spider finds goes into the second part of the search engine, the index. The
index will contain a copy of every web page that the spider finds. If a web
page changes, then the index is updated with new information.
3) The search engine
software: This is the program that accepts the user-entered query,
interprets it, sifts through the millions of pages recorded in the index to
find matches, ranks them in order of what it believes is most relevant, and
presents them in a customizable manner to the user.
All crawler-based search
engines have the basic parts described above, but there are differences in how
these parts are tuned. That is why the same search on different search engines
often produces different results. Comparisons between engines therefore come
down to differences in these three parts.
All search engines contain the following main components:

Component | Description
Spider | A browser-like program that downloads web pages
Crawler | A program that automatically follows all of the links on each web page
Indexer | A program that analyzes web pages downloaded by the spider and the crawler
Database | Storage for downloaded and processed pages
Results engine | Extracts search results from the database
Web server | A server responsible for interaction between the user and the other search engine components
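The components in this table can be wired into a single toy pipeline. The Python sketch below stubs out downloading with an in-memory "web" so it is self-contained; every name is invented for illustration, and the web server component is left out.

# Spider downloads, crawler enumerates pages, indexer fills the database,
# and the results engine extracts matches -- all against a stubbed "web".
from collections import defaultdict

WEB = {  # stand-in for pages the spider would really download
    "/home": "welcome to our search engine article",
    "/about": "search engines use a spider a crawler and an indexer",
}

def spider(url):                 # downloads a web page
    return WEB[url]

def crawler():                   # follows links; here it just lists pages
    return list(WEB)

def indexer(url, text, db):      # analyzes a downloaded page into storage
    for word in text.lower().split():
        db[word].add(url)

def results_engine(query, db):   # extracts search results from the database
    return sorted(db.get(query.lower(), set()))

database = defaultdict(set)
for url in crawler():
    indexer(url, spider(url), database)
print(results_engine("spider", database))  # ['/about']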
Different types of search engines
When people mention the
term "search engine", it is often used generically to describe both crawler-based
search engines and human-powered directories. However, these two types of search
engines gather their listings in radically different ways and are therefore
inherently different.
Crawler-based search
engines create their listings
automatically by using a piece of software to “crawl” or “spider” the web and
then indexing what it finds to build the search base. Web page changes can be
caught dynamically by crawler-based search engines and will affect how those
web pages get listed in the search results. Crawler-based search engines
use automated software agents (called crawlers) that visit a Web
site, read the information on the actual site, read the site's meta tags, and
follow the links that the site connects to, indexing all
linked Web sites as well. The crawler returns all that information to a
central repository, where the data is indexed. The crawler periodically
returns to the sites to check for any information that has changed; the
frequency with which this happens is determined by the administrators of the
search engine.
Crawler-based search
engines are good when you have a specific search topic in mind and can be very
efficient at finding relevant information in that situation. However, when the
search topic is general, crawler-based search engines may return hundreds of
thousands of irrelevant responses to simple search requests, including lengthy
documents in which your keyword appears only once.
Human-powered
directories, such as the Yahoo directory, Open
Directory and LookSmart, depend on human editors to create their listings.
Typically, webmasters submit a short description to the directory for their
websites, or editors write one for the sites they review, and these manually
edited descriptions will form the search base. Therefore, changes made to
individual web pages will have no effect on how these pages get listed in the
search results. Human-powered search engines rely on humans to submit
information that is subsequently indexed and catalogued. Only information that
is submitted is put into the index.
Human-powered directories
are good when you are interested in a general topic of search. In this situation,
a directory can guide and help you narrow your search and get refined results.
Therefore, search results found in a human-powered directory are usually more
relevant to the search topic and more accurate. However, this is not an
efficient way to find information when a specific search topic is in mind.
Meta-search engines, such as Dogpile, Mamma, and Metacrawler, transmit user-supplied keywords simultaneously to several
individual search engines to actually carry out the search. Search results
returned from all the search engines can be integrated, duplicates can be
eliminated and additional features such as clustering by subjects within the
search results can be implemented by meta-search engines.
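As a sketch of this fan-out-and-merge idea, the Python below sends one query to several "engines" and removes duplicates from the combined results. The per-engine functions are stand-ins returning canned URLs; a real meta-search engine would call each engine's live interface.

# Fan a query out to several engines, merge results, drop duplicates.
def engine_a(query):
    return ["example.com/1", "example.com/2"]

def engine_b(query):
    return ["example.com/2", "example.org/3"]

def meta_search(query, engines):
    merged, seen = [], set()
    for engine in engines:
        for url in engine(query):
            if url not in seen:   # eliminate duplicates across engines
                seen.add(url)
                merged.append(url)
    return merged

print(meta_search("search engines", [engine_a, engine_b]))
# ['example.com/1', 'example.com/2', 'example.org/3']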
Meta-search engines are
good for saving time by searching in only one place and sparing the need to use
and learn several separate search engines. But since meta-search engines
do not allow for input of many search variables, their best use is to find hits
on obscure items or to see if something can be found using the Internet.
Web Search Engines
Typically, Web search engines
work by sending out a spider to fetch as many documents as possible. Another program, called
an indexer, then reads these documents and creates
an index based on the words contained in each document. Each search
engine uses a proprietary algorithm to create its indices such that, ideally, only meaningful
results are returned for each query.
Definition of: browser rendering engine
Software that renders HTML pages (Web pages). It turns the HTML layout tags in the page into the appropriate commands for the operating system, which causes the formation of the text characters and images for screen and printer. Also called a "layout engine," a rendering engine is used by a Web browser to render HTML pages, by mail programs that render HTML e-mail messages, and by any other application that needs to render Web page content. For example, Trident is the rendering engine for Microsoft's Internet Explorer, and Gecko is the engine in Firefox. Trident and Gecko are also incorporated into other browsers and applications. Following is a sampling of browsers and their rendering engines.
Browser | Rendering Engine | Source
Internet Explorer | Trident | Microsoft
AOL Explorer | Trident | Microsoft
Firefox | Gecko | Mozilla
Netscape | Gecko | Mozilla
Safari | WebKit | WebKit
Chrome | WebKit | WebKit
Opera | Presto | Opera
Konqueror | KHTML | KHTML
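To give a feel for what "turning layout tags into commands" means, here is a deliberately tiny Python sketch that walks HTML and prints stand-in draw commands. It is a toy illustration only, nothing like how Trident, Gecko, or WebKit actually work.

# A toy "layout engine": translate HTML tags into indented draw commands.
from html.parser import HTMLParser

class ToyRenderer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + f"BEGIN <{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        print("  " * self.depth + f"END </{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:
            print("  " * self.depth + f"DRAW TEXT: {text!r}")

ToyRenderer().feed("<html><body><h1>Hello</h1><p>A <b>bold</b> word.</p></body></html>")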
Building a Search
Searching through an index involves a user building
a query and submitting it through the search engine. The query
can be quite simple, a single word at minimum. Building a more complex query
requires the use of Boolean
operators that allow you to refine and extend the terms of the
search.
The Boolean operators most often seen are:
- AND: All the terms joined by "AND" must appear in the pages or documents. Some search engines substitute the operator "+" for the word AND.
- OR: At least one of the terms joined by "OR" must appear in the pages or documents.
- NOT: The term or terms following "NOT" must not appear in the pages or documents. Some search engines substitute the operator "-" for the word NOT.
- FOLLOWED BY: One of the terms must be directly followed by the other.
- NEAR: One of the terms must be within a specified number of words of the other.
- Quotation marks: The words between the quotation marks are treated as a phrase, and that phrase must be found within the document or file.
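These operators map naturally onto set operations over an inverted index. The Python sketch below evaluates AND, OR, NOT, and NEAR over two hard-coded documents; it stores word positions so proximity can be checked, and all names are invented for illustration.

# Boolean operators as set operations over a positional inverted index.
docs = {
    "d1": "search engines crawl the web",
    "d2": "the web is searched by engines",
}
positions = {d: {w: [i for i, x in enumerate(t.split()) if x == w]
                 for w in set(t.split())}
             for d, t in docs.items()}

def having(word):  # documents that contain the word
    return {d for d in docs if word in positions[d]}

print(sorted(having("web") & having("engines")))  # AND  -> ['d1', 'd2']
print(sorted(having("web") | having("crawl")))    # OR   -> ['d1', 'd2']
print(sorted(having("web") - having("crawl")))    # NOT  -> ['d2']

def near(a, b, k):  # NEAR: terms within k words of each other
    return {d for d in docs
            if a in positions[d] and b in positions[d]
            and any(abs(i - j) <= k
                    for i in positions[d][a] for j in positions[d][b])}

print(sorted(near("engines", "web", 3)))  # -> ['d1']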