How Do Search Engines Work?

 

Search engines are generally assumed to start their crawl from seed sites: websites that have been manually identified as highly authoritative and trusted, such as yahoo.com, Microsoft.com and adobe.com. Search engines follow the links on these sites to find other web documents (web pages, images, videos, Word documents, PDF files etc.) on the web. When a web document is found, search engines crawl it (i.e. fetch it and parse its code) and then, if it is worth indexing, index it (i.e. store certain parts of the web document on hard drives located at different data centers around the world). There are also different levels of indexing. Your document can be stored either permanently or temporarily in the main index. It can also be stored in the supplemental (or secondary) index or in a specialized index (such as blog search, image search or product search).
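The crawl-then-index flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy web `TOY_WEB`, the `.example` URLs and the "non-empty text" test for "worth indexing" are all made up for this sketch, and real crawlers are far more involved.

```python
from collections import deque

# Hypothetical toy web: URL -> (page text, outgoing links). Not real data.
TOY_WEB = {
    "seed.example": ("trusted seed page", ["a.example", "b.example"]),
    "a.example": ("history of the taj mahal", ["b.example", "c.example"]),
    "b.example": ("", ["seed.example"]),  # empty page: crawled but not indexed
    "c.example": ("taj mahal visiting hours", []),
}

def crawl_and_index(seeds, web):
    """Breadth-first crawl from the seed URLs; return (crawl order, index)."""
    frontier = deque(seeds)   # URLs waiting to be crawled
    seen = set(seeds)         # URLs already queued, to avoid crawling twice
    crawled = []
    index = {}
    while frontier:
        url = frontier.popleft()
        if url not in web:
            continue  # link points to a document we cannot fetch
        text, links = web[url]
        crawled.append(url)
        if text.strip():  # assumed stand-in for "worth indexing"
            index[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled, index

crawled, index = crawl_and_index(["seed.example"], TOY_WEB)
print(crawled)        # every document that was crawled
print(sorted(index))  # only the documents that were also indexed
```

Note that `b.example` is crawled but never stored, which is the distinction between crawling and indexing.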

 

Here is one thing to keep in mind: crawling and indexing are different processes. A search engine that crawls a web document will not necessarily index it. Search engines don’t index spam documents or documents that are duplicates of, or very similar to, other documents. They are believed to use a link analysis technique known as ‘TrustRank’ to separate useful web pages from spam. Search engines are also thought to store semantically connected documents together in an index (database) for faster retrieval later. This post calls that index the ‘Latent Semantic Index’ (LSI), although the abbreviation more commonly stands for Latent Semantic Indexing, a technique for finding such connections rather than a database. I will talk about semantic connectivity in detail later in the post.
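To make the "duplicate or very similar" check concrete, here is a minimal sketch using word shingles and Jaccard similarity. This is one common textbook approach, not necessarily what any search engine actually runs; the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a, doc_b, threshold=0.5):
    """Treat two documents as near-duplicates if their shingle sets overlap enough."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold

original = "the taj mahal is in agra india"
copy = "the taj mahal is in agra india today"
other = "statue of liberty stands in new york"

print(is_near_duplicate(original, copy))   # near-identical wording
print(is_near_duplicate(original, other))  # unrelated wording
```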

 

Now when a user makes a search query, search engines first retrieve all the documents that are relevant to the query from the LSI and then sort them in decreasing order of importance. Both the relevance and the importance of a web document are determined through a method known as ‘Document Analysis’, which consists of:

1. Semantic Analysis

2. Link Analysis (or Citation Analysis)
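The two-step idea (first filter by relevance, then sort by importance) can be sketched as follows. The documents in `DOCS` and their importance scores are invented for illustration, and "contains every query word" is an assumed, deliberately crude stand-in for real relevance scoring.

```python
# Hypothetical index: document name -> (text, importance score from link analysis)
DOCS = {
    "doc1": ("taj mahal history and architecture", 0.9),
    "doc2": ("taj mahal ticket prices", 0.4),
    "doc3": ("statue of liberty history", 0.7),
}

def search(query, docs):
    """Return names of relevant documents, most important first."""
    terms = set(query.lower().split())
    # Step 1: relevance filter. Keep documents that contain every query term.
    relevant = [
        (name, importance)
        for name, (text, importance) in docs.items()
        if terms <= set(text.lower().split())
    ]
    # Step 2: sort the relevant documents in decreasing order of importance.
    relevant.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in relevant]

print(search("taj mahal", DOCS))
print(search("history", DOCS))
```

Here `doc3` is important but not relevant to "taj mahal", so it never reaches the sorting step.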

 

Semantic Analysis

It is done to determine the semantic connectivity between words or phrases, i.e. how words and phrases are generally associated with each other. For example, ‘Statue of Liberty’ is commonly associated with ‘New York’. Similarly, ‘Agra’ is commonly associated with ‘Taj Mahal’. Search engines use different methods to determine semantic connectivity:

i. They use their own dictionaries and thesauri.

ii. They use Fuzzy Set Theory, i.e. search engines measure how often words/phrases are used together or in close proximity, and in what context they are used together.

iii. Topic Modeling – Through this method search engines try to work out mathematically the relationships between words or phrases, and whether a set of content is relevant to a search query. LDA (Latent Dirichlet Allocation), LSI (Latent Semantic Indexing), LSA (Latent Semantic Analysis), pLSA (probabilistic Latent Semantic Analysis) etc. are all different ways to implement topic modeling.
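As a rough illustration of "how words are used together", the sketch below simply counts how often two words appear in the same sentence. This is plain co-occurrence counting, a much simpler technique than fuzzy set theory or topic models such as LDA, and the sample sentences are made up.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered word pair appears in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        # Sort the unique words so each pair has one canonical (a, b) order.
        words = sorted(set(sentence.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs

# Hypothetical sample text, one short "sentence" per entry.
SENTENCES = [
    "agra taj mahal",
    "taj mahal agra india",
    "new york statue liberty",
]

counts = cooccurrence_counts(SENTENCES)
print(counts[("agra", "taj")])   # words that often appear together
print(counts[("agra", "york")])  # words that never appear together
```

A higher count suggests a stronger association between the two words, which is the intuition behind the ‘Agra’ and ‘Taj Mahal’ example.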

 

Link (or Citation) Analysis

Search engines do link analysis to measure the quantity and quality of inbound links (both internal and external) and citations to a web document. It is also done to separate useful web pages from spam (TrustRank).
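A simplified sketch of link analysis is shown below: a PageRank-style power iteration in which each page passes its score along its outbound links. When a set of trusted seed pages is supplied, the random jump goes only to those seeds, which is the basic idea behind TrustRank. The graph, the page names and the parameter values are assumptions for illustration, and the sketch assumes every link target is itself a key in the graph.

```python
def link_scores(graph, damping=0.85, iterations=50, seeds=None):
    """Power iteration over a link graph {page: [pages it links to]}.

    With seeds=None the random jump is uniform (PageRank-style).
    With a seed set the jump goes only to the seeds (TrustRank-style).
    """
    pages = list(graph)
    targets = seeds if seeds else pages
    jump = {p: (1.0 / len(targets) if p in targets else 0.0) for p in pages}
    score = dict(jump)
    for _ in range(iterations):
        new = {p: (1.0 - damping) * jump[p] for p in pages}
        for page, links in graph.items():
            if links:
                # Split this page's score evenly across its outbound links.
                share = damping * score[page] / len(links)
                for target in links:
                    new[target] += share
            else:
                # Dangling page: redistribute its score along the jump vector.
                for p in pages:
                    new[p] += damping * score[page] * jump[p]
        score = new
    return score

# Hypothetical graph: two pages linked with a trusted seed, two spam pages
# that only link to each other.
GRAPH = {
    "seed": ["good"],
    "good": ["seed"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}

print(link_scores(GRAPH))                  # uniform jump: spam pages still score
print(link_scores(GRAPH, seeds={"seed"}))  # trust flows only from the seed
```

With the seed set, the spam pages receive no score because no trusted page links to them, directly or indirectly.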

 


About the Author: Himanshu Sharma is the founder of seotakeaways.com, which provides SEO Consulting, PPC Management and Analytics Consulting services to medium and large businesses. He holds a bachelor's degree in ‘Internet Science’, is a member of the 'Digital Analytics Association', a Google Analytics Certified Individual and a Certified Web Analyst. He is also the founder of EventEducation.com and EventPlanningForum.net.


  • Amit

    Where do you get this type of information?

    • seo himanshu

      Through reading. No one is born with knowledge. You need to acquire it :)

  • Tim

    Can you elaborate on LSI? I don’t quite get it.

  • Lee

    What is the difference between ‘relevance’ and ‘importance’? They mean the same thing to me.

    • seo himanshu

      Relevance is related to the search query: it is the extent to which a web document is relevant to that query. For example, a web document that talks about ‘Taj Mahal’ is relevant to the search query ‘Taj Mahal history’. Importance is something off-page. Importance is used to determine whether the web document is the best document on ‘Taj Mahal’.

  • Jitender

    This is above my head, Himanshu. Your knowledge is growing. When will you give me the SEO project? I am waiting :)

  • John

    I think the way search engines work is far more complicated than you have assumed. Anyway, nice try.

    • seo himanshu

      Totally agree with you. I am just scratching the surface here.
