How Google SEARCH Works

Here is a quick overview

Google's "Inside Search" Presentation

How Search Works (Google engineer)

How Google Works (HowStuffWorks)

Google Guide - Google Explained in Text

Here are some questions to answer about Google:

  1. Was Google the "first" search engine?  If not, then who was and what happened to them?
    http://www.wordstream.com/articles/internet-search-engines-history
    Archie was "first" - it was used to archive (catalogue) files and ran from 1990 until 1998.
    A working version still exists, but you can't really use it.
    Archie was actually an FTP catalogue, not a Web search engine (it did change to that later).
    Something like EXCITE is probably closer to the "first WEB search engine".
    http://www.salientmarketing.com/seo-resources/search-engine-history/grandfather-search-engine.html
    Have a look at Excite -  www.excite.com
  2. Google has lots of servers - how many?  Where are they?  How big are they?
    http://www.extremetech.com/extreme/161772-microsoft-now-has-one-million-servers-less-than-google-but-more-than-amazon-says-ballmer
    Nobody knows the exact number, as Google tries to keep it secret.
    But it's probably more than 1 million.  They are stored in SERVER FARMS -
    over 30 large warehouses each containing 50,000 servers or so,
    maintained by dozens of technicians, and using large amounts of electricity.
    In the past, each server was basically a normal Intel PC. 
    Now they are slowly changing to "blades", which are more powerful
    but use less electricity.  But this stuff is all secret, so we don't know for sure.
  3. If you make a web-site (or web-page), will it appear in Google?
    Immediately?  Later?  Do you need to pay for this privilege?
    http://www.dummies.com/how-to/content/timing-googles-crawl.html
    http://www.boostsuite.com/2012/03/01/how-long-does-it-take-for-a-new-site-to-appear-in-the-google-search-results/
    If you have a real web-site, for example  http://fiscomp.weebly.com , then you can do an experiment.
    Add a sentence to one of your pages, like "Google servers ran Mac OSX in 1970." (nonsense of course).
    First make sure that the sentence does NOT already appear in Google.
    Then search for that exact sentence (in quotes) each day, and see how many days it takes
    until your sentence appears in Google.
    In general, Google claims that they "crawl" the entire Web every month or so.
    So you can expect your sentence to appear within a month.
    Some very popular, large, fast changing sites - like CNN.com - are crawled more often, probably daily.
    No, you cannot pay Google to list your page (although you can make a free request).
    You CAN pay Google to list your page as an AD, and then you pay them
    to display it in searches - but that can be very expensive, like thousands of Euros.
  4. Google's search engine is "free".  How can Google afford to offer this for free?
    http://www.channel4.com/news/if-google-is-free-how-does-it-make-so-much-money
    ADS.  Google makes a lot of profit by displaying ADS on search result pages.
    They charge "per click" or "clicks per day". 
    Google ads are popular because they are "targeted" - that means that you will
    see ads that are more likely to appeal to you.  For example, teenagers are unlikely
    to see ads for diapers, but people who have searched for "baby clothes" are more
    likely to receive diaper ads.
  5. How many people use Google each day?  How many searches are done each day?
    http://www.internetlivestats.com/google-search-statistics/
    Somewhere between 1 billion and 5 billion searches daily; the number obviously changes from day to day.
    Google gets around 75% of all the search requests in the world.
    There are around 3 billion Web users in total, but not all of them are online each day
    and not everyone uses Google every day, so a rough guess is something like 100 million Google users daily.
    Unique Google visitors = 187 million per month
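    A quick sanity check on these estimates: divide the searches per day by the users per day.
    This is only a sketch that takes the rough figures above at face value; the function name
    and the constant are made up for illustration.

```python
def searches_per_user(searches_per_day, users_per_day):
    """Average number of searches each daily user would have to make."""
    return searches_per_day / users_per_day

# Rough guess from the notes above, not a measured value.
DAILY_USERS = 100_000_000

low = searches_per_user(1_000_000_000, DAILY_USERS)   # 1 billion searches a day
high = searches_per_user(5_000_000_000, DAILY_USERS)  # 5 billion searches a day
print(low, high)
```

    With these figures each daily user would make between 10 and 50 searches a day,
    which suggests that the guess of 100 million daily users is probably on the low side.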
  6. What is SEO?  How can you force your page to be at the top of Google's search results?
    http://www.redevolution.com/seo-explained/    http://www.redevolution.com/what-is-seo/
    SEO = Search Engine Optimization.  This is about constructing web-pages in such a way
    that they will appear on the first page of a Google search result.
    In the past, this was done by putting lots of interesting KEYWORDS in the <HEAD> section
    of the web-page.  For example, if you look at the page source, you will find pages with "money" written
    lots of times, to attract users who have searched for something about "money and banks".
    Now the SEO "tricks" are a lot more sophisticated.  Web developers charge money
    to give advice about improving SEO.
    http://www.seomark.co.uk/how-does-google-rank-websites/
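    The old keyword trick can be spotted by counting the keywords in the <HEAD> section.
    The sketch below assumes the keywords sit in a <meta name="keywords"> tag, which was the usual
    place for them; the sample page and the names KeywordMetaParser and count_meta_keywords
    are invented for this example.

```python
from collections import Counter
from html.parser import HTMLParser

class KeywordMetaParser(HTMLParser):
    """Collects the keywords listed in <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Keywords are separated by commas; ignore case and extra spaces.
            self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]

def count_meta_keywords(html):
    """Return a Counter saying how often each keyword is repeated."""
    parser = KeywordMetaParser()
    parser.feed(html)
    return Counter(parser.keywords)

# Made-up page that repeats "money" to attract searches about money and banks.
SAMPLE_PAGE = """<html><head>
<title>Cheap loans</title>
<meta name="keywords" content="money, money, banks, money, loans, Money">
</head><body>Hello</body></html>"""

print(count_meta_keywords(SAMPLE_PAGE).most_common())
```

    A keyword that is repeated many times, like "money" here, is a sign of this old trick.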
  7. What is a "spider" or "crawler"?  How does it work?
    http://www.googleguide.com/google_works.html
    Google's servers must "crawl" (examine) "all" the web-pages in the entire web (visible pages) -
    at least that is the goal.  The servers run "spider" programs that "visit" a web-page, then follow
    all the links on that page, and then follow all the links on those pages (recursively).
    Eventually the spider must stop - maybe after "10 iterations" - and then "return home".
    At each web-page, the spider does some or all of the following:
    - make a list of all the words appearing on that page
    - save the list of all the words (or important words) in the INDEX servers,
       along with the URL of the page
    - make a copy of the page and store it in Google's "cache"
    - the spider doesn't do this, but some part of Google makes a list of
      all the pages that LINK TO the current page
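    The spider steps above can be sketched in a few lines. This is NOT Google's code - it is a toy
    that crawls a made-up "web" stored in a dictionary (TOY_WEB) instead of fetching real pages,
    and it stops after max_depth link hops. For brevity it also records the "links to" list in the
    same loop, although (as noted above) a different part of Google does that job.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (page text, list of outgoing links).
TOY_WEB = {
    "a.example": ("google search engine", ["b.example", "c.example"]),
    "b.example": ("search index words", ["c.example"]),
    "c.example": ("spider crawls pages", ["a.example", "d.example"]),
    "d.example": ("deep page", []),
}

def crawl(start_url, max_depth):
    """Visit pages breadth-first, following links at most max_depth hops away."""
    index = {}      # word -> set of URLs whose text contains that word
    cache = {}      # URL -> stored copy of the page text
    backlinks = {}  # URL -> set of URLs that LINK TO it
    visited = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        text, links = TOY_WEB.get(url, ("", []))
        cache[url] = text                       # keep a copy of the page
        for word in text.split():               # list the words on the page
            index.setdefault(word, set()).add(url)
        for target in links:
            backlinks.setdefault(target, set()).add(url)
            # Follow the link only if we are still within the depth limit.
            if depth < max_depth and target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return index, cache, backlinks

index, cache, backlinks = crawl("a.example", max_depth=1)
print(sorted(cache))
```

    With max_depth=1 the spider "returns home" before it reaches d.example; with a larger
    limit it visits all four pages.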
  8. When you search for something, like "where can I get free stuff online",
    how does Google decide which results to put on the first page?
    http://www4.ncsu.edu/~ipsen/ps/slides_recruit.pdf
    http://cristersmedia.com/how-do-i-get-my-website-on-the-first-page-of-google
    Google's original ALGORITHM for ranking pages was PAGE-RANK, which scores a page by its links.
    Search result placement originally depended on
    -  how many of the search terms actually appear in the page, and how often
    -  how many other pages LINK TO this page
    -  how popular the other linking pages are
    Now there are more like 200 "metrics" (measurements) that determine search result placement.
    Here is a clear, detailed explanation:
    http://www2.curriculum.edu.au/scis/connections/issue_58/how_does_google_collect_and_rank_results.html
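    The link-counting idea (a page is popular if popular pages link to it) can be sketched as a
    simple repeated calculation. This is a simplified textbook version, not Google's real code:
    the four pages in TOY_LINKS are invented, and the damping value 0.85 is just the figure that is
    usually quoted for PageRank.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to.  Returns page -> score."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}    # start with equal scores
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web of four pages; every link target is also a key.
TOY_LINKS = {
    "home": ["news", "shop"],
    "news": ["home"],
    "shop": ["home"],
    "blog": ["home", "news"],
}

scores = pagerank(TOY_LINKS)
best = max(scores, key=scores.get)
print(best)
```

    Here "home" gets the top score because three pages link to it, and "blog" gets the lowest
    score because nothing links to it.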
  9. Does Google "know" what you like?  Do they give different search results
    to different people?  How is that possible?
    http://searchengineland.com/google-now-personalizes-everyones-search-results-31195
    Google "remembers" what items you clicked on, as well as what things you searched for,
    over a period of time (up to 180 days).  They use a COOKIE stored in your browser
    to recognise you and connect you to this history.  Each time you visit Google, your browser sends the
    cookie back, and Google uses that information to help determine the order
    of search results, as well as deciding which ads to display on your search-results page.
  10. Describe several different ways of measuring web-page "popularity".
    http://webdesign.about.com/od/analytics/qt/what_to_track.htm
    You need to read the article above.  A summary won't help much.
  11. What Operating System is used on Google's servers?  Why?
    http://lwn.net/Articles/357658/
    That's an easy one - Google uses Linux on their servers.  But they modify it to do
    EXACTLY what they need and to OPTIMIZE performance.  They don't use
    any of the "standard" Linux distributions, although it is widely believed that
    the Google system is based on Debian (but it's a trade secret, so no guarantees).
    Since the servers need no user interface, they are only using the Linux KERNEL,
    which is a small part of the total OS.  So saying they "use Linux" is a bit misleading.
    What is definitely true is that they are NOT using Windows or Mac OS.
  12. What is an "index"?  How does Google make an index?
    http://socialwebqanda.com/2011/10/how-does-google-process-search-queries-so-fast/
    http://www.youtube.com/watch?v=KyCYyoGusqs
    Read and watch the summaries above.
    The very short version - a search index is similar to the index in the back of the book.
    It contains a list of words, with the addresses (like page numbers) of corresponding web-pages.
    But the Google Web Index contains a lot more information, like the proximity of two
    words on a page.  So if you search for "world champion", the top results will contain
    "world" and "champion" right next to each other, while lower-down search results
    may contain "world" and "champion", but far away from each other.
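    The "back of the book" idea, including the proximity of two words, can be sketched like this.
    It is only a toy: the three pages in TOY_PAGES are invented, and build_index and
    search_two_words are names made up for this example, not anything Google uses.

```python
def build_index(pages):
    """pages: URL -> text.  Returns word -> {URL: [positions of the word]}."""
    index = {}
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(url, []).append(position)
    return index

def search_two_words(index, first, second):
    """Return the URLs containing both words, closest pairs of words first."""
    first_hits = index.get(first, {})
    second_hits = index.get(second, {})
    results = []
    for url in first_hits.keys() & second_hits.keys():
        # Smallest distance between any occurrence of the two words.
        gap = min(abs(a - b) for a in first_hits[url] for b in second_hits[url])
        results.append((gap, url))
    return [url for gap, url in sorted(results)]

# Hypothetical pages for the "world champion" example.
TOY_PAGES = {
    "near.example": "she became world champion in 2010",
    "far.example": "the world is large and every champion trains hard",
    "none.example": "a page about the world only",
}

toy_index = build_index(TOY_PAGES)
print(search_two_words(toy_index, "world", "champion"))
```

    The page with "world" and "champion" next to each other comes first, and the page that
    contains only one of the two words does not appear at all.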
  13. Does Google index the whole web, or only part of it?  Why?
    http://venturebeat.com/2013/03/01/how-google-searches-30-trillion-web-pages-100-billion-times-a-month/
    http://www.quora.com/What-percentage-of-the-web-does-Google-index-and-how-has-it-changed-over-time
    Best guess = Google indexes only about 5% of the Web (the "Surface Web").  The rest is the "Deep Web".
    Although popular myths claim that the Deep Web is full of criminal activities,
    there are actually lots of normal, legal reasons for "hiding" a web-page:
    - contains personal information, maybe personal data like a passport photo
    - contains information that is useless to most people, like the record of all
       the students entering and leaving a school building
    - confidential information like names of students in a school, which could
       be dangerous to publish if a student might be a target for terrorists
    - banking information, like your current bank balance
    - your personal calendar - it is just nobody else's business
    - copyrighted material that you are allowed to store and wish
       to store "in the cloud", but should not be available for other people
    Most of this info is hidden behind passwords, but can also be protected
    by a firewall.  For example, our school blocks FTP access at our firewall.
    These protections prevent Google from crawling the protected web-pages.
  14. Describe how search engines functioned in the 1990s.
    Compare this to how Google's search engine works.
    http://www.wordstream.com/articles/internet-search-engines-history
    Have a look at http://www.excite.com, which started in 1994 and is still active.
    Those old search engines were largely maintained "by hand", meaning that
    human beings recorded links in the search index database.  There was little automation.
    Modern search engines are almost 100% automated - using spiders to crawl the web -
    because there is just too much data that needs to be organized.
  15. Describe how search engine effectiveness could improve in the future.
    http://www.search-marketing.info/future-of-search-engines/
    http://www.mycommerce.com/blog/item/452-what-will-search-engines-look-like-in-10-years
    That's an easy one - Semantic Web.  We will study this in the near future.
    This question is about the INTERNAL efficiency of computer systems, not the user interface.
    So a prettier or easier interface is NOT an improvement in this sense.