How Google SEARCH Works
Here is a quick overview
Google's "Inside Search" Presentation
How Search Works (Google engineer)
How Google Works (HowStuffWorks)
Google Guide - Google Explained in Text
Here are some questions to answer about Google:
- Was Google the "first" search engine? If not, then who
was and what happened to them?
http://www.wordstream.com/articles/internet-search-engines-history
Archie was "first" - it was used to archive info, from 1990 until 1998. A working version still exists, but you can't use it.
Archie was actually an FTP catalogue, not a Web search engine (it changed to that later).
Something like EXCITE is probably closer to the "first WEB search engine"
http://www.salientmarketing.com/seo-resources/search-engine-history/grandfather-search-engine.html
Have a look at Excite - www.excite.com
- Google has lots of servers - how many? Where are
they? How big are they?
http://www.extremetech.com/extreme/161772-microsoft-now-has-one-million-servers-less-than-google-but-more-than-amazon-says-ballmer
Nobody knows the exact number, as Google tries to keep it secret.
But it's probably more than 1 million. They are stored in SERVER FARMS -
over 30 large warehouses each containing 50,000 servers or so,
maintained by dozens of technicians, and using large amounts of electricity.
In the past, each server was basically a normal Intel PC.
Now they are slowly changing to "blades", which are more powerful
but use less electricity. But this stuff is all secret, so we don't know for sure.
- If you make a web-site (or web-page), will it appear in
Google?
Immediately? Later? Do you need to pay for this
privilege?
http://www.dummies.com/how-to/content/timing-googles-crawl.html
http://www.boostsuite.com/2012/03/01/how-long-does-it-take-for-a-new-site-to-appear-in-the-google-search-results/
If you have a real web-site, for example http://fiscomp.weebly.com , then you can do an experiment.
Add a sentence to one of your pages, like "Google servers ran Mac OSX in 1970." (nonsense of course).
First make sure that the sentence does NOT already appear in Google.
Then search for that exact sentence (in quotes) every day, and see how many days
it takes until your sentence appears in Google.
In general, Google claims that they "crawl" the entire Web every month or so.
So you can expect your sentence to appear within a month.
Some very popular, large, fast changing sites - like CNN.com - are crawled more often, probably daily.
No, you cannot pay Google to list your page (although you can make a free request).
You CAN pay Google to list your page as an AD, and then you pay them
to display it in searches - but that can be very expensive, possibly thousands of Euros.
- Google's search engine is "free". How can Google afford
to offer this for free?
http://www.channel4.com/news/if-google-is-free-how-does-it-make-so-much-money
ADS. Google makes a lot of profit by displaying ADS on search result pages.
They charge "per click" or "clicks per day".
Google ads are popular because they are "targeted" - that means that you will
see ads that are more likely to appeal to you. For example, teenagers are unlikely
to see ads for diapers, but people who have searched for "baby clothes" are more
likely to receive diaper ads.
- How many people use Google each day? How many searches
are done each day?
http://www.internetlivestats.com/google-search-statistics/
Somewhere between 1 billion and 5 billion searches daily - the number obviously changes from day to day.
Google gets around 75% of all the search requests in the world.
There are around 3 billion Web users in total, but not all are online each day
and not everyone uses Google every day, so a rough estimate is something like 100 million Google users daily.
Unique Google visitors = 187 million per month
- What is SEO? How can you force your page to be at the
top of Google's search results?
http://www.redevolution.com/seo-explained/
http://www.redevolution.com/what-is-seo/
SEO = Search Engine Optimization. This is about constructing web-pages in such a way
that they will appear on the first page of a Google search result.
In the past, this was done by putting lots of interesting KEYWORDS in the <HEAD> section
of the web-page. For example, if you look at the page source, you will find pages with "money" written
lots of times, to attract users who have searched for something about "money and banks".
Now the SEO "tricks" are a lot more sophisticated. Web developers charge money
to give advice about improving SEO.
http://www.seomark.co.uk/how-does-google-rank-websites/
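The old keyword trick can be seen by reading the keywords out of a page's <HEAD> section. Here is a minimal Python sketch, assuming the keywords sit in a <meta name="keywords"> tag (one common place for them). The sample page and the class name KeywordReader are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

class KeywordReader(HTMLParser):
    """Collect the contents of <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Keywords are separated by commas; ignore empty entries.
            self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]

# Made-up page that repeats "money" to attract searches about money and banks.
PAGE = """<html><head>
<title>Bank</title>
<meta name="keywords" content="money, banks, money, loans, money">
</head><body>Welcome</body></html>"""

reader = KeywordReader()
reader.feed(PAGE)
counts = Counter(reader.keywords)
print(counts.most_common(1))   # [('money', 3)]
```

A search engine that trusted these counts would be easy to fool, which is one reason the SEO "tricks" had to become more sophisticated.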
- What is a "spider" or "crawler"? How does it work?
http://www.googleguide.com/google_works.html
Google's servers must "crawl" (examine) "all" the web-pages in the entire web (visible pages) -
at least that is the goal. The servers run "spider" programs that "visit" a web-page, then follow
all the links on that page, and then follow all the links on those pages (recursively).
Eventually the spider must stop - maybe after "10 iterations" - and then "return home".
At each web-page, the spider does some or all of the following:
  - make a list of all the words appearing on that page
  - save the list of all the words (or important words) in the INDEX servers,
    along with the URL of the page
  - make a copy of the page and store it in Google's "cache"
  - the spider doesn't do this, but some part of Google makes a list of
    all the pages that LINK TO the current page
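The crawl described above can be sketched in a few lines of Python. This is only a sketch, not Google's actual spider. The tiny WEB dictionary stands in for the real web (the page names and contents are made up), max_depth plays the role of the "10 iterations" limit, and a queue is used instead of real recursion.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (page text, links on that page).
WEB = {
    "a.html": ("google search engine", ["b.html", "c.html"]),
    "b.html": ("search index words", ["c.html"]),
    "c.html": ("spider crawls links", ["a.html", "d.html"]),
    "d.html": ("deep page", []),
}

def crawl(start, max_depth):
    """Visit pages breadth-first, following links up to max_depth hops."""
    index = {}            # word -> set of URLs containing that word
    cache = {}            # URL -> stored copy of the page text
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        text, links = WEB[url]
        cache[url] = text                           # keep a copy of the page
        for word in text.split():                   # list the words on the page
            index.setdefault(word, set()).add(url)
        if depth < max_depth:                       # stop following links eventually
            for link in links:
                if link in WEB and link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
    return index, cache

index, cache = crawl("a.html", max_depth=1)
print(sorted(cache))             # ['a.html', 'b.html', 'c.html']
print(sorted(index["search"]))   # ['a.html', 'b.html']
```

With max_depth=1 the spider never reaches d.html, because that page is two links away from the start page. The visited set stops the spider from going around in circles when pages link back to each other.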
- When you search for something, like "where can I get free
stuff online",
how does Google decide which results to put on the first page?
http://www4.ncsu.edu/~ipsen/ps/slides_recruit.pdf
http://cristersmedia.com/how-do-i-get-my-website-on-the-first-page-of-google
Google's original ALGORITHM for ranking pages was PAGE-RANK, combined with matching the search terms.
Search result placement originally depended on
  - how many of the search terms actually appear in the page, and how often (text matching)
  - how many other pages LINK TO this page (PageRank)
  - how popular the other linking pages are (PageRank)
Now there are more like 200 "metrics" (measurements) that determine search result placement.
Here is a clear, detailed explanation:
http://www2.curriculum.edu.au/scis/connections/issue_58/how_does_google_collect_and_rank_results.html
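The link-counting idea can be tried out on a tiny example. This is a simplified Python sketch of the PageRank idea, not Google's production code. The four page names are made up, the damping value 0.85 is the conventional textbook choice, and the number of iterations is an assumption.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal ranks
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no links shares its rank with every page.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share       # a link passes on some rank
        rank = new_rank
    return rank

# Made-up link structure: "home" is linked to by every other page.
links = {
    "home": ["news", "shop"],
    "news": ["home"],
    "shop": ["home"],
    "blog": ["home", "news"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example "home" ends up with the highest rank because every other page links to it, and "blog" ends up lowest because nothing links to it. A link from a popular page is worth more than a link from an unpopular one, because each page passes on a share of its own rank.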
- Does Google "know" what you like? Do they give different
search results
to different people? How is that possible?
http://searchengineland.com/google-now-personalizes-everyones-search-results-31195
Google "remembers" what items you clicked on, as well as what things you searched for,
over a period of time (up to 180 days). They use COOKIES stored inside your computer
to recognize your browser and link it to this history. Each time you visit Google,
Google reads these cookies, then uses the linked history to help determine the order
of search results, as well as deciding which ads to display on your search-results-page.
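How a click history could change the order of results can be sketched in Python. Everything here is an assumption made for illustration: the CLICK_HISTORY table, the cookie IDs, the site names, and the simple rule "sites you clicked more often move up". Google's real method is not public.

```python
from collections import Counter

# Hypothetical history store: cookie ID -> number of clicks per site.
CLICK_HISTORY = {
    "cookie-123": Counter({"sports.example": 5, "news.example": 1}),
}

def personalize(results, cookie_id):
    """Move sites this browser clicked often toward the top.

    sorted() is stable, so sites with equal click counts keep their
    original order.
    """
    clicks = CLICK_HISTORY.get(cookie_id, Counter())
    return sorted(results, key=lambda site: -clicks[site])

results = ["shop.example", "news.example", "sports.example"]
print(personalize(results, "cookie-123"))
# ['sports.example', 'news.example', 'shop.example']
print(personalize(results, "unknown-cookie"))
# ['shop.example', 'news.example', 'sports.example']
```

Two people typing the same search get the same list of pages here, but in a different order, and a browser with no known cookie gets the unpersonalized order.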
- Describe several different ways of measuring web-page
"popularity".
http://webdesign.about.com/od/analytics/qt/what_to_track.htm
You need to read the article above. A summary won't help much.
- What Operating System is used on Google's servers? Why?
http://lwn.net/Articles/357658/
That's an easy one - Google uses Linux on their servers. But they modify it to do
EXACTLY what they need and to OPTIMIZE performance. They don't use
any of the "standard" Linux distributions, although it is widely believed that
the Google system is based on Debian (but it's a trade secret, so no guarantees).
Since the servers need no user interface, they are only using the Linux KERNEL,
which is a small part of the total OS. So saying they "use Linux" is a bit misleading.
What is definitely true is that they are NOT using Windows or Mac OS.
- What is an "index"? How does Google make an index?
http://socialwebqanda.com/2011/10/how-does-google-process-search-queries-so-fast/
http://www.youtube.com/watch?v=KyCYyoGusqs
Read and watch the summaries above.
The very short version - a search index is similar to the index in the back of a book.
It contains a list of words, with the addresses (like page numbers) of corresponding web-pages.
But the Google Web Index contains a lot more information, like the proximity of two
words on a page. So if you search for "world champion", the top results will contain
"world" and "champion" right next to each other, while lower-down search results
may contain "world" and "champion", but far away from each other.
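The book-index idea, including word proximity, can be sketched in Python. This is a minimal sketch with three made-up pages. The function names build_index and search are invented for this example, and a real index stores far more information than word positions.

```python
def build_index(pages):
    """Return word -> {url: [positions of the word on that page]}."""
    index = {}
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index.setdefault(word, {}).setdefault(url, []).append(position)
    return index

def search(index, first, second):
    """Rank pages containing both words by the smallest gap between them."""
    hits_first = index.get(first, {})
    hits_second = index.get(second, {})
    results = []
    for url in hits_first.keys() & hits_second.keys():   # pages with both words
        gap = min(abs(a - b) for a in hits_first[url] for b in hits_second[url])
        results.append((gap, url))
    return [url for gap, url in sorted(results)]

# Made-up pages for illustration.
pages = {
    "page1.html": "she is the world champion in chess",
    "page2.html": "the champion travelled all around the world",
    "page3.html": "world news and weather",
}
index = build_index(pages)
print(search(index, "world", "champion"))   # ['page1.html', 'page2.html']
```

page1.html comes first because "world" and "champion" are right next to each other there. page3.html does not appear at all because it contains only one of the two words. The search never reads the pages themselves, only the index, which is why it can be fast.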
- Does Google index the whole web, or only part of it?
Why?
http://venturebeat.com/2013/03/01/how-google-searches-30-trillion-web-pages-100-billion-times-a-month/
http://www.quora.com/What-percentage-of-the-web-does-Google-index-and-how-has-it-changed-over-time
Best guess: Google indexes only about 5% of the web - the "Surface Web". The rest is the "deep web".
Although popular myths claim that the Deep Web is full of criminal activities,
there are actually lots of normal, legal reasons for "hiding" a web-page:
  - personal information, maybe personal data like a passport photo
  - information that is useless to most people, like the record of all
    the students entering and leaving a school building
  - confidential information like the names of students in a school, which could
    be dangerous to publish if a student might be a target for terrorists
  - banking information, like your current bank balance
  - your personal calendar - it is just nobody else's business
  - copyrighted material that you are allowed to store and wish
    to store "in the cloud", but that should not be available to other people
Most of this info is hidden behind passwords, but can also be protected
by a firewall. For example, our school blocks FTP access at our firewall.
These protections prevent Google from crawling the protected web-pages.
- Describe how search engines functioned in the 1990s.
Compare this to how Google's search engine works.
http://www.wordstream.com/articles/internet-search-engines-history
Have a look at http://www.excite.com, which started in 1994 and is still active.
Many of those old search sites were largely maintained "by hand", meaning that
human beings recorded links in the search index database. There was little automation.
Modern search engines are almost 100% automated - using spiders to crawl the web -
because there is just too much data that needs to be organized.
- Describe how search engine effectiveness could improve in the
future.
http://www.search-marketing.info/future-of-search-engines/
http://www.mycommerce.com/blog/item/452-what-will-search-engines-look-like-in-10-years
That's an easy one - Semantic Web. We will study this in the near future.
This question is about the INTERNAL efficiency of computer systems, not the user interface.
So a prettier or easier interface is NOT an improvement in this sense.