Google crawls the web to find new pages. It then indexes those pages to understand what they are about and ranks them based on the information it has gathered. Crawling and indexing are two different processes, but both are carried out by a crawler, also known as a spider.
In this guide, we've put together everything an SEO professional needs to know about crawlers. Read on to learn what Google's crawler is, how it works, and how you can make it interact more efficiently with your website.
Google's crawler (also called a searchbot or spider) is software that Google and other search engines use to crawl the web. Simply put, it "browses" the web from page to page, looking for new or updated content that Google doesn't already have in its databases.
Every search engine has its own bots. As for Google, there are more than 15 different types of crawlers, and Google's main crawler is called Googlebot. Googlebot performs both crawling and indexing, so let's take a closer look at how it works.
How does Google's spider work?
Google (or any other search engine) does not have a central URL register that is updated every time a new page is created. This means that Google is not automatically "notified" about new pages, but has to find them online. Googlebot constantly scans the Internet for new pages and adds them to Google's database of existing pages.
When Googlebot finds a new page, it renders it in a browser, loading all of the HTML, third-party code, JavaScript, and CSS. This information is stored in the search engine's database and then used to index and rank the page. If a page gets indexed, it is added to the Google Index - another very large Google database.
Crawling, rendering, indexing
How does the Google crawler see pages?
Google's crawler renders the page in the latest version of the Chromium browser. In a perfect scenario, the crawler "sees" the page exactly as you designed and built it. In a realistic scenario, things can turn out to be more complicated.
Googlebot Desktop and Googlebot Smartphone
Googlebot crawls the web as both Googlebot Desktop and Googlebot Smartphone. This division is needed to index pages for both desktop and mobile SERPs.
A few years ago, Google used the desktop crawler to visit and render most pages. But things changed with the introduction of the mobile-first concept. Google decided the world had become mobile-friendly enough and started using Googlebot Smartphone to crawl, index, and rank the mobile versions of websites for both mobile and desktop SERPs.
However, adopting mobile-first indexing has proven harder than expected. The internet is huge, and most websites appear to be poorly optimized for mobile devices. That's why Google applies mobile-first crawling and indexing to new websites and to those older websites that are fully optimized for mobile. If a site isn't mobile-friendly, it is crawled and rendered by Googlebot Desktop first.
Even if your site has switched to mobile-first indexing, Googlebot Desktop will still crawl some of your pages because Google wants to see how your site performs on desktop. Google doesn't directly say it will index your desktop version if it differs significantly from the mobile version. Still, it is logical to assume that Google's main goal is to provide users with the most useful information. And Google doesn't want to lose that data by blindly following the mobile-first concept.
Note! In any case, your website will be visited by both Googlebot Mobile and Googlebot Desktop. That's why it's important to take care of both versions of your website and consider using a responsive layout if you haven't already.
How do you know whether Google crawls and indexes your site with the mobile-first concept? You will have received a special notification in Google Search Console.
Google Search Console Mobile-First Crawling
Source: Search Engine Land
HTML and JavaScript Rendering
Googlebot can have trouble processing and rendering bulky code. If your page's code is messy, the crawler may fail to render it properly and will treat your page as empty.
As for JavaScript rendering, keep in mind that JavaScript is a rapidly evolving language, and Googlebot may sometimes fail to support its latest versions. Make sure your JS is compatible with Googlebot, or your page may not render correctly.
Mind your JavaScript load time as well. If a script takes more than 5 seconds to load, Googlebot will not render or index the content generated by that script.
Note! If your website is full of heavy JavaScript elements and you can't live without them, Google recommends server-side rendering. This will make your website load faster and avoid JavaScript errors.
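If you suspect that some important content only appears after JavaScript runs, one quick check is to look at the raw HTML the server returns before any scripts execute. Here is a minimal sketch of that idea in Python; the URL, the phrase, and the use of the requests library are illustrative assumptions, not part of any Google tooling.

```
# Minimal sketch: does a key piece of content exist in the initial HTML,
# i.e. before any JavaScript runs? The URL and phrase are placeholders.
import requests

URL = "https://example.com/product-page"   # placeholder URL
PHRASE = "Add to cart"                      # placeholder text that JS may inject

raw_html = requests.get(URL, timeout=10).text

if PHRASE in raw_html:
    print("The phrase is present in the initial HTML - visible without running JS.")
else:
    print("The phrase is missing from the initial HTML - it likely depends on client-side JS.")
```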
To see which resources on your page cause rendering problems (and check whether you have any problems at all), log in to your Google Search Console account, go to URL Inspection, enter the URL you want to check, click Test Live URL, and then click View Tested Page.
The tested page in Google Search Console
Then go to the More Info section and click the Page resources and JavaScript console messages folders to see the list of resources that Googlebot failed to render.
Resources that failed to render
Now you can show this list of issues to your webmasters and ask them to investigate and fix the errors so that Googlebot can render the content correctly.
What influences spider behavior?
Googlebot's behavior is not chaotic - it is governed by sophisticated algorithms that help the crawler navigate the web and set rules for how it processes information.
However, you can't simply leave things to Google's algorithms and hope for the best. Let's take a closer look at what influences crawler behavior and how you can optimize the crawling of your pages.
If Google already knows your site, Googlebot regularly checks your main pages for updates. That's why it's important to place links to new pages on the authoritative pages of your site - ideally, on the homepage.
You can enrich the homepage with a block of latest news or blog posts, even if you have separate news and blog pages. This will let Googlebot find your new pages much faster. The recommendation might seem obvious, but many website owners still overlook it, which results in poor indexing and low rankings.
When it comes to crawling, backlinks work the same way - Google will find your page faster if it's linked from a credible and popular external site. So when you add a new page, don't forget about external marketing. You can try guest posting, running an ad campaign, or otherwise getting Googlebot to see your new URL.
Note: Links should be dofollow for Googlebot to follow them. Although Google recently stated that nofollow links can also be used as crawling and indexing hints, we still recommend using dofollow links - just to make sure Google's crawlers actually see the page.
Click depth
Click depth indicates how far a page is from the homepage - how many steps it takes Googlebot to reach the page. Ideally, every page of a website should be reachable within 3 clicks. Excessive click depth slows down crawling and does little good for the user experience.
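To make the idea concrete, click depth can be estimated with a simple breadth-first crawl from the homepage, counting how many link hops away each internal page is. The sketch below is a rough Python illustration with a placeholder start URL; a real crawler would also respect robots.txt, handle redirects, and throttle its requests.

```
# Minimal sketch: estimate click depth with a breadth-first crawl from the homepage.
from collections import deque
from urllib.parse import urljoin, urlparse
import re
import requests

START = "https://example.com/"   # placeholder homepage
DOMAIN = urlparse(START).netloc
MAX_DEPTH = 3                    # the recommended click-depth limit

def internal_links(page_url, html):
    # Rough href extraction with a regex; a real crawler would use an HTML parser.
    for href in re.findall(r'href=["\'](.*?)["\']', html):
        absolute = urljoin(page_url, href.split("#")[0])
        if urlparse(absolute).netloc == DOMAIN:
            yield absolute

depth = {START: 0}
queue = deque([START])
while queue:
    url = queue.popleft()
    if depth[url] >= MAX_DEPTH:      # stop expanding once we pass 3 clicks
        continue
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    for link in internal_links(url, html):
        if link not in depth:
            depth[link] = depth[url] + 1
            queue.append(link)

for page, clicks in sorted(depth.items(), key=lambda item: item[1]):
    print(clicks, page)
```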
With WebSite Auditor, you can check whether your website has click depth issues. Launch the tool, go to Site Structure -> Pages, and pay attention to the "Click Depth" column.
Website Auditor Measures Click Depth
If you find that some important pages are too far from the homepage, revise your website structure. A good structure should be simple and scalable, so you can add as many new pages as you need without negatively affecting click depth or keeping Google's crawler from reaching the pages.
simple and extensible structure
Sitemap
A sitemap is a document that contains the full list of pages you want to appear on Google. You can submit your sitemap to Google via Google Search Console (Index -> Sitemaps) to let Googlebot know which pages to visit and crawl. A sitemap also tells Google whether there are any updates on your pages.
Note! A sitemap does not guarantee that Googlebot will use it when crawling your site. The crawler can ignore your sitemap and keep exploring the site however it sees fit. Still, no one gets penalized for having a sitemap, and in most cases it proves useful. Some CMSs even generate, update, and submit a sitemap to Google automatically, making your SEO process faster and easier. Consider submitting a sitemap if your site is new or large (has more than 500 URLs).
WebSite Auditor lets you create a sitemap. Go to Settings -> XML Sitemap Settings -> Sitemap Creation and set the desired options. Name the sitemap (sitemap filename) and either download it to your computer to submit to Google or publish it on your website (publish sitemap).
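If you're curious what the sitemap file itself looks like, here is a minimal Python sketch that builds one with the standard library. The URLs and dates are placeholders; in practice your CMS or a tool like WebSite Auditor would generate this file for you.

```
# Minimal sketch: build a small XML sitemap by hand. URLs and dates are placeholders.
from xml.etree.ElementTree import Element, SubElement, ElementTree

pages = [
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/blog/new-post", "2024-01-20"),
]

urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in pages:
    url = SubElement(urlset, "url")
    SubElement(url, "loc").text = loc
    SubElement(url, "lastmod").text = lastmod  # tells Google when the page last changed

ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```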
When crawling and indexing your pages, Google follows certain instructions, such as robots.txt, the noindex tag, the robots meta tag, and the X-Robots-Tag.
Robots.txt is a file in your site's root directory that restricts certain pages or content elements from Google. Once Googlebot discovers your page, it looks at the robots.txt file. If the discovered page is restricted from crawling by robots.txt, Googlebot stops crawling and loading any content and scripts from that page, and the page will not appear in search.
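As a rough illustration of how crawlers read this file, Python's built-in robots.txt parser can tell you whether a given URL is open to Googlebot; the domain and path below are placeholders.

```
# Minimal sketch: check whether Googlebot may crawl a URL according to robots.txt.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder domain
parser.read()  # downloads and parses the robots.txt file

# True if robots.txt allows Googlebot to fetch this URL, False otherwise
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))
```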
You can create a robots.txt file in WebSite Auditor (Settings -> Robots.txt Settings).
Robots.txt settings in WebSite Auditor
The noindex tag, robots meta tag, and X-Robots-Tag are tags used to keep crawlers from crawling or indexing a page. The noindex tag keeps the page from being indexed by all types of crawlers. The robots meta tag is used to specify how one particular page should be crawled and indexed - this means you can keep certain types of crawlers from visiting the page while leaving it open to others. The X-Robots-Tag can be used as an element of the HTTP header response to restrict indexing or control the crawling behavior of the page. This tag lets you target specific types of crawlers (if specified). If the robot type is not specified, the instructions apply to all types of Google crawlers.
Note! The robots.txt file does not guarantee that a page will be excluded from indexing. Googlebot treats this document as a recommendation rather than an order, which means Google can ignore robots.txt and still index the page for search. If you want to make sure a page is not indexed, use the noindex tag.
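To double-check which directive a page actually serves, you can look for noindex in both the X-Robots-Tag header and the robots meta tag. The sketch below is a rough Python check (it uses the requests library and a simple regex, and ignores crawler-specific meta tags such as name="googlebot"); the URL is a placeholder.

```
# Minimal sketch: does a URL carry a "noindex" directive?
# Checks the X-Robots-Tag HTTP header and the generic robots meta tag only.
import re
import requests

def has_noindex(url):
    resp = requests.get(url, timeout=10)

    # 1. HTTP header, e.g. "X-Robots-Tag: noindex, nofollow"
    if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
        return True

    # 2. <meta name="robots" content="noindex"> in the HTML source
    #    (rough regex; assumes name comes before content)
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        resp.text, re.IGNORECASE)
    return bool(meta and "noindex" in meta.group(1).lower())

print(has_noindex("https://example.com/private-page"))  # placeholder URL
```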
Are all pages available for crawling and indexing?
No. Some pages may be unavailable for crawling and indexing by Google. Let's take a closer look at these types of pages:
Password-protected pages. Googlebot simulates the behavior of an anonymous user who has no credentials to access protected pages. So if a page is protected with a password, it won't be crawled, because Googlebot can't reach it.
Pages excluded by indexing instructions. These are pages hidden from Google by robots.txt directives, the noindex tag, the robots meta tag, or the X-Robots-Tag.
Orphan pages. Orphan pages are pages that are not linked to from any other page on the site. Googlebot is a spider-type robot, which means it discovers new pages by following the links it finds. If there are no links pointing to a page, the page won't be crawled and won't appear in search (see the sketch after this section for one way to spot such pages).
Crawling and indexing of certain pages can also be restricted intentionally. These are usually pages that aren't meant to appear in search: pages with personal data, policies, terms of use, test versions of pages, archive pages, internal search results pages, etc.
If you want your pages to be available to Google's crawlers and bring you traffic, make sure you don't password-protect public pages, that the pages are linked to (internally and externally), and carefully review your indexing instructions.
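One rough way to spot orphan pages is to compare the URLs listed in your sitemap with the URLs a crawl starting from the homepage can actually reach. The Python sketch below assumes a local sitemap.xml and uses a placeholder set of crawled URLs; in practice, that set would come from something like the click-depth crawl shown earlier.

```
# Minimal sketch: pages listed in the sitemap but never reached by following
# internal links are candidates for orphan pages.
from xml.etree.ElementTree import parse

def load_sitemap_urls(path):
    # <loc> elements live in the sitemap XML namespace
    ns = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
    return {el.text.strip() for el in parse(path).iter(ns + "loc")}

sitemap_urls = load_sitemap_urls("sitemap.xml")  # placeholder local file

# Placeholder: in practice, the set of pages discovered by crawling internal links
crawled_urls = {"https://example.com/", "https://example.com/blog/new-post"}

for url in sorted(sitemap_urls - crawled_urls):
    print("Possible orphan page (no internal links found):", url)
```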
To check whether your website's pages are available for crawling and indexing, go to Index -> Coverage report in Google Search Console. Look for issues marked Error (not indexed) and Valid with warning (indexed, though with issues).
Google Search Console Coverage report
Learn more about crawling and indexing issues and how to fix them in our comprehensive Google Search Console guide.
You can also run a more in-depth indexing audit with WebSite Auditor. The tool not only shows issues found on pages that are available for indexing, but also lists the pages Google hasn't seen yet. Launch the software and go to Site Structure -> Site Audit.
Site audit in WebSite Auditor
Note! If you don't want Googlebot to find or update certain pages (old pages, pages you no longer need), remove them from your sitemap if you have one, set the 404 Not Found status, or mark them with the noindex tag.
When will my site appear in search?
Obviously, your pages won't appear in search right after you publish your site. If your website is brand new, it will take Googlebot some time to find it on the web. Keep in mind that this can take up to 6 months in some cases.
If Google already knows your site and you've made updates or added new pages, how quickly the changes appear in search depends on your crawl budget.
Crawl budget is the amount of resources Google spends on crawling your site's content. The more resources Googlebot needs to crawl your site, the slower your content will appear in search.
Crawl budget allocation depends on the following factors:
Site popularity. The more popular a website is, the more crawling resources Google is willing to spend on it.
Update frequency. The more often you update your pages, the more crawling resources your site gets.
Number of pages. The more pages you have, the bigger your crawl budget.
Server capacity. Your hosting servers must be able to respond to crawler requests in time.
Note that the crawl budget is not spent evenly on every page, since some pages consume more resources (because of heavy JavaScript and CSS or messy HTML). So the allocated crawl budget may not be enough to crawl all of your pages as quickly as you'd expect.
Apart from serious code issues, duplicate content and poorly structured URLs are among the most common causes of inefficient crawling and irrational crawl budget spending.
Duplicate content issues
Duplicate content means having several pages with largely similar content. This can happen for many reasons, such as:
accessing the site in different ways: with or without www, through http or https;
Dynamic URLs – when many different URLs point to the same page;
A/B Test Page Versions.
If duplicate content issues aren't resolved, Googlebot will crawl the same page several times, thinking these are all different pages. This wastes crawling resources and may keep Googlebot from finding other important pages on your site. On top of that, duplicate content lowers your pages' positions in search, as Google may decide that the overall quality of your site is low.
The truth is that, in most cases, you can't get rid of the things that cause duplicate content. But you can prevent duplicate content issues by setting canonical URLs. The canonical tag signals which page should be considered the "main" one, so the rest of the URLs pointing to the same page won't be indexed and your content won't be duplicated. You can also use the robots.txt file to restrict crawlers from visiting dynamic URLs.
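Alongside canonical tags, it helps to understand which URL variants count as duplicates in the first place. The Python sketch below collapses common variants (http vs https, www vs non-www, tracking parameters, trailing slashes) into one form; the choice of https and the non-www host is an assumption - use whichever variant your canonical tags actually point to.

```
# Minimal sketch: normalize duplicate URL variants to a single canonical form.
# The preferred scheme/host and the tracking-parameter list are assumptions.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def canonicalize(url):
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower().removeprefix("www.")           # drop the www variant
    params = [(k, v) for k, v in parse_qsl(query) if k not in TRACKING_PARAMS]
    path = path.rstrip("/") or "/"                          # drop trailing slashes
    return urlunsplit(("https", netloc, path, urlencode(params), ""))

print(canonicalize("http://www.example.com/vegetables/cucumbers/?utm_source=ad"))
# -> https://example.com/vegetables/cucumbers
```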
Friendly URLs are valued by both humans and machine algorithms. Googlebot is no exception. Googlebot can get confused trying to understand long URLs with many parameters. And the more "confused" Googlebot is, the more crawling resources are spent on one page.
To avoid wasting your crawl budget, make sure your URLs are user-friendly. User-friendly (and Googlebot) URLs are clear, follow a logical structure, have correct punctuation, and don't contain complex parameters. In other words, your URLs should look like this:
http://example.com/vegetables/cucumbers/pickles
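For instance, a readable slug for the last URL segment can be generated automatically. The sketch below is a simple illustrative Python helper; the title and base URL are placeholders.

```
# Minimal sketch: turn a page title into a clean, crawler-friendly URL slug.
import re

def slugify(title):
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)   # replace anything non-alphanumeric
    return slug.strip("-")

print("http://example.com/vegetables/cucumbers/" + slugify("Pickles & Gherkins!"))
# -> http://example.com/vegetables/cucumbers/pickles-gherkins
```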
Note! Luckily, optimizing your crawl budget isn't as complicated as it sounds. In truth, you only need to worry about it if you have a large (1 million+ pages) or medium-sized (10,000+ pages) website whose content changes frequently (daily or weekly). In other cases, you just need to optimize your site for search properly and fix indexing issues in time.
Conclusion
Google's main crawler, Googlebot, operates on sophisticated algorithms, but you can still "navigate" its behavior to the benefit of your website. Besides, most of the steps for optimizing crawling repeat the standard SEO steps we are all familiar with.