The Quant CLI crawler is a handy tool that lets you easily copy a website and get it into Quant’s static website hosting.
This feature has been available via the Quant CLI since last year, but we recently made quite a few updates (see the April 2021 changelog for more details). For a quick-and-dirty usage example, check out How to freeze and take static archive of your old site. In this post, we’ll go into more detail about using the Quant CLI crawler.
Web crawlers are useful for a number of reasons. Our What is a web crawler? blog post breaks down several of them, but we generally find that people are most interested in the following:
Archival: If you’re running a CMS or other backend web technology that you no longer want to support and don’t need content updates for, you can archive your site, host the static archived version, and decommission the old tech.
Failover: If you want to be able to switch to a static copy of your site when it goes down, you can crawl your site regularly, say daily, and keep a version on hand. The static content will be slightly stale, but it’s better than having no site!
Backups: This is much the same as keeping failover copies. One nice thing about taking regular snapshots is that you can see the content revisions in Quant.
If you are a developer, then command-line tools are your best friends. For non-developers, we will be adding a way to run the crawler in the dashboard at some point. Let us know if this is something you are interested in.
Quant CLI can be a very useful tool when working with Quant. Here are some resources to learn more:
The Quant crawler leverages simplecrawler, which is a great web crawler! Once you have the CLI set up (check out the resources above), it’s easy to use Quant’s crawler. The simplest case is outlined in the "Use the quant-cli tool to crawl your website" section of How to freeze and take static archive of your old site. Note that you do need to include http/https in your domain.
quant crawl https://yourdomain.com
Simplecrawler has a ton of configuration options, but we’ve simplified things a lot to support the most common use cases. Let’s touch upon the crawler parameters you can use:
The concurrency option maps to simplecrawler’s maxConcurrency and defaults to 4. If you want more requests to run at the same time, you can increase this. This can be useful if you have a large site and want it crawled faster, but higher concurrency puts more load on your site, so be careful with this setting. Example:
quant crawl --concurrency 10 https://yourdomain.com
Cookies are a normal part of web browsing, but you often don’t need them for crawling. They’re frequently used for tracking and analytics and can slow things down, so the cookies option defaults to false. If your website requires cookies to display its content correctly, you can enable this option.
quant crawl --cookies true https://yourdomain.com
The extra-domains option adds domains to simplecrawler’s domainWhitelist along with the main domain. If you want to include additional domains when crawling, you can do so with this option. This can be helpful if you have one or more associated websites, for example, a main site (www.example.com) and a blog site (blog.example.com) that you want to combine. You provide a comma-delimited list of domains without http/https.
quant crawl --extra-domains "blog.yourdomain.com,shop.yourdomain.com" https://yourdomain.com
The interval option controls the delay between requests, so you aren’t causing undue stress on the website you are crawling. The default is 200ms, but you can tweak this to your needs; knowing your website’s size, traffic, and load profile will help you decide whether to change it. For example, to increase the interval to 1 second (1000ms), you could use the following.
quant crawl --interval 1000 https://yourdomain.com
The no-interaction option is pretty self-explanatory and defaults to false. If you don’t want to be prompted while the crawler is running, you can enable this option. This is helpful when running the crawler from cron or in some other automated fashion.
quant crawl --no-interaction true https://yourdomain.com
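For example, if you wanted a nightly crawl, a crontab entry along these lines would do it. The schedule and domain here are just placeholders, and you’ll need the quant CLI available on cron’s PATH:

0 2 * * * quant crawl --no-interaction true https://yourdomain.com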
If you want the host domains stripped out of the URLs in the content, you can set the rewrite option. The paths will be relative, e.g. https://yourdomain.com/some/content/link => /some/content/link
quant crawl --rewrite true https://yourdomain.com
The robots option maps to simplecrawler’s respectRobotsTxt. Sometimes the robots.txt file or a page’s <meta> tags might prohibit crawling, which can prevent content or assets from being copied. This is why the crawler’s default for this setting is false. If you want to honor the robots.txt file and <meta> configuration, you can change this to true.
quant crawl --robots true https://yourdomain.com
The seed-notfound option saves 404 pages when resources aren’t found during the crawl. You can then check the Quant dashboard for these to see what content you need to fix.
quant crawl --seed-notfound true https://yourdomain.com
The size option maps to simplecrawler’s maxResourceSize and defaults to 256MB. If you want to increase the buffer size, you can override this with a value in bytes. For example, for 512MB, you could use:
quant crawl --size 536870912 https://yourdomain.com
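If you’d rather not work out the byte value by hand, a quick bit of shell arithmetic does it (512MB shown here purely as an example):

echo $((512 * 1024 * 1024))   # prints 536870912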
Sometimes you just want to start over! We do this all the time, restarting a computer or browser when there’s a glitch. You can also restart your crawl from scratch using the skip-resume option.
quant crawl --skip-resume true https://yourdomain.com
If you want to crawl just a subset of URLs from your website, you can create a JSON file with your list and feed that into the crawler with the urls-file option (there’s a sketch of what such a file might look like after the example below). This is nice for testing out the crawler, as well as for situations where you don’t need the whole website synced. Note that you have to use skip-resume for this option to take effect. For example,
quant crawl --skip-resume true --urls-file myurls.json https://yourdomain.com
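We haven’t covered the exact file format here, so treat the following as an illustration only (a plain JSON array of URLs, written out with a shell heredoc); check the Quant CLI documentation for the format your version expects:

cat > myurls.json <<'EOF'
[
  "https://yourdomain.com/",
  "https://yourdomain.com/about",
  "https://yourdomain.com/blog"
]
EOF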