Crazy Dev Blog

Technology Posts, Automation, Programming, Web Scraping, Coding tutorials and more!

Increase your scraping speed with Go and Colly! — The Basics

Let’s scrape Amazon to see how fast this can be. But first, let’s learn about the basics.

Introduction

In this article, we’ll explore the power of Go (Golang) by building a scraper that collects basic data about products on Amazon.
The scraper’s goal is to fetch an Amazon results page, loop through the products, parse the data we need, go to the next page, write the results to a CSV file and… repeat.

To do this, we’ll use Colly, a scraping framework written in Go. It’s lightweight but offers many features out of the box, such as parallel scraping and proxy switching.

This article covers the basics of the Colly framework. In the next one, we’ll go into more detail and improve and optimize the code we write today.

Let’s inspect Amazon to determine the CSS selectors

Amazon’s results page
Here is what Amazon’s results page looks like.

From this page, we would like to extract the name, the rating (stars), and the price of each product appearing on the results page.

All the information we need for each product is in this area:

Amazon's result product

With the help of the Google Chrome Inspector, we can determine that the CSS selector for each of these product areas is “div.a-section.a-spacing-medium”. Now, we just have to determine the selectors for the name, the stars, and the price inside that area. All of them can be found with the inspector. Here are the results:

Name: span.a-size-medium.a-color-base.a-text-normal
Stars: span.a-icon-alt
Price: span.a-price > span.a-offscreen

These selectors are not perfect: as we’ll see later, some edge cases require us to format the values we extract. But for now, they are good enough to work with.

The selector for the results list itself is “div.s-result-list.s-search-results.sg-row”. So the logic of our scraper will be: “For each product in the results list, fetch its name, stars, and price.”

We’ll handle pagination in the next article. For now, we can just note that the URL of the results page looks like this:

https://www.amazon.com/s?k={search-term}&ref=nb_sb_noss_1

In our case:

https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1
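The URL pattern above can also be assembled in code rather than typed by hand. Here is a minimal sketch using only the standard library. The function name searchURL is hypothetical (it is not part of the original project), and the ref value is simply copied from the URL shown above.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchURL builds a results-page URL for the given search term.
// The function name is hypothetical; the "ref" value is copied from
// the example URL in the article.
func searchURL(term string) string {
	q := url.Values{}
	q.Set("k", term) // Encode escapes the space as "+"
	q.Set("ref", "nb_sb_noss_1")
	// Encode emits the parameters sorted by key, so "k" comes before "ref".
	return "https://www.amazon.com/s?" + q.Encode()
}

func main() {
	fmt.Println(searchURL("nintendo switch"))
	// Prints: https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1
}
```

Building the query with url.Values means search terms containing spaces or special characters are escaped for us.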

It is now time to implement what we found out in Go with the help of Colly.

Go & Colly implementation

Let’s create our Collector!

In Colly, you first need to create a Collector. A Collector gives you access to methods that register callback functions to be triggered when a certain event happens. To create a Collector, we just need the following code:

package main

import "github.com/gocolly/colly"

func main() {
	c := colly.NewCollector()
	_ = c // Go rejects unused variables; remove this line once c is used
}

You can find the list of methods that accept a callback function in the Colly documentation.

To give it a try, let’s use the OnRequest method. It takes a function as an argument, and that function is called before every request. We can implement it this way:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})
}

The OnRequest callback is triggered before every request. In our case, it prints the URL we’re about to visit to the console.

If you try to run our program right now, it will, unfortunately, start and stop instantly. The reason is simple: we haven’t given it a URL to visit. For this, you just have to use the Visit method of our Collector.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1")
}

Now, if you run this code with:

go run main.go

You should get the following result in your console:

Visiting https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1

Time to parse that HTML!

Now that we know how to request Amazon’s results page, let’s do something with the HTML we get.

If we look at the methods that our Collector provides, OnHTML is probably the one we need. It takes a selector as its first argument and a callback function as its second. We can use the results list selector we determined previously as the first argument.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnHTML("div.s-result-list.s-search-results.sg-row", func(e *colly.HTMLElement) {
		// ...
	})

	c.Visit("https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1")
}

Notice that the callback function gives us access to an HTMLElement. This is the element matched by the selector we passed as the first argument.

We will use the ForEach method provided by the HTMLElement type to loop through the products in the search results list.

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnHTML("div.s-result-list.s-search-results.sg-row", func(e *colly.HTMLElement) {
		e.ForEach("div.a-section.a-spacing-medium", func(_ int, e *colly.HTMLElement) {
			// ...
		})
	})

	c.Visit("https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1")
}

The callback function passed to the ForEach method gives us access to each product, one by one. From there, we can read the values we want using the CSS selectors we found in the first part. For example, the product’s name can be accessed like this:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnHTML("div.s-result-list.s-search-results.sg-row", func(e *colly.HTMLElement) {
		e.ForEach("div.a-section.a-spacing-medium", func(_ int, e *colly.HTMLElement) {
			var productName string

			productName = e.ChildText("span.a-size-medium.a-color-base.a-text-normal")
			if productName == "" {
				// If we can't get any name, we return and go directly to the next element
				return
			}

			fmt.Printf("Product Name: %s \n", productName)
		})
	})

	c.Visit("https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1")
}

We print every product name we get. If you run the code now, the result should look like this:

Product Name: Super Smash Bros. Ultimate 
Product Name: New Super Mario Bros. U Deluxe - Nintendo Switch 
Product Name: Accessories kit for Nintendo Switch, VOKOO Steering 
Product Name: AmazonBasics Car Charger for Nintendo Switch
...

We could use the same method for the stars and the prices. But as I mentioned in the first part of the article, you’ll probably run into some formatting issues. For example, instead of 299.00 for the price, you might get something like $299.00$480.00. This happens because the CSS selector returns several prices for a single product, for example when the product is on sale, as with this one:

Amazon's result double price

As for the stars, the selector we provided returns something like “4.5 out of 5 stars”. From this string, we want to extract the first three characters.

To fix our price and star problems, I created two small helper functions that format the results the way we want. I won’t go through them in detail since that would be outside the scope of this article. But here is the code:

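package utils

import "regexp"

// FormatPrice replaces *price with the first dollar amount it contains
// (without the "$" sign), or with "Unknown" if no amount is found.
func FormatPrice(price *string) {
	r := regexp.MustCompile(`\$(\d+(\.\d+)?).*$`)

	newPrices := r.FindStringSubmatch(*price)

	if len(newPrices) > 1 {
		*price = newPrices[1]
	} else {
		*price = "Unknown"
	}
}

// FormatStars keeps the first three characters of *stars (e.g. "4.5"),
// or sets it to "Unknown" if the string is shorter than that.
func FormatStars(stars *string) {
	if len(*stars) >= 3 {
		*stars = (*stars)[0:3]
	} else {
		*stars = "Unknown"
	}
}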
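One limitation worth knowing: the price pattern above stops at the first character that is not a digit or a decimal point, so a price written with a thousands separator, such as $1,299.00 (a made-up input, not taken from the results), would come out as 1. As a variation, and not the approach used in the rest of this article, here is a sketch of two value-returning helpers that tolerate commas and parse the rating as a number. The names parsePrice and parseStars are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// firstPrice matches the first dollar amount, allowing "," separators.
var firstPrice = regexp.MustCompile(`\$(\d[\d,]*(?:\.\d+)?)`)

// parsePrice returns the first price found in s, without "$" or commas.
// The boolean is false when s contains no price.
func parsePrice(s string) (string, bool) {
	m := firstPrice.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return strings.ReplaceAll(m[1], ",", ""), true
}

// parseStars reads the leading number of a string such as
// "4.5 out of 5 stars". The boolean is false when there is no number.
func parseStars(s string) (float64, bool) {
	fields := strings.Fields(s)
	if len(fields) == 0 {
		return 0, false
	}
	v, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, in := range []string{"$299.00$480.00", "$1,299.00", ""} {
		price, ok := parsePrice(in)
		fmt.Printf("price %q -> %q (found: %t)\n", in, price, ok)
	}
	for _, in := range []string{"4.5 out of 5 stars", ""} {
		stars, ok := parseStars(in)
		fmt.Printf("stars %q -> %v (found: %t)\n", in, stars, ok)
	}
}
```

Returning a value plus a boolean leaves it to the caller to decide what to print for a missing field, instead of hard-coding “Unknown” inside the helper.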

Here is what our main.go looks like once we apply those two functions:

package main

import (
	"fmt"

	"github.com/gocolly/colly"
	"github.com/mottet-dev/medium-go-colly-basics/utils"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.OnHTML("div.s-result-list.s-search-results.sg-row", func(e *colly.HTMLElement) {
		e.ForEach("div.a-section.a-spacing-medium", func(_ int, e *colly.HTMLElement) {
			var productName, stars, price string

			productName = e.ChildText("span.a-size-medium.a-color-base.a-text-normal")
			if productName == "" {
				// If we can't get any name, we return and go directly to the next element
				return
			}

			stars = e.ChildText("span.a-icon-alt")
			utils.FormatStars(&stars)

			price = e.ChildText("span.a-price > span.a-offscreen")
			utils.FormatPrice(&price)

			fmt.Printf("Product Name: %s \nStars: %s \nPrice: %s \n", productName, stars, price)
		})
	})

	c.Visit("https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1")
}

If you run the program now, the results would look like this:

Product Name: Nintendo Switch - Gray Joy-Con 
Stars: 4.5 
Price: Unknown 
Product Name: Nintendo Switch Console w/ Mario Kart 8 Deluxe 
Stars: 4.7 
Price: 394.58 
Product Name: Lego Star Wars  Skywalker Saga - PlayStation 4 Standard Edition 
Stars: Unknown 
Price: 59.99

Conclusion

In this article, we saw the basics of Go and Colly by fetching data from Amazon. You can clone the full project from the repository used in the import path above (github.com/mottet-dev/medium-go-colly-basics). There are still a lot of things that can be improved, such as handling pagination, using different User-Agents, making concurrent requests, and more. Those topics will be covered in the next article. I’ll post the link here once it is released.

I hope you enjoyed this article even though I’m not using Python. I chose Go because I see good potential for web scraping in this language, but there isn’t a lot of documentation about it yet.

One more thing: I’ve been using Go for about a year, so I’m not an expert yet. If you see things I could improve, don’t hesitate to let me know. Thank you for reading my article!