{"pageProps":{"posts":[{"id":"5ec5349193fe6529edf4fea0","uuid":"cab6fa14-ffc1-473b-82d9-c7855d9d7918","title":"Go-Funk: Utility functions in your Go code","slug":"go-funk","html":"

Introduction

If you're coming from languages like JavaScript or Python, you might be in for a few surprises when programming in Go. You'll quickly notice that functions such as filter, map or reduce are not part of this ecosystem. The fact that you don't have a built-in function to check whether an element is part of a slice is even more shocking. Where you'd simply write [1, 2, 3, 4, 5].includes(5) in JavaScript or 5 in [1, 2, 3, 4, 5] in Python, you'll discover that it's a whole other story in Go.

The first reflex, of course, is to Google the classic \"golang check if element in slice\", hoping to find a built-in way of doing such operations. The first links on the results page will quickly put an end to your expectations: you'll learn that Go doesn't have a built-in way to handle those functions.

Fortunately, there is a small library called Go-Funk. Go-funk contains various helper functions such as Filter, Contains, IndexOf, etc. I will cover a few of them in this article and show you how to avoid a few traps.

The issue with this kind of helper is that it's extremely difficult to write a single generic function that handles all types while keeping type safety. Let's take the Contains function of Go-Funk as an example. This function works with all possible types. Unfortunately, the compiler cannot check the types of the values passed as parameters. See the function's signature for yourself:
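At the time of writing, the signature in the go-funk package looks roughly like this (both parameters are empty interfaces, so nothing is checked at compile time):

// Contains reports whether elem is present in the in collection.
func Contains(in interface{}, elem interface{}) bool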

To work around this issue, Go-Funk implements specific helper functions for the standard Go types. For example, the function ContainsInt exists specifically to handle the case where we want to check if an int is in a slice of int. We'll see later that it can be trickier with custom types.
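For instance (a quick sketch, assuming go-funk is imported as funk from github.com/thoas/go-funk):

// Compiles only if the arguments really are a []int and an int.
ok := funk.ContainsInt([]int{1, 2, 3, 4, 5}, 5) // ok == true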

Example

Map

The Map function iterates through the elements of a slice (or a map) and transforms them. It returns a new collection.
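A rough reconstruction of the kind of snippet this refers to (a sketch, assuming we simply add 1 to each element):

package main

import (
    "fmt"

    funk "github.com/thoas/go-funk"
)

func main() {
    slice := []int{1, 2, 3, 4, 5}
    fmt.Println(slice)

    // Map returns an interface{}, so the static type of newSlice is lost here.
    newSlice := funk.Map(slice, func(x int) int {
        return x + 1
    })
    fmt.Println(newSlice)
}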

Here is the result if you run this code:

[1 2 3 4 5]\n[2 3 4 5 6]

The result is what we could expect from this code. But if you have a closer look, you'll see an issue with the type of the newSlice variable: it is interface{}. To fix this problem, we'll make use of a type assertion. To do this, you just have to add .([]int) right after funk.Map(...). This way, we're telling the Go compiler: \"You can't determine the type of this value, but I assure you it's a slice of int, so could you please convert it?\".
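Concretely, the mapping call from the sketch above becomes:

// The assertion gives newSlice the static type []int.
// If the asserted type were wrong, the program would panic at runtime.
newSlice := funk.Map(slice, func(x int) int {
    return x + 1
}).([]int)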

The only issue with type assertions is that the errors aren't caught at compile time but only at runtime. So you have to be extra careful when using this method.

Filter

Filter takes a slice and a callback function as arguments. The callback function returns a Boolean. Filter passes the slice's elements one by one to the callback function: if it returns true, the element is included in the new slice.
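A sketch of the snippet the output below refers to (assuming the same imports as in the Map example):

slice := []int{1, 2, 3, 4, 5}
fmt.Println(slice)

// Keep every number except 2; the result has to be asserted back to []int.
filtered := funk.Filter(slice, func(x int) bool {
    return x != 2
}).([]int)
fmt.Println(filtered)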

In this situation, we iterate through the numbers and say: \"take all of them except if the number is equal to 2\". Here is the result:

[1 2 3 4 5]\n[1 3 4 5]
Note: In that situation again, we had to make use of a type assertion. This will be the case in most of the examples.

Reduce

The Reduce function takes the following arguments: a slice, a callback function, and an initial value for the accumulator. In Go-Funk, this function returns a float. Here is how you'd use it:
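A minimal sketch of such a call, assuming we simply sum the elements starting from an accumulator of 0 (the callback shape follows the go-funk README):

slice := []int{1, 2, 3, 4, 5}
fmt.Println(slice)

// Add every element to the accumulator.
sum := funk.Reduce(slice, func(acc, elem int) int {
    return acc + elem
}, 0)
fmt.Println(sum)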

As you can see, the function iterates through the slice element by element. The callback function adds the current element to the accumulator. Result:

[1 2 3 4 5]\n15

With the Reduce function, the type assertion is not necessary. If you have a look at the function signature, you'll notice that it will always return a float.

Conclusion

This was only an overview of the functions you can find in the Go-Funk package. Even though they are quite practical, I'd recommend using them only for prototyping: the fact that they can't handle typing properly (especially with custom types) can make your code unsafe in a production environment.

If you want the same functionality in your production code, the only solution is to write your own functions for your custom types. You could still use the Go-Funk type-specific functions for the standard Go types. Even if it’s not the most convenient approach at first, it definitely pays off in the end, since the compiler will be able to catch type-related bugs at compile time.

As always, thank you for reading this article. I hope you enjoyed it!

Stay in touch if you want to get more articles about Go! If you're interested in Web Scraping, don't worry: new articles will come up very soon.

","comment_id":"5d0bdce3e648730fadb5f4bb","feature_image":"http://161.35.123.88/content/images/2020/05/funk-title.jpg","featured":false,"visibility":"public","send_email_when_published":false,"created_at":"2019-06-20T21:22:11.000+02:00","updated_at":"2020-05-20T15:51:16.000+02:00","published_at":"2019-06-23T18:48:26.000+02:00","custom_excerpt":"Make your life easier with this library of helper functions.","codeinjection_head":null,"codeinjection_foot":null,"custom_template":null,"canonical_url":null,"url":"http://161.35.123.88/go-funk/","excerpt":"Make your life easier with this library of helper functions.","reading_time":3,"og_image":null,"og_title":null,"og_description":null,"twitter_image":null,"twitter_title":null,"twitter_description":null,"meta_title":null,"meta_description":"Make your life easier with this library of helper functions.","email_subject":null},{"id":"5ec5349193fe6529edf4fe9c","uuid":"eb4cc696-09ba-4db0-922d-0e7d82803532","title":"Increase your scraping speed with Go and Colly! — Advanced Part","slug":"increase-your-scraping-speed-with-go-and-colly----advanced-part","html":"

Let’s unleash the power of Go and Colly to see how we can scrape Amazon’s product list.

\"Post

Introduction

This post is the follow-up to my previous article. If you haven’t read it yet, I’d recommend having a look at it so you have a better understanding of what I’m talking about here, and it will be easier for you to code along.

Increase your scraping speed with Go and Colly! — The Basics
Let’s scrape Amazon to see how fast this can be. But first, let’s learn about the basics.

In this article, I’ll show you how to improve the project we started by adding functionality such as random User-Agents, a proxy switcher, pagination handling, random delays between requests, and parallel scraping.

The goal of these methods is, first, to speed up the harvesting of the information we need. Second, we also need to avoid getting blocked by the platform we’re extracting data from. Some websites will block you if they notice you’re sending too many requests to them. I want to make clear that our goal here is not to flood them with requests, but to avoid getting blocked while extracting the data we need at a reasonable speed.

The Implementation

Randomize the User-Agent

The “User-Agent” is like the identifier of your browser and operating system. This information is usually sent with every request you make to a website. To know what’s your current User-Agent, you can write “What’s my user agent” in Google and you’ll find out the answer.

Why do we need to randomize it?
The User-Agent needs to be randomized to prevent your script from getting detected by the source we’re getting data from (Amazon in our case). For instance, if the people working at Amazon notice that a lot of requests contain the same User-Agent string, they could block you based on this information.

The solution
Lucky for us, Colly provides a package called extensions. As you can see in the documentation, it contains a function called RandomUserAgent . It simply takes our Collector as a parameter.
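A minimal sketch of the call, assuming the collector c from the previous article and the package github.com/gocolly/colly/extensions imported:

// Pick a new random User-Agent string for every request made by this collector.
extensions.RandomUserAgent(c)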

And that’s it! With this line of code, Colly will now generate a new User-Agent string before every request.

You can also specify the following code in the OnRequest method:
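For instance, alongside the "Visiting" log from the previous article (how you print the header is up to you, this is just a sketch):

c.OnRequest(func(r *colly.Request) {
    fmt.Println("User-Agent:", r.Headers.Get("User-Agent"))
})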

That way, our program will print the User-Agent string it uses before sending each request, and we can make sure the method provided by the extensions package works.

Pagination

For now, we’re only scraping the first results page of Amazon. It would be great to get a more complete data set, right?

If we take a look at the results page, we notice that the pagination can be changed via the URL:

https://www.amazon.com/s?k=nintendo+switch&page=1

We also observe that Amazon doesn’t allow us to go over page 20:

\"Amazon's

With that information, we can determine that all the result pages can be accessed by modifying the c.Visit(url) call in our current code.
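For example, a sketch hard-coding the Nintendo Switch search from earlier:

for page := 1; page <= 20; page++ {
    c.Visit(fmt.Sprintf("https://www.amazon.com/s?k=nintendo+switch&page=%d", page))
}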

Thanks to a for loop, we’re now sending requests to all the pages from 1 to 20. This allows us to get more products’ information.

Parallelism

If you try to run the script at this point, you’ll notice that it isn’t very fast: it takes around 30 seconds to fetch the 20 pages. Luckily for us, Colly provides parallelism out of the box. You just have to pass an option to the NewCollector function when you create the Collector .
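Roughly like this (a sketch of the async setup; the c.Wait() call is explained just below):

c := colly.NewCollector(
    colly.Async(true),
)

for page := 1; page <= 20; page++ {
    c.Visit(fmt.Sprintf("https://www.amazon.com/s?k=nintendo+switch&page=%d", page))
}

// Block until every in-flight request has finished.
c.Wait()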

This option basically tells Colly: “You don’t have to wait for a request to finish before starting the next ones”. The c.Wait() at the end is there to make the program wait until all the concurrent requests are done.

If you run this piece of code now, you’ll see that it is much faster than our first try. The console output will be a bit messy due to the fact that multiple requests print data at the same time, but the whole process should take approximately 1 second to be done. We see here that it is quite an improvement compared to the 30 seconds of our first try!

Random delays between every request

In order to avoid getting blocked, and also to make your bot look more like a human, you can set random delays between requests. Colly provides a Limit method that allows you to specify a set of rules.
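A sketch of such a rule (the domain glob and the parallelism value here are assumptions, adjust them to your needs):

c.Limit(&colly.LimitRule{
    DomainGlob:  "*amazon.*",
    Parallelism: 4,
    // Wait a random duration between 0 and 2 seconds before each request.
    RandomDelay: 2 * time.Second,
})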

Here you can notice that I added a random delay of 2 seconds. It means that between requests, a random delay of at most 2 seconds will be added.

As you can see, I also added a Parallelism rule. This determines the maximum number of requests that will be executed at the same time.

If we run our program now, we can see that it runs a bit slower than previously. This is due to the rules we just set. We need to define a balance between the scraping speed we need and the chances of getting blocked by the target website.

Proxy Switcher

Why do we need a proxy switcher?
One of the main things that can get us blocked while scraping a website is our IP address. In our case, if Amazon notices that a large number of requests are sent from the same IP, they can simply block the address and we won’t be able to scrape them for a while. Therefore, we need a way to “hide” the origin of the requests.

The solution
We will use proxies to do this. We will send our requests to a proxy instead of Amazon directly, and the proxy will take care of passing our requests on to the target website. That way, in Amazon’s logs, the requests will appear to come from the proxy’s IP address and not ours.

\"Proxy
Wikipedia - Proxy Illustration

Of course, the idea is to use multiple proxies in order to spread our requests across them. You can easily find lists of free proxies by searching Google. The issue with those is that they can be extremely slow. If speed matters to you, a list of private or semi-private proxies could be a better choice. Here is the implementation with Colly (and free proxies I found online):
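A sketch of that implementation, assuming the package github.com/gocolly/colly/proxy is imported; the addresses below are placeholders, not the ones from the original list:

proxySwitcher, err := proxy.RoundRobinProxySwitcher(
    "http://12.34.56.78:8080",
    "http://98.76.54.32:3128",
)
if err != nil {
    log.Fatal(err)
}
c.SetProxyFunc(proxySwitcher)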

It’s possible that the proxies I used are not working anymore at the time you’re reading this article. Feel free to modify them.

We’re making use of the proxy package from Colly. This package contains the function RoundRobinProxySwitcher . It takes strings containing the protocol, the address and the port of each proxy as arguments. We then pass the proxySwitcher to the Collector with the help of the SetProxyFunc method. After this is done, Colly will send the requests through the proxies, selecting a different proxy for each new request.

Write the result in a CSV file

Now that we have a proper way to fetch the data from Amazon, we just have to implement a way to store it. In this part, I’ll show you how to write the data to a CSV file. Of course, if you want to store the data differently, feel free to do so; Colly even has some built-in storage implementations. Let’s start by modifying the beginning of our main function.
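A sketch of that setup, assuming the standard library packages os and encoding/csv are imported (the column titles are assumptions based on the data we scrape):

fileName := "amazon_products.csv"
file, err := os.Create(fileName)
if err != nil {
    log.Fatalf("Could not create %s: %v", fileName, err)
}
defer file.Close()

writer := csv.NewWriter(file)
defer writer.Flush()

// First row of the CSV file: the column titles.
writer.Write([]string{"Name", "Stars", "Price"})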

First, we create a file called amazon_products.csv . We then create a writer that will be used to save the data we fetch from Amazon into our file. Finally, we write the first row of the CSV file, defining the column titles.

Then, in the callback function that we pass to the ForEach method, instead of printing the results we get, we write them to the CSV file. Like this:
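Something like the following, reusing the variable names from the previous article (an assumption about how the original snippet was written):

writer.Write([]string{productName, stars, price})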

Here is the result we get once we run the program. You should have a new file in the working folder. If you open it with Excel (or a similar program), here is how it looks:

\"Results

Conclusion

We implemented many options that improved the way our original scraper works. There are still many things that could be done to improve it even further, such as saving the data in a database, handling request errors (such as a 404 when we request a page that doesn’t exist), etc. Don’t hesitate to improve this code or apply it to something other than Amazon, it’s a great exercise!

Disclaimer: Use the knowledge you’ve gained with this article wisely. Don’t send a huge number of requests to a website in a short amount of time. In the best case, they could just block you. In the worst, you could have problems with the law.

Thank you for reading my article. I hope it was useful for you. If you couldn’t follow along with the code, you can find the full project in this Github repository.

Happy scraping!\n\n

","comment_id":"5d09350ee648730fadb5f478","feature_image":"http://161.35.123.88/content/images/2020/05/amazon-scrape-fast-2.jpg","featured":false,"visibility":"public","send_email_when_published":false,"created_at":"2019-06-16T02:00:00.000+02:00","updated_at":"2020-06-07T21:58:37.000+02:00","published_at":"2019-06-16T02:00:00.000+02:00","custom_excerpt":"\nLet’s unleash the power of Go and Colly and see how fast we can scrape Amazon’s product list.\n","codeinjection_head":null,"codeinjection_foot":null,"custom_template":null,"canonical_url":"https://medium.com/@mottet.dev/increase-your-scraping-speed-with-go-and-colly-advanced-part-a38648111ab2","url":"http://161.35.123.88/increase-your-scraping-speed-with-go-and-colly----advanced-part/","excerpt":"\nLet’s unleash the power of Go and Colly and see how fast we can scrape Amazon’s product list.\n","reading_time":6,"og_image":null,"og_title":null,"og_description":null,"twitter_image":null,"twitter_title":null,"twitter_description":null,"meta_title":null,"meta_description":null,"email_subject":null},{"id":"5ec5349193fe6529edf4fe9d","uuid":"f07c1627-2eec-4b3c-8db4-fd63b18b31f9","title":"Increase your scraping speed with Go and Colly! — The Basics","slug":"increase-your-scraping-speed-with-go-and-colly----the-basics","html":"

Let’s scrape Amazon to see how fast this can be. But first, let’s learn about the basics.

\"Post

Introduction

In this article, we’ll explore the power of Go(lang). We’ll see how to create a scraper able to get basic data about products on Amazon.
The goal of this scraper will be to fetch an Amazon results page, loop through the different products, parse the data we need, go to the next page, write the results to a CSV file and… repeat.

In order to do this, we’ll use a library called Colly. Colly is a scraping framework written in Go. It’s lightweight but offers a lot of functionalities out of the box such as parallel scraping, proxy switcher, etc.

This article will cover the basics of the Colly framework. In the next one, we’ll go into more detail and implement improvements and optimizations for the code we’ll be writing today.

Let’s inspect Amazon to determine the CSS selectors

\"Amazon
Here is how Amazon’s result page looks like

From this page, we would like to extract the name, the rating (stars) and the price for each product appearing in the result’s page.

We can notice that all the pieces of information we need for each product are in this area:

\"Amazon's

With the help of the Google Chrome Inspector, we can determine that the CSS selector for those elements is “div.a-section.a-spacing-medium”. Now, we just have to determine the selectors for the name, the stars, and the price. All of those can be found thanks to the inspector. Here are the results:

Name: span.a-size-medium.a-color-base.a-text-normal\nStars: span.a-icon-alt\nPrice: span.a-price > span.a-offscreen

Those selectors are not perfect: we will see later that we’ll encounter some edge cases where we’ll need to format the values we extracted. But for now, we can work with that.

The selector of the results list itself is “div.s-result-list.s-search-results.sg-row”. So the logic for our scraper will be: “For each product in the results list, fetch its name, stars, and price”

We’ll also handle the pagination in another section. For now, we can just see that the URL of the results page looks like this

https://www.amazon.com/s?k={search-term}&ref=nb_sb_noss_1

In our case:

https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1

It is now time to implement what we found out in Go with the help of Colly.

Go & Colly implementation

Let’s create our Collector !

In Colly, you first need to create a Collector. A Collector gives you access to methods that let you trigger callback functions when a certain event happens. To create a Collector, we just need the following code:
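A minimal sketch, assuming colly v1 at github.com/gocolly/colly:

package main

import "github.com/gocolly/colly"

func main() {
    // Our Collector; every callback we register later will be attached to it.
    c := colly.NewCollector()
    _ = c // placeholder until we register callbacks below
}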

You can find the list of the methods which accept a callbacks function here.

To give it a try, let’s use the OnRequest method. This method is called before every request. It takes a function as an argument. We can implement it this way:
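For example (a sketch; the printed prefix matches the output shown further down):

c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
})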

The OnRequest method will be triggered before every request. In our case, it is expected to print the URL we’re visiting to the console.

If you try to run our program right now, it will, unfortunately, start and stop instantly. The reason is simple: we need to provide it a URL to visit. For this, you just have to use the Visit method of our Collector .
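Something like this, at the end of our main function:

c.Visit("https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1")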

Now if you try to run this code with

go run main.go

You should get the following result in your console:

Visiting https://www.amazon.com/s?k=nintendo+switch&ref=nb_sb_noss_1

Time to parse that HTML!

Now that we know how to request Amazon’s results page, let’s do something with the HTML we get.

If we look at the methods that our Collector provides, OnHTML is probably the one we need. It takes a selector as the first argument and a callback function as the second. It is reasonable to assume we can use the results list selector we determined previously as the first argument.
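A sketch using the selector from the first part:

c.OnHTML("div.s-result-list.s-search-results.sg-row", func(e *colly.HTMLElement) {
    // e wraps the HTML matched by the selector above.
})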

We observe that the callback function gives us access to an HTMLElement . This element is the result of what we get thanks to the selector we provided in the first argument.

We will use the ForEach method provided by the type HTMLElement in order to loop through the products in the search result list.

The callback function passed to the ForEach method gives us access to each product one by one. From there, we can simply access the value we want with the CSS selectors we discovered in the first part. For example, the product’s name would be accessed like this:
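Roughly like this, a sketch combining the product selector and the name selector from the first part:

e.ForEach("div.a-section.a-spacing-medium", func(_ int, product *colly.HTMLElement) {
    productName := product.ChildText("span.a-size-medium.a-color-base.a-text-normal")
    fmt.Println("Product Name:", productName)
})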

For every product name we get, we print it. If you run your code now, you’ll get a result that looks like this:

Product Name: Super Smash Bros. Ultimate \nProduct Name: New Super Mario Bros. U Deluxe - Nintendo Switch \nProduct Name: Accessories kit for Nintendo Switch, VOKOO Steering \nProduct Name: AmazonBasics Car Charger for Nintendo Switch\n...

We could use the same method for the stars and the prices. But as I mentioned in the first part of the article, you’ll probably encounter some formatting issues. For example, instead of getting 299.00 for the price, you might get something like $299.00$480.00. This is because the CSS selector we provided returns multiple prices for a single article, for example when it is on sale. Like this product, for instance:

\"Amazon's

As for the stars, the selector we provided returns something like “4.5 out of 5 stars”. Out of this result, our goal is to extract the first three characters.

To fix our price and stars problems, I created two small helper functions that format the results the way we want. I won’t go through them in detail since that would be off-topic for this article. But here is the code:
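The original helpers aren’t reproduced here; a rough reconstruction, assuming the “Unknown” fallback seen in the output below and the standard strings package, could look like this:

// formatStars keeps only the leading rating, e.g. "4.5 out of 5 stars" -> "4.5".
func formatStars(raw string) string {
    if len(raw) >= 3 {
        return raw[:3]
    }
    return "Unknown"
}

// formatPrice keeps only the first price, e.g. "$299.00$480.00" -> "299.00".
func formatPrice(raw string) string {
    parts := strings.Split(strings.TrimSpace(raw), "$")
    if len(parts) > 1 && parts[1] != "" {
        return parts[1]
    }
    return "Unknown"
}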

Here is how our main.go looks when we apply those two functions:
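The relevant part of the OnHTML callback would then look roughly like this (the stars and price selectors come from the first part):

e.ForEach("div.a-section.a-spacing-medium", func(_ int, product *colly.HTMLElement) {
    productName := product.ChildText("span.a-size-medium.a-color-base.a-text-normal")
    stars := formatStars(product.ChildText("span.a-icon-alt"))
    price := formatPrice(product.ChildText("span.a-price > span.a-offscreen"))

    fmt.Println("Product Name:", productName)
    fmt.Println("Stars:", stars)
    fmt.Println("Price:", price)
})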

If you run the program now, the results would look like this:

Product Name: Nintendo Switch - Gray Joy-Con \nStars: 4.5 \nPrice: Unknown \nProduct Name: Nintendo Switch Console w/ Mario Kart 8 Deluxe \nStars: 4.7 \nPrice: 394.58 \nProduct Name: Lego Star Wars  Skywalker Saga - PlayStation 4 Standard Edition \nStars: Unknown \nPrice: 59.99

Conclusion

In this article, we saw how to use the basics of Go and Colly by fetching data from Amazon. You can clone the full project from here. There are still a lot of things that can be improved, such as handling pagination, using different User-Agents, sending concurrent requests, and more. Those topics will be covered in the next article. I’ll post the link here once it’s released.

I hope you enjoyed this article even though I’m not using Python. I chose Go because I saw there is good potential with Web Scraping in this language, but there isn’t a lot of documentation about it yet.

One more thing: I’ve been using Go for about a year, so I’m not an expert yet. If you see things I could improve, don’t hesitate to let me know. Thank you for reading my article!\n\n

","comment_id":"5d09350ee648730fadb5f479","feature_image":"http://161.35.123.88/content/images/2020/05/amazon-scrape-fast.jpg","featured":false,"visibility":"public","send_email_when_published":false,"created_at":"2019-06-10T02:00:00.000+02:00","updated_at":"2020-05-24T00:48:55.000+02:00","published_at":"2019-06-10T02:00:00.000+02:00","custom_excerpt":"Let’s scrape Amazon to see how fast this can be. But first, let’s learn about the basics\n","codeinjection_head":null,"codeinjection_foot":null,"custom_template":null,"canonical_url":"https://medium.com/@mottet.dev/increase-your-scraping-speed-with-go-and-colly-the-basics-41038bc3647e","url":"http://161.35.123.88/increase-your-scraping-speed-with-go-and-colly----the-basics/","excerpt":"Let’s scrape Amazon to see how fast this can be. But first, let’s learn about the basics\n","reading_time":5,"og_image":null,"og_title":null,"og_description":null,"twitter_image":null,"twitter_title":null,"twitter_description":null,"meta_title":null,"meta_description":null,"email_subject":null},{"id":"5ec5349193fe6529edf4fe9e","uuid":"beffe7ef-52c4-4e8a-b143-0cd50a6dba04","title":"Real-time Scraping With Python!","slug":"real-time-scraping-with-python-","html":"

Let’s build a real-time scraper with Python, Flask, Requests, and Beautifulsoup!

\"Post

Introduction

In this article, I will show you how to build a real-time scraper step-by-step. Once the project is done, you’ll be able to pass arguments to the scraper and use it just like you would use a normal API.

This article is similar to a previous one where I talked about Scrapy and Scrapyrt. The difference here is that you can set up the endpoint to behave in a much more precise way.

As an example, we will see how to scrape data from Steam’s search results. I chose this example because the scraping part is quite straightforward, and we’ll be able to focus more on the other aspects of the infrastructure.

Disclaimer: I won’t be spending too much time explaining in details the analysis and the scraping part. I’m assuming you already have some basics with Python, Requests, and Beautifulsoup, and that you know how to inspect a website to extract the CSS Selectors.

Analysis

Let’s first start investigating how the website is working. At the moment I’m writing those lines, the search bar is situated on the top right of the page.

\"Steam
Steam Search Bar

Let’s type something in it and press Enter to observe the behavior of the website.

We are now redirected to the search results page. Here you can see a list of all the games related to your search. In my case, I have the following:

\"Steam
Steam results page for “The Witcher”

If we inspect the page, we notice that each result row is inside a <a> tag with a search_result_row class. The elements that we’re looking for are situated in the following selectors:

gameURL: situated in the href of 'a.search_result_row'\ntitle: text of 'span.title'\nreleaseDate: text of 'div.search_released'\nimgURL: src of 'div.search_capsule img'\nprice: text of 'div.search_price span strike'\ndiscountedPrice: text of 'div.search_price'

Another interesting element is the URL of the page.

\"Steam
Steam Search URL

We can see that the terms we are looking for are provided after the parameter term.

So far, with those elements in hand, we are capable of writing a simple script that fetches the data we need. Here is the example file main.py :
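The original file isn’t reproduced here; a sketch of what it likely contained (the Steam search URL and the dictionary keys follow the selectors listed above):

import requests
from bs4 import BeautifulSoup


def search_steam(term):
    url = 'https://store.steampowered.com/search/?term=' + term
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    games = []
    for row in soup.select('a.search_result_row'):
        price_tag = row.select_one('div.search_price span strike')
        games.append({
            'gameURL': row.get('href'),
            'title': row.select_one('span.title').get_text(strip=True),
            'releaseDate': row.select_one('div.search_released').get_text(strip=True),
            'imgURL': row.select_one('div.search_capsule img').get('src'),
            # Only discounted games have a struck-through original price.
            'price': price_tag.get_text(strip=True) if price_tag else '',
            'discountedPrice': row.select_one('div.search_price').get_text(strip=True),
        })
    return games


if __name__ == '__main__':
    print(search_steam('the witcher'))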

You’ll need to install requests and beautifulsoup4 to be able to run this script. For this, I encourage you to use pipenv, which installs them in a virtual environment created specifically for your project.

pipenv install requests beautifulsoup4\npipenv run python main.py

Conversion to Real-Time Scraper

Before we begin, here is a little schema of the architecture we want to implement.

\"Real-time
The architecture of the real-time scraper
  1. At this stage, the client or frontend (depending on your needs) is making a PUT request containing the search term in the arguments to the HTTP server.
  2. The HTTP server receives the request and processes it to extract the search term.
  3. The server then makes a GET request to the steam store to pass it the search term.
  4. Steam sends back its search results page in an HTML format to the server
  5. At this point, the server receives the HTML, formats it to extract the game's data that we need.
  6. Once processed, the data are sent to the client/frontend in a nicely formatted JSON response.

An HTTP server with Flask and Flask_restful

Flask is a very useful Python framework for quickly creating a web server. Flask_restful is an extension for Flask that allows us to easily develop a REST API.

First, let’s install those two libraries by running the following command:

pipenv install Flask flask-restful

Let’s import those two libraries in main.py
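Along with the pieces of flask_restful we’ll need later (Resource, Api and reqparse):

from flask import Flask
from flask_restful import Api, Resource, reqparse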

You can now create the Flask application and declare it at the beginning of the file after the imports.
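For example:

app = Flask(__name__)
api = Api(app)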

Let’s refactor the scraper we had previously to tell Flask that it should now be part of a resource and be accessible via a PUT request. For that, we need to create a new class called SteamSearch (the name is up to you) that inherits from the Resource class we imported from flask_restful . We then put our code in a method named put to indicate that it can be accessed by this type of request. The final result looks like the following:
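In outline (a sketch; the body is the same scraping logic as in the standalone script above, just moved inside the method):

class SteamSearch(Resource):
    def put(self):
        # ...same scraping code as before, building the games list, ending with:
        return {'games': games}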

At the bottom of the file, we need to tell Flask that the SteamSearch class is part of the API. We also need to specify a route where the resource can be requested. For this, you can use the following code:
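Something like this (the /search route is an assumption, pick whatever path suits you):

api.add_resource(SteamSearch, '/search')

if __name__ == '__main__':
    app.run(debug=True)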

The last two lines are simply there to run the app. The parameter debug=True is there to make our life easier during development by auto-reloading the server when we modify the code. The value needs to be set to False if you want to deploy the server in production!

The last thing we need to do is to handle the argument passed in the PUT request our server receives. This can be achieved with the help of reqparse that we imported from flask_restful .

With this helper, we can define which arguments can be sent in the request body, what their types are, whether they are required, etc. You can add the following code at the very top of the put method.
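For example (the help message is an assumption):

parser = reqparse.RequestParser()
parser.add_argument('term', type=str, required=True, help='Search term is required')
args = parser.parse_args()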

After this step, the search term can be accessed in the put method via args.term . If you need other arguments, you can add as many as you want, following the second line of the example code.

There is one last step that needs to be done before we finish our little project: it is possible that the search term we send to the server contains special characters or whitespace. This might make the GET request to Steam fail. To solve this problem, we need to encode the term we receive with the help of the parser included in the urllib library.

Right before we make the request to the Steam store, we can add those lines to our code.
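A sketch of that encoding step, assuming from urllib import parse sits at the top of the file:

encoded_term = parse.urlencode({'term': args.term})
url = 'https://store.steampowered.com/search/?' + encoded_term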

The first line, as I said previously, will format the term if it contains non-supported characters. The output of this function will be something like term=$valueOfArgsTerm .

We then pass this value to the GET request and we are done!

In the end, the code should look like this:
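Piecing the snippets above together, the whole file might look roughly like this (again a sketch, not the original source):

import requests
from bs4 import BeautifulSoup
from flask import Flask
from flask_restful import Api, Resource, reqparse
from urllib import parse

app = Flask(__name__)
api = Api(app)


class SteamSearch(Resource):
    def put(self):
        parser = reqparse.RequestParser()
        parser.add_argument('term', type=str, required=True, help='Search term is required')
        args = parser.parse_args()

        # Encode the search term and fetch the Steam results page.
        encoded_term = parse.urlencode({'term': args.term})
        url = 'https://store.steampowered.com/search/?' + encoded_term
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')

        games = []
        for row in soup.select('a.search_result_row'):
            price_tag = row.select_one('div.search_price span strike')
            games.append({
                'gameURL': row.get('href'),
                'title': row.select_one('span.title').get_text(strip=True),
                'releaseDate': row.select_one('div.search_released').get_text(strip=True),
                'imgURL': row.select_one('div.search_capsule img').get('src'),
                'price': price_tag.get_text(strip=True) if price_tag else '',
                'discountedPrice': row.select_one('div.search_price').get_text(strip=True),
            })
        return {'games': games}


api.add_resource(SteamSearch, '/search')

if __name__ == '__main__':
    app.run(debug=True)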

You can start your HTTP server with pipenv run python main.py

Let’s try to make a request to our server with Postman. The result will look like this:

\"Postman's
Postman

Conclusion

Thank you for reading this article! If you want to train a bit more on this topic, you can try to make your server able to handle the page number of the steam search results.

\"Hint
Hint: have a look at the URL!

I’ll be soon posting a follow-up article where we will see how we can deploy this live scraper project to the cloud so we can use it in a “real-world” situation. See you soon!\n\n

","comment_id":"5d09350ee648730fadb5f47a","feature_image":"http://161.35.123.88/content/images/2020/05/livescrapingwithpython.jpeg","featured":false,"visibility":"public","send_email_when_published":false,"created_at":"2019-02-11T01:00:00.000+01:00","updated_at":"2020-05-24T00:48:02.000+02:00","published_at":"2019-02-11T01:00:00.000+01:00","custom_excerpt":"\nLet’s build a real-time scraper with Python, Flask, Requests, and Beautifulsoup!\n","codeinjection_head":null,"codeinjection_foot":null,"custom_template":null,"canonical_url":"https://medium.com/@mottet.dev/real-time-scraping-with-python-5ca773ee473d","url":"http://161.35.123.88/real-time-scraping-with-python-/","excerpt":"\nLet’s build a real-time scraper with Python, Flask, Requests, and Beautifulsoup!\n","reading_time":5,"og_image":null,"og_title":null,"og_description":null,"twitter_image":null,"twitter_title":null,"twitter_description":null,"meta_title":null,"meta_description":null,"email_subject":null},{"id":"5ec5349193fe6529edf4fe9b","uuid":"ea3ab486-a411-45ac-b62c-4dc48dc771aa","title":"Let’s create an Instagram bot to show you the power of Selenium!","slug":"let-s-create-an-instagram-bot-to-show-you-the-power-of-selenium-","html":"

You’ll be able to apply what you learn to any web application.

\"Post

Introduction

Here is the definition of Selenium given by their official website:

Selenium automates browsers.

That’s it. This is the most representative definition of what Selenium is. With this library, you’ll be able to control a web browser and interact with any website. It was originally created to run tests on web applications, but it can also be used as a web scraping tool or as a way to create a bot.

In this article, we will see how to create a simple Instagram bot to show you what Selenium is capable of.

Why aren’t we using libraries like Scrapy or Requests to perform the actions required by our bot?

The reason is that Scrapy and Requests don’t perform very well with JavaScript-heavy websites. We use Selenium because of its ability to render a page’s JavaScript just like a normal browser such as Chrome or Firefox.

What are the functionalities we want to implement for our bot?

The goal of this article is to give you an overview of the possibilities given by Selenium, therefore, I won’t be able to show you how to code every action possible by our bot on Instagram, but with the knowledge you’ll acquire reading this article, you will be able to add the missing functionalities on your own. For now, our bot should be capable of the following actions:
- Sign in.
- Follow a user
- Unfollow a user
- Get a user’s followers

The architecture of the script

To keep our code organized and reusable in other projects, we will put our code in a class named InstagramBot. Every action the bot will be capable of doing will be a method.

class InstagramBot():\n   def __init__\n   def signIn\n   def followWithUsername\n   def unfollowWithUsername\n   def getUserFollowers\n   def closeBrowser\n   def __exit__

Let’s get started

First, let’s install Selenium by simply running the command:

pip install selenium

Once it’s done, create a file named main.py in the folder of your choice.

We’ll then need to import the webdriver object from Selenium in our script. This will allow us to control Chrome with our code.

The constructor will take the user’s email and password as arguments. We also create our webdriver in this method and make it accessible to the rest of the class.
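A minimal sketch of that constructor (the language option will be added later in the article):

from selenium import webdriver


class InstagramBot():
    def __init__(self, email, password):
        self.email = email
        self.password = password
        # The browser instance is shared by all the methods of the class.
        self.browser = webdriver.Chrome()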

Note: If you don’t have Chrome installed on your machine or if webdriver.Chrome() throws an error, you need to download ChromeDriver from here. (Choose the one compatible with your operating system.) Then just pass the ChromeDriver’s path as the first parameter of the method. For example, if your OS is Windows and the ChromeDriver is in the same folder as your script: webdriver.Chrome('chromedriver.exe')

Now let’s define the signIn method. Our bot will have to access this URL https://www.instagram.com/accounts/login/ and complete the login form with the email and password initialized in the constructor.

If you inspect the page, you’ll notice that there are only two <input> available. The first one will always take the email and the second one the password.

\"Instagram

This means we can select those two inputs with:
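Something along these lines (the exact selector is an assumption; any selector matching the two login inputs works):

emailInput = self.browser.find_elements_by_css_selector('form input')[0]
passwordInput = self.browser.find_elements_by_css_selector('form input')[1]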

Then we simply have to fill them in with the help of Selenium so the form can be submitted. For that, we will make use of the .send_keys method.
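Like this:

emailInput.send_keys(self.email)
passwordInput.send_keys(self.password)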

Selenium will write the email and the password in the corresponding <input>.

Now the last thing we need to do is to submit the form. We could select the button and simulate a click on it to accomplish that. But there is actually a shorter method: most forms can be submitted by pressing the ENTER key while an input is focused. This means that in our case we will simply tell Selenium to hit the ENTER key after writing in the password field.
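With the Keys helper from Selenium (the import goes at the top of the file):

from selenium.webdriver.common.keys import Keys

passwordInput.send_keys(Keys.ENTER)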

We completed the signIn method!

I took the liberty to add time.sleep(2) at the end of the method. Like this, you’ll have a bit of time to see what’s going on when the script is running.

So far, our code should look something like this:
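Piecing the snippets above together, a rough reconstruction of the class at this point:

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys


class InstagramBot():
    def __init__(self, email, password):
        self.email = email
        self.password = password
        self.browser = webdriver.Chrome()

    def signIn(self):
        self.browser.get('https://www.instagram.com/accounts/login/')
        time.sleep(2)  # give the login form time to render

        emailInput = self.browser.find_elements_by_css_selector('form input')[0]
        passwordInput = self.browser.find_elements_by_css_selector('form input')[1]

        emailInput.send_keys(self.email)
        passwordInput.send_keys(self.password)
        passwordInput.send_keys(Keys.ENTER)
        time.sleep(2)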

You can already test it by adding the following lines at the end of the file (do not add them inside the class!)

bot = InstagramBot('youremail', 'yourpassword')\nbot.signIn()

Let’s open our terminal and run the following command:

python main.py

A new instance of Chrome should open the Instagram login page. The inputs should be completed and after a couple of seconds, you should be redirected to your home page.

Note: Instagram is a complex web application. It is completely possible that after your login, instead of directing you to your home page, Instagram would display a page asking you if you want to download their mobile application. It is also possible that you end up on another page containing another form asking you to confirm your identity. I won’t cover those possibilities in this article to keep it short. But I invite you to implement your own solutions as an exercise.

Let’s follow people

If you want to follow a user on Instagram, the most common way is to go on their page and to click on the “Follow” button.

\"Instagram
Example of an Instagram profile layout.

If we inspect the page, we notice that there are three different buttons on it and the “Follow” one is the first in the list.

\"Instagram

We can conclude that the “Follow” button can be selected with the following code:
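For instance (remember that find_element_by_css_selector returns the first match on the page):

followButton = self.browser.find_element_by_css_selector('button')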

Note: Notice that there are two methods to select elements with a CSS selector:
- .find_element_by_css_selector()
- .find_elements_by_css_selector()
The first one will return the first element corresponding to our search on the page. The second one will return all the elements found on the page in an array.

With that information, let’s start to implement our followWithUsername method.
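A first version might look like this (a sketch; the sleeps just give the pages time to load):

def followWithUsername(self, username):
    self.browser.get('https://www.instagram.com/' + username + '/')
    time.sleep(2)
    followButton = self.browser.find_element_by_css_selector('button')
    followButton.click()
    time.sleep(2)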

The method takes the username of the person we want to follow as an argument. Then we tell Selenium to go on the person’s page, select the “Follow” button and click on it.

There is still an issue with this method: if we are already following someone, Selenium will still go to that person’s page and click on the first button it finds.

\"Instagram

In that case, the first button found is the “Following” button. If we click on it, Instagram will display a modal asking us if we want to unfollow the person. That is not optimal. We can refactor our followWithUsername method by checking that the button text is not equal to “Following” before clicking on it.
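The guarded version of the sketch:

def followWithUsername(self, username):
    self.browser.get('https://www.instagram.com/' + username + '/')
    time.sleep(2)
    followButton = self.browser.find_element_by_css_selector('button')
    if followButton.text != 'Following':
        followButton.click()
        time.sleep(2)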

This concludes our followWithUsername method… or maybe not? If your browser or your operating system’s language is set to something other than English, you might encounter some issues: when you open Instagram, the pages might be displayed in your default language, and the condition followButton.text != 'Following' will then always return true.

To fix this issue, we can configure our webdriver so it always uses English as the default language. That way, Instagram’s interface will always contain the same text. To do that, we will make use of the chrome_options argument of the webdriver.Chrome() method. Here is how our refactored __init__ method looks:
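A sketch of that refactor (the --lang=en flag is the relevant part):

def __init__(self, email, password):
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--lang=en')
    self.email = email
    self.password = password
    self.browser = webdriver.Chrome(chrome_options=chrome_options)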

On the second line, we set up a new variable containing empty ChromeOptions. On the next line, we specify that our language is English. We then just have to pass the chrome_options argument when we initialize webdriver.Chrome() . With that fix, we have the guarantee that the pages we load will always be in English.

Unfollow Method

To implement the unfollowWithUsername method, we can take the followWithUsername method as an example. The beginning is the same: we go to the user’s page and click on the first button.

Except that this time, a modal will open to ask a confirmation.

\"Unfollow

We need to click on the “Unfollow” button to complete the action. In this situation, we will make use of an XPath selector instead of the usual CSS selector. This method makes it easier to look for elements when our selection depends on their text.
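A sketch of the whole method (the XPath matches the confirmation button by its “Unfollow” label):

def unfollowWithUsername(self, username):
    self.browser.get('https://www.instagram.com/' + username + '/')
    time.sleep(2)
    followButton = self.browser.find_element_by_css_selector('button')
    if followButton.text == 'Following':
        followButton.click()
        time.sleep(2)
        unfollowButton = self.browser.find_element_by_xpath('//button[text()=\'Unfollow\']')
        unfollowButton.click()
        time.sleep(2)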

As you can see, this works the same way we’re used to: we select an element with the XPath selector and we simulate a click on it.

You can test the two methods we just created by using this code after the class :

bot = InstagramBot('youremail', 'yourpassword')\n\nbot.signIn()\n\nbot.followWithUsername('therock')\n\nbot.unfollowWithUsername('therock')

This should open Chrome, login to Instagram, follow “The Rock” and unfollow him.

Our last feature: get a list of a user’s followers

Let’s start the implementation of the getUserFollowers method. It will take two arguments: the target’s username and the number of follower links we want to fetch.

To achieve such a thing in a real browser, we would have to navigate to the user’s profile and click on the “x followers” element.

\"Instagram

Instagram then opens a modal with the followers’ list.

\"Followers

The list contains only a dozen users. You can only get more by scrolling down.

You can apply the same steps with Selenium:
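The original method isn’t shown here; a rough reconstruction matching the breakdown that follows (the selectors are assumptions):

def getUserFollowers(self, username, maxFollowers):
    self.browser.get('https://www.instagram.com/' + username + '/')
    time.sleep(2)
    followersLink = self.browser.find_element_by_css_selector('ul li a')
    followersLink.click()
    time.sleep(2)

    # The followers modal is a <div role="dialog"> containing a <ul>.
    followersList = self.browser.find_element_by_css_selector('div[role=\'dialog\'] ul')
    numberOfFollowersInList = len(followersList.find_elements_by_css_selector('li'))

    followersList.click()
    actionChain = webdriver.ActionChains(self.browser)
    while numberOfFollowersInList < maxFollowers:
        # Press SPACE to scroll the modal and load more followers.
        actionChain.key_down(Keys.SPACE).key_up(Keys.SPACE).perform()
        numberOfFollowersInList = len(followersList.find_elements_by_css_selector('li'))
        print(numberOfFollowersInList)

    followers = []
    for user in followersList.find_elements_by_css_selector('li a'):
        if len(followers) == maxFollowers:
            break
        followers.append(user.get_attribute('href'))
        print(user.get_attribute('href'))
    return followers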

Let’s break down the code to understand what’s happening:

We navigate to the user’s profile page, locate the “x followers” button and simulate a click on it. The modal opens at this moment.

We select the modal on the page and count how many followers are in the list. To select those elements, I use the same strategies that I applied in the previous methods. There is just a small difference with the part div[role=\\'dialog\\'] ul . We tell Selenium to select the <ul> inside a <div> with a “role” attribute equal to “dialog”. We’re using the backslash so that Python doesn’t think the string ends there. The time.sleep() is required; otherwise, our script would try to select the element before it renders and throw an error.

The click on the modal ( followersList.click() ) will “focus” it and allow us to use the SPACE key to scroll down.

In this part, we define an actionChain. ActionChains basically allow us to execute a list of actions (press a key, move the mouse, etc.) in a precise order. Here we say to our script: as long as the number of followers in the list is lower than the number required, press SPACE. After each press, we refresh the number of users we have in the list and print it. (The print is not necessary here. It’s just a good way to have a visual check of the bot’s progress.).

Note: .perform() is added at the end of the actions so the events are fired in the order they are queued up.

The last part of the method is quite straightforward. Once the users’ list is larger than the required number, we loop through the whole list, extract the profile’s link, append it to a new list and return the full list once we reach the required number. Again, the print is not necessary and it is just here to check the progress of the script.

This ends our getUserFollowers method. Let’s test it by adding those lines after the class declaration:

print(bot.getUserFollowers('therock', 50))

This should print the list of 50 followers of The Rock.

Clean up!

We just have to add two methods to clean up our script after the execution and destroy the browser’s instance that we used.
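A sketch of those two methods:

def closeBrowser(self):
    self.browser.close()

def __exit__(self, exc_type, exc_value, traceback):
    self.closeBrowser()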

This part just makes sure that self.browser.close() is called when the script stops running.

Here is the full code in case you need it:

Conclusion

You saw the main features of Selenium. You now have the tools to create an automation system on any website. As an exercise, you can improve and add features to the code we just wrote.

Before developing your own bot, it is always good to check whether the website you want to automate offers an official API you could interact with instead. It can save you a lot of time and it’s usually simpler to use.

Disclaimer: don’t use your automation system to flood a website with requests. In the case of our Instagram bot, for example, don’t use it to follow hundreds of users per minute. Your account will definitely get banned if you do such a thing.

Thank you for reading this article. As always, if you have questions, you can reach me on Medium or on Twitter.

I also wanted to thank you for all the support I received for my first article. This was unexpected and made me very happy! I hope you’ll enjoy this article as much as you did with the previous one!\n\n

","comment_id":"5d09350ee648730fadb5f477","feature_image":"http://161.35.123.88/content/images/2020/05/instagram-selenium.jpeg","featured":true,"visibility":"public","send_email_when_published":false,"created_at":"2018-09-11T02:00:00.000+02:00","updated_at":"2020-05-24T00:46:09.000+02:00","published_at":"2018-09-11T02:00:00.000+02:00","custom_excerpt":"You’ll be able to apply what you learn to any web application.\n","codeinjection_head":null,"codeinjection_foot":null,"custom_template":null,"canonical_url":"https://medium.com/@mottet.dev/lets-create-an-instagram-bot-to-show-you-the-power-of-selenium-349d7a6744f7","url":"http://161.35.123.88/let-s-create-an-instagram-bot-to-show-you-the-power-of-selenium-/","excerpt":"You’ll be able to apply what you learn to any web application.\n","reading_time":9,"og_image":null,"og_title":null,"og_description":null,"twitter_image":null,"twitter_title":null,"twitter_description":null,"meta_title":null,"meta_description":null,"email_subject":null},{"id":"5ec5349193fe6529edf4fe9f","uuid":"98ed441b-f86e-40af-9e07-e684981f4fb6","title":"Scrapy and Scrapyrt: how to create your own API from (almost) any website","slug":"scrapy-and-scrapyrt--how-to-create-your-own-api-from--almost--any-website","html":"
\"Post

Introduction

Scrapy is a free and open-source web crawling framework written in Python. It allows you to send requests to websites and to parse the HTML code that you receive as response.

With Scrapyrt (Scrapy realtime), you can create an HTTP server that controls Scrapy through HTTP requests. The response sent by the server is JSON-formatted data containing the data scraped by Scrapy.

It basically means that with the combination of these two tools, you can create an entire API without even having a database! I will show you how to achieve this.

Set up Scrapy and create your spider

If you don’t have Scrapy installed on your machine yet, run the following command (I will assume you have Python installed on your computer):

pip install scrapy

It will install Scrapy globally on your machine. You can also do it on a virtual environment if you prefer.

Once the installation completed, you can start a Scrapy project by running:

scrapy startproject <project_name>

In my case (and if you want to follow along the article), I’ll do

scrapy startproject coinmarketcap

We will scrape the URL: https://coinmarketcap.com/all/views/all/. It contains information about cryptocurrencies such as their current prices, their price variations, etc.

https://coinmarketcap.com/all/views/all/

The goal is to collect those data with Scrapy and then to return them as JSON value with Scrapyrt.

Your project folder structure should currently look like this:

\"Scrapy
Scrapy Project’s Folder Structure

We’ll now create our first Spider. For that, create a new file in the spiders folder. The file’s name doesn’t really matter, it should just represent what your spider is scraping. In my example, I will simply call it coinSpider.py.

First let’s create a class that inherits from scrapy.Spider.

A Spider class must have a name attribute. This element lets you tell Scrapy which crawler you want to start.
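A sketch of that skeleton (the class name is an assumption; the name attribute matches the crawl command used later):

import scrapy


class CoinSpider(scrapy.Spider):
    name = 'coin'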

Now let’s tell Scrapy the first URL we want to send a request to. We’ll do it with a start_requests method. This method returns the Scrapy requests for the URLs you want to crawl. In our case, it looks like this:
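Inside the spider class:

def start_requests(self):
    urls = ['https://coinmarketcap.com/all/views/all/']
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)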

The scrapy.Request function takes the URL you want to crawl as the first parameter and a callback function that will parse the response you’ll receive from the request.

Our parse method will go through each row of the table containing the cryptocurrency data that we want for our API. It then selects the wanted information using CSS selectors.
The line for row in response.css(“tbody tr”): basically says “take the content of the response, select all the <tr> in the <tbody>, and assign the content of each of them, one by one, to the row variable”. For the first line of the table, the value of this variable would look something like this:

We then loop through each row and apply one more CSS selector to extract the exact value that we want. For example, the name of the currency is contained in a link <a> that has the class currency-name-container assigned to it. By adding ::text to the selector, we specify that we want the text between <a> and </a>. The method .extract_first() is added after the selector to indicate that we want the first value found by the parser. In our case, the CSS selector will return only one value for each element.

We repeat the process with all the data we want to extract, and we then return them in a dictionary.

Quick note: if the data that you want to extract is not between two HTML tags but in an attribute, you can use ::attr(<name_of_the_attribute>) in the CSS selector. In our case we have ::attr(data-usd) as an example.

Here is the complete version of our Spider:
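The original gist isn’t reproduced here; a rough reconstruction based on the selectors discussed above (the field names and the symbol selector are assumptions):

import scrapy


class CoinSpider(scrapy.Spider):
    name = 'coin'

    def start_requests(self):
        urls = ['https://coinmarketcap.com/all/views/all/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # One item per table row of the "All Cryptocurrencies" page.
        for row in response.css('tbody tr'):
            yield {
                'name': row.css('a.currency-name-container::text').extract_first(),
                'symbol': row.css('td.col-symbol::text').extract_first(),
                'price_usd': row.css('a.price::attr(data-usd)').extract_first(),
            }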

Now let’s try to run it. For that, open your terminal and set your working directory in your Scrapy project folder. In my case, the command would be:

cd C:\\Users\\jerom\\Documents\\Code\\scrappy_test\\coinmarketcap

To start the crawler and save the scraped data in a JSON file, run the following command:

scrapy crawl <name_of_the_spider> -o <output_file_name>.json

In our case:

scrapy crawl coin -o coin.json

The file coin.json should be created at the root of your coinmarketcap folder

\"Scrapy

It should contain the results scraped by the spider, in a format similar to the following:

If the format of the results is not similar to the example or if you have some errors, you can refer to this repository.

Install Scrapyrt and combine it with our project

Let’s now use Scrapyrt to serve those data through an HTTP request instead of having them saved in a JSON file.

The installation of Scrapyrt is quite straightforward. You just have to run

pip install scrapyrt

To use it, open your terminal again and set your working directory to the Scrapy project folder. Then run the following command:

scrapyrt -p <PORT>

<PORT> can be replaced with a port number. For example

scrapyrt -p 3000

With this command, Scrapyrt will set up a simple local HTTP server that will allow you to control your crawler. You access it with a GET request through the endpoint http://localhost:<PORT>/crawl.json. To work properly, it also needs at least these two arguments: start_requests (Boolean) and spider_name (string). Here, you’d access the result by opening the following URL in your browser:

http://localhost:3000/crawl.json?start_requests=true&spider_name=coin

The result should look like this:

\"JSON
Note: If you’re on Chrome, you can install this plugin to format the json result nicely in your browser.

Conclusion

You saw the basic steps to create an API with Scrapy, and you can now access data from other websites for your own projects. 
In the title, I specified “how to create your own API from (almost) any website”: this method will work with most websites, but it will be much more difficult to get data from a website that relies heavily on JavaScript.

Disclaimer: Don’t abuse it. If you’re scraping a large website with a lot of visitors, or if you need to request the API frequently, contact the owners of the website for their permission before you scrape it. Sending a large number of requests to a website can make it crash, or the owners could even ban your IP.

Thank you for your time reading my article. If you have any questions, don’t hesitate to contact me through Medium or Twitter.

This was my first article on Medium. I hope you enjoyed it as much as I enjoyed writing it! I will probably write more of them in the future.\n\n

","comment_id":"5d09350ee648730fadb5f47e","feature_image":"http://161.35.123.88/content/images/2020/05/scrapyrt-1.jpeg","featured":false,"visibility":"public","send_email_when_published":false,"created_at":"2018-08-25T02:00:00.000+02:00","updated_at":"2020-05-24T00:46:20.000+02:00","published_at":"2018-08-25T02:00:00.000+02:00","custom_excerpt":"Learn how to use Scrapyrt to set up your own API\n","codeinjection_head":null,"codeinjection_foot":null,"custom_template":null,"canonical_url":"https://medium.com/@mottet.dev/scrapy-and-scrapyrt-how-to-create-your-own-api-from-almost-any-website-ecfb0058ad64","url":"http://161.35.123.88/scrapy-and-scrapyrt--how-to-create-your-own-api-from--almost--any-website/","excerpt":"Learn how to use Scrapyrt to set up your own API\n","reading_time":5,"og_image":null,"og_title":null,"og_description":null,"twitter_image":null,"twitter_title":null,"twitter_description":null,"meta_title":null,"meta_description":null,"email_subject":null}],"settings":{"title":"Crazy Dev Blog","description":"Technology Posts, Automation, Programming, Web Scraping, Coding tutorials and more!","logo":null,"icon":"http://161.35.123.88/content/images/2019/06/Crazy-Dev-Blog-1.png","cover_image":null,"facebook":null,"twitter":"@JeromeDeveloper","lang":"en","timezone":"Europe/Amsterdam","navigation":[{"label":"Home","url":"/"},{"label":"Web Scraping","url":"/tag/web-scraping/"}],"secondary_navigation":[],"meta_title":null,"meta_description":null,"og_image":null,"og_title":null,"og_description":null,"twitter_image":null,"twitter_title":null,"twitter_description":null,"url":"http://161.35.123.88/","codeinjection_head":"\n\n\n\n\n","codeinjection_foot":null}},"__N_SSG":true}