Web scraping is the act of interacting with a website or service to collect specific information, which the tool then presents in the way best suited to the needs of whoever programmed it. ESET, a leading proactive threat detection company, explains how it works and shares security tips for using it.
For example, suppose someone needs the daily dollar exchange rate at a certain time: to obtain this information, they can simply visit an official website that publishes foreign currency quotes or consult a search engine, such as Google. But what if they also need the quotes of 14 more foreign currencies and 9 specific cryptocurrencies? Web scraping is useful for optimizing that search process, collecting as much information as possible with one or two clicks.
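The idea above can be sketched in a few lines: instead of checking each quote by hand, a scraper extracts every symbol/rate pair from a page in one pass. The HTML fragment and its class names below are assumptions for illustration, not a real quotes site.

```python
import re

# Hypothetical HTML fragment from a currency-quotes page; the structure
# and class names are illustrative assumptions.
HTML = """
<table class="quotes">
  <tr><td class="symbol">USD</td><td class="rate">1.00</td></tr>
  <tr><td class="symbol">EUR</td><td class="rate">0.92</td></tr>
  <tr><td class="symbol">BTC</td><td class="rate">0.000015</td></tr>
</table>
"""

def extract_quotes(html: str) -> dict:
    """Collect every symbol/rate pair from the page in one pass."""
    pattern = re.compile(
        r'<td class="symbol">(\w+)</td><td class="rate">([\d.]+)</td>'
    )
    return {sym: float(rate) for sym, rate in pattern.findall(html)}

quotes = extract_quotes(HTML)
print(quotes["EUR"])  # 0.92
```

A real scraper would fetch the page over HTTP and would often use an HTML parser rather than a regular expression, but the principle is the same: one automated pass replaces many manual lookups.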
What should I pay attention to?
Any user can perform web scraping, since it is an automated system that accesses a website to “see” information. However, according to ESET, there are two important points to take into account:
- Criminals use web scraping to build databases for exchange or sale: like any tool, digital or not, what it is used for depends solely and exclusively on who is using it. It can legitimately help a bank obtain information about currency quotes; used maliciously, it can automate the collection of people’s information, later stored in a file that is eventually sold or traded on forums on the Deep or Dark Web.
“An example of its use happened some time ago, when a large store ran a promotion that requested its customers’ registration data, but the page left that information public. Criminals analyzed the site and found it was also possible to view the same page with other customers’ data; with that information in hand, they could create a scraper that collects and stores it,” says Camilo Gutiérrez Amaya, Head of the Research Laboratory at ESET Latin America.
“Several leaks that we are aware of were carried out through web scraping, but the technique can also be used non-maliciously. To keep a data-collection effort from taking on malicious characteristics, it is worth understanding how to shape it,” he added.
- DDoS (denial-of-service) risk: Command-line web scraping solutions tend to retrieve information faster, but if not parameterized correctly they can generate such a large number of requests that they are interpreted as a DDoS attack, leading to a temporary or permanent block of the IP performing the scan.
- Depending on the site’s protection systems, the source IP may be blacklisted, and other sites may then reject connections from the source that initiated the web scraping.
If you want to venture into data scraping, it can be useful to learn how to adjust the number of requests per second, how many seconds to wait between one request and the next, whether it is possible to change the web client (user agent) sent in the requests, and how to configure a maximum amount of information to collect so that, once that limit is reached, the scraping process stops.
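Those adjustments can be sketched as a small “polite” scraping loop. The `fetch` function below is a stand-in for a real HTTP request, and the parameter names and default values are illustrative assumptions, not a real library’s API.

```python
import time

def fetch(url: str) -> str:
    """Stand-in for a real HTTP request; a real scraper would use a
    library such as urllib or requests here."""
    return f"<html>page for {url}</html>"

def polite_scrape(urls, delay_seconds=1.0, user_agent="study-bot/0.1",
                  max_items=50):
    """Collect pages while respecting the limits described above:
    a pause between requests, an identifiable client, and a hard cap."""
    collected = []
    for url in urls:
        if len(collected) >= max_items:
            break  # stop once the configured maximum is reached
        # In a real request the user agent would go in the headers,
        # e.g. {"User-Agent": user_agent}; here we only record it.
        collected.append((url, user_agent, fetch(url)))
        time.sleep(delay_seconds)  # space out requests to avoid a flood
    return collected

pages = polite_scrape(["https://example.com/a", "https://example.com/b"],
                      delay_seconds=0.01, max_items=1)
print(len(pages))  # 1
```

Keeping the delay and the maximum-item cap configurable is exactly what prevents a study scraper from looking like a DDoS attack to the target site.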
More tips
Since it is a very specific tool whose impact is felt mainly by administrators of sites and services accessible through the web, ESET shares some security tips that can help deal with web scraping more adequately:
- Don’t worry too much about blocking: it’s worth remembering that scraping is just access to information, and it can be unproductive to try to block it. Instead, direct efforts toward ensuring legitimate access to information.
- Make sure that a person’s data is accessible only by that person: adjust information-access authorization so that the entire database is not available to any user who merely authenticates in the system.
- Adequately size server resources, taking into account the spikes in connections that can occur periodically, to avoid any interruption of the service.
- Correctly configure automatic blocks: automatic blocks are sometimes triggered when requests reach a higher volume than expected. If you want to further limit eventual scraping, increase the sensor’s sensitivity; otherwise, make sure the blocks are not permanent, because the behavior of some browsers and users can generate excessive requests, and the filters can end up blocking legitimate people or software.
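The per-person access tip above can be shown as a minimal authorization check: authentication alone is not enough, the requested record must also belong to the requester. The record store and field names are hypothetical.

```python
# Hypothetical record store: each record carries an "owner" field.
RECORDS = {
    101: {"owner": "alice", "email": "alice@example.com"},
    102: {"owner": "bob", "email": "bob@example.com"},
}

def get_record(record_id: int, authenticated_user: str) -> dict:
    """Return a record only if it belongs to the requesting user."""
    record = RECORDS.get(record_id)
    if record is None:
        raise KeyError("no such record")
    if record["owner"] != authenticated_user:
        # Being authenticated is not enough: the record must be yours.
        raise PermissionError("not authorized to view this record")
    return record

print(get_record(101, "alice")["email"])  # alice@example.com
```

With a check like this in place, a scraper that enumerates record IDs under a single account gets nothing but its own data, which defuses the leak scenario described earlier.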
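The last tip, temporary rather than permanent blocks, can be sketched as a sliding-window rate limiter: a client that exceeds the request threshold is blocked only for a cooldown period, so bursty but legitimate users are not locked out forever. All thresholds and names here are illustrative assumptions.

```python
from collections import deque

class TemporaryBlocker:
    """Block a client temporarily when it exceeds max_requests within
    window_seconds; the block expires after block_seconds."""

    def __init__(self, max_requests=5, window_seconds=1.0, block_seconds=2.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.block_seconds = block_seconds
        self.history = {}        # client -> deque of request timestamps
        self.blocked_until = {}  # client -> time when the block expires

    def allow(self, client: str, now: float) -> bool:
        if now < self.blocked_until.get(client, 0.0):
            return False  # still inside the temporary block
        times = self.history.setdefault(client, deque())
        while times and now - times[0] > self.window:
            times.popleft()  # drop requests outside the window
        times.append(now)
        if len(times) > self.max_requests:
            self.blocked_until[client] = now + self.block_seconds
            return False
        return True

blocker = TemporaryBlocker(max_requests=3, window_seconds=1.0,
                           block_seconds=5.0)
results = [blocker.allow("10.0.0.1", t) for t in (0.0, 0.1, 0.2, 0.3)]
print(results)  # [True, True, True, False]
```

Because the block expires (here, after 5 seconds), the same client is served again once its request rate returns to normal, which is exactly the non-permanent behavior the tip recommends.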