As an exercise, I tried to find Slovenian companies that export their goods to Austria. Let’s say we are interested in the 10 companies with the highest income. It seems like an easy task, doesn’t it? All the data is publicly available at http://sloexport.si, but unfortunately:

  • the website cannot sort results by company income
  • data export is limited to a maximum of 100 records.

Collecting the data by hand would probably take several days. Can Rails help us do this? Read on …

What is the biggest problem?
The website has several architectural problems: it does not store search parameters in the URL, it keeps the user’s settings in the server-side session, and it requires JavaScript to work correctly. For users, this means they cannot send links to search results to their friends or bookmark them. For developers, it means the data is harder to get, but not impossible.

Heard of Capybara?
There is a gem called Capybara, which is mostly used in automated tests. Together with a browser driver, Capybara helps us mimic a modern browser environment, including JavaScript, sessions and proper CSS rendering. It is actually fun to watch it mimic user clicks on buttons, wait until AJAX requests finish, or open a JavaScript modal window and carry on from there.
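To give a feel for what driving a page with Capybara looks like, here is a minimal sketch. It is not the code from the gist: the field id `export_country`, the button label `Search` and the row selector `table.results tr.company` are made-up placeholders, because the real markup of the site is not shown here. It also assumes the `capybara` and `selenium-webdriver` gems and a local browser are installed.

```ruby
# A sketch only: selectors, field ids and labels below are hypothetical.
SEARCH_URL = "http://sloexport.si"

# Builds a real browser-backed session. The gems are required lazily,
# so the scraping method below can also be tried with a stand-in object.
def build_session
  require "capybara"
  require "selenium-webdriver"
  Capybara::Session.new(:selenium)
end

# Drives one search and returns the text of every result row.
def scrape_company_rows(session, country:)
  session.visit(SEARCH_URL)
  session.select(country, from: "export_country") # hypothetical field id
  session.click_button("Search")                  # hypothetical button label
  # With `minimum: 1`, `all` keeps retrying until at least one row appears
  # or Capybara's wait time runs out, which covers AJAX-loaded results.
  session.all("table.results tr.company", minimum: 1).map(&:text)
end

# Usage (opens a real browser window):
#   rows = scrape_company_rows(build_session, country: "Austria")
#   puts rows
```

Because the session is passed in as an argument, the same browser (and therefore the same server-side session) can be reused across several searches, which is what a site that keeps its state in the session requires.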

We are attaching a video of the data scraping in action and the code required to do it (fewer than 60 lines). The final result of the process is an SQLite database file, which we can use to sort and filter the data we care about.
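Once the records are collected, picking the 10 companies with the highest income is a one-liner. The sketch below does it in plain Ruby on in-memory records rather than against the SQLite file, and the `Company` struct with its `name` and `income` fields is an assumption for illustration, not the schema the spider actually writes.

```ruby
# Hypothetical record shape; the real database columns may differ.
Company = Struct.new(:name, :income)

# Returns the `limit` companies with the highest income, largest first.
def top_by_income(companies, limit = 10)
  companies.max_by(limit, &:income)
end

companies = [
  Company.new("Alfa d.o.o.", 1_200_000),
  Company.new("Beta d.d.", 8_500_000),
  Company.new("Gama d.o.o.", 3_100_000)
]

top_by_income(companies, 2).each do |company|
  puts "#{company.name}: #{company.income}"
end
```

Against the database itself, the same idea would be an ordering by the income column in descending order with a limit of 10.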

The code of the scraping spider is available here: https://gist.github.com/knagode/b0be5225e028d1d3c152