Automated content gathering
Information is Power
How to collect data from several websites automatically?
Web users often visit several websites on a regular basis to gather important information, and doing this by hand is tedious.
Just imagine how useful it would be if this information were gathered, rendered in a suitable format, and sent by email at a set time.
I have written a couple of scripts for myself that scan websites (including ones that require authorization) and send me a report email when something valuable appears.
I am thinking of building a web service that lets you specify data sources and processing rules for the collected data, but I need some confirmation from people that such a service is really needed.
Automated Data Collection Service Specifications
- Collect data from websites that require authorization
- Present data in different formats: xml, csv, txt, doc
- Check for data updates as often as once a minute
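As a rough illustration of the update check in the last item, here is a minimal sketch that re-fetches a source on a fixed interval and reports only when the content differs from the previous check. The names `content_digest` and `poll`, the SHA-256 comparison, and the callback interface are assumptions made for this example, not a description of the existing scripts.

```python
import hashlib
import time


def content_digest(content: str) -> str:
    """Return a stable fingerprint of the fetched content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def poll(fetch, on_change, interval_seconds=60.0, max_checks=None, sleep=time.sleep):
    """Call fetch() repeatedly and call on_change(content) when the content changes.

    fetch and on_change are hypothetical callbacks supplied by the caller.
    The very first fetch always counts as a change, because there is
    nothing to compare it with yet.
    """
    last_digest = None
    checks = 0
    while max_checks is None or checks < max_checks:
        content = fetch()
        digest = content_digest(content)
        if digest != last_digest:
            on_change(content)
            last_digest = digest
        checks += 1
        # Wait only if another check is still to come.
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)


# Usage with canned content instead of a real website.
pages = iter(["a", "a", "b"])
seen = []
poll(lambda: next(pages), seen.append, max_checks=3, sleep=lambda seconds: None)
```

Comparing digests rather than full pages keeps the stored state small, but any change to the page, including an unrelated banner, will count as an update unless the content is parsed first.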
The service is rather complex, so setting up a gathering job will require some developer attention.
How does it work?
The service can be divided into three main parts.
- 1. Built-in browser that fetches website page content
- 2. Web page parser (unique for each page)
- 3. Data structuring
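The three parts above can be sketched as a small pipeline in which each stage is a plain function. This is only one possible arrangement: `run_pipeline` and the `fake_*` stand-in stages are hypothetical names invented for the example, and the URL is a placeholder.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def run_pipeline(
    url: str,
    fetch: Callable[[str], str],
    parse: Callable[[str], List[Record]],
    render: Callable[[List[Record]], str],
) -> str:
    """Fetch a page, parse it into records, and render the records as a report."""
    html = fetch(url)        # part 1: built-in browser
    records = parse(html)    # part 2: page-specific parser
    return render(records)   # part 3: data structuring


# Stand-in stages so the sketch runs without network access.
def fake_fetch(url: str) -> str:
    return "<p>price: 42</p>"


def fake_parse(html: str) -> List[Record]:
    value = html.split("price: ")[1].split("<")[0]
    return [{"price": value}]


def fake_render(records: List[Record]) -> str:
    return "\n".join(f"{key}={value}" for rec in records for key, value in rec.items())


report = run_pipeline("https://example.com/prices", fake_fetch, fake_parse, fake_render)
```

Keeping the stages separate means only the parser has to be rewritten when a new website is added.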
The third part is the simplest. By then all the data is already parsed, and we can do whatever we want with it.
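To show how little work the structuring step needs, here is a minimal sketch that turns parsed records into csv and xml using only the Python standard library. The record layout (a list of dictionaries with string values) and the names `to_csv` and `to_xml` are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def to_csv(records):
    """Render records as CSV text; columns come from the first record's keys."""
    if not records:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_xml(records):
    """Render records as XML; keys must be valid XML element names."""
    root = ET.Element("records")
    for rec in records:
        item = ET.SubElement(root, "record")
        for key, value in rec.items():
            ET.SubElement(item, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical parsed data.
records = [{"title": "Item A", "price": "10"}]
csv_text = to_csv(records)
xml_text = to_xml(records)
```

Both functions assume every record has the same keys as the first one; records with extra keys would make `csv.DictWriter` raise an error.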
The second part is unique to each page: we have to teach the script to extract the required words, images, and numbers from the page HTML. Sometimes this is a complex task.
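A page-specific parser might look like the sketch below, which uses Python's built-in `html.parser` module to pull the text out of every `span` whose class list contains `price`. The markup, the class name, and the names `PriceParser` and `parse_prices` are made up for the example; a real page needs its own rules.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text of <span class="price"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Simplification: any closing </span> ends the current price element,
        # so spans nested inside a price span are not handled.
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def parse_prices(html):
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices


sample = '<div><span class="price">10.50</span><span>x</span><span class="price big">7</span></div>'
prices = parse_prices(sample)
```

Rules like these break whenever the site changes its markup, which is part of why each page needs developer attention.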
The first part is the most complex. Anti-bot features are sometimes impossible to avoid. Take an image captcha: in simple cases the code can be read from it automatically, but in general doing so is too expensive.
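For sites that use a plain login form and no captcha, the browser part can be approximated with a cookie-aware HTTP client. The sketch below uses `urllib.request` and `http.cookiejar` from the Python standard library. The form field names `username` and `password`, the User-Agent string, and the function names are assumptions, since every site defines its own login form. The sketch does nothing about captchas or other anti-bot measures.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Create an opener that keeps cookies between requests, like a browser."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "content-gatherer-sketch/0.1")]  # assumed value
    return opener, jar


def build_login_request(login_url, username, password):
    """Build a POST request for a login form with assumed field names."""
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"), method="POST")


def fetch_protected(opener, login_request, page_url, timeout=10.0):
    """Log in, then fetch a page using the session cookies set by the login."""
    with opener.open(login_request, timeout=timeout):
        pass  # the response body is not needed, only the cookies it sets
    with opener.open(page_url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Building the session and the request needs no network access.
opener, jar = make_session()
login_request = build_login_request("https://example.com/login", "user", "secret")
```

Calling `fetch_protected(opener, login_request, "https://example.com/report")` would then perform the two requests; the URLs here are placeholders.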