In today’s digital world we have become accustomed to searching the World Wide Web quickly and easily. Results are user-friendly and commonly optimized for display on a variety of devices. Only a few decades ago, retrieving information meant the cumbersome and tedious exercise of leafing through books, catalogues and whatever other printed materials were available; nowadays it is available at the press of a button. The Web thus offers users and end-consumers entirely new ways to obtain information.
This easy accessibility, however, creates new problems for companies. Increasingly, information is what constitutes a company’s value, and making it freely accessible carries considerable risks. As a result, companies looking for a professional data extraction solution are faced with countless constraints.
Knowledge is Power
In keeping with Sir Francis Bacon’s famous maxim, many companies offer only limited access to their wealth of information. Consider, for example, the dilemma an online shop quickly runs into:
For the shop’s online presence it is essential to have enticing descriptions, good illustrations, and reliable, easy-to-use search and filter capabilities. At the same time, the product portfolio and the details of each individual product represent a sensitive asset that must be protected from competitors. A digitized extraction method opens up a new and faster way to automate data extraction and handle data analysis.
Many companies therefore simply do not expose their data interfaces at all and go to great lengths to implement security best practices that block unauthorized access to their data. Often the only way to reach the data is through the website’s ordinary search function.
Because of these preventive measures, and the special requirements they impose, existing data extraction tools are frequently not flexible enough to deliver satisfactory results.
As a result, countless companies have had no choice but to set up large data-capture operations to compile the information by hand. First a workflow is defined, then the work is split according to the available resources, and finally the information is extracted through a manual website search.
For lack of alternatives, this ad-hoc solution has several drawbacks:
- High Error-Rate
Incorrect manual data entries and transmission errors lead to poor-quality data
- Linear processing time dependent on data volume
The processing time increases linearly in proportion to the volume of data extracted
- Impractical Repetition Process
Repeating the process simply doubles the time and costs each time
- High Costs (applicable for medium-sized data volumes)
The linear increase in processing time commonly translates into high costs even for medium-sized data volumes
To overcome these drawbacks, an automated data extraction process is developed that covers the various products and competitor web shops and presents the results, one after another, in a structured report.
- Specification of the required information
At the beginning of the implementation phase, the first step is to define all the information required for the analysis. In our example, the different product types and their respective compositions need to be matched so that prices can subsequently be compared.
- Manual Web Crawler
Once the information required for the analysis has been defined, the websites are crawled manually. Before automation begins, it is essential to verify that the predefined information is actually available and, at the same time, to decide which automated process should be applied.
- Web indexing (Create a Sitemap)
For the automated process, a list of the pages found by the crawler is compiled. This so-called sitemap is produced for each shop and then merged into one single document. While the pages are scanned, unique features identifying a product page are extracted; if a page matches the predefined criteria, it is added to the sitemap.
- Saving extracted website data
Once the sitemap has been generated, the contents of all the listed pages are downloaded in one pass. This speeds up the subsequent data extraction and makes it possible to scan the data iteratively, unaffected by server outages or website updates. The download process can be specifically aligned with and adapted to the security policies of each individual shop.
- Data Parsers
To compile the specified information, the pages recorded in the sitemap are processed and the predefined information is extracted from the downloaded websites, e.g. using regular expressions. The data parser caches the captured data so that it can be processed and reused later on.
- Data Consolidation
In the last step, the data cached in the previous step is matched on the basis of its unique features, cleansed, converted, and finally written into a predefined template. The delivery format depends on how the data is to be used, e.g. a CSV (comma-delimited) text file or an Excel workbook.
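The web-indexing step above can be sketched in a few lines of Python. The sample pages, the URLs and the "price" class used as the product-page criterion are illustrative assumptions, not taken from any real shop:

```python
import re

# A crawler yields (url, html) pairs; hard-coded samples stand in here
# for real scan results (hypothetical shop and markup).
scanned_pages = [
    ("https://shop.example/p/101", '<span class="price">12.90</span>'),
    ("https://shop.example/about", "<p>About our company</p>"),
    ("https://shop.example/p/102", '<span class="price">8.40</span>'),
]

# Unique feature marking a product page in this hypothetical shop:
# product pages contain a price element, other pages do not.
PRODUCT_PAGE_RE = re.compile(r'<span class="price">')

def build_sitemap(pages):
    """Return the URLs of pages matching the product-page criterion."""
    return [url for url, html in pages if PRODUCT_PAGE_RE.search(html)]

sitemap = build_sitemap(scanned_pages)
print(sitemap)
```

In practice each shop needs its own criterion, since page layouts differ; the per-shop sitemaps are then merged into the single document described above.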
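The data-parsing and consolidation steps can likewise be sketched in Python. The page contents, CSS class names and regular expressions below are hypothetical; a real implementation would need its own patterns for each shop's page layout:

```python
import csv
import io
import re

# Hypothetical downloaded product pages, keyed by URL. In practice these
# would have been saved to disk in the download step.
downloaded_pages = {
    "https://shop-a.example/p/101": (
        '<h1 class="product">Espresso Beans 1kg</h1>'
        '<span class="price">12.90</span>'
    ),
    "https://shop-b.example/item?id=7": (
        '<h1 class="product">Espresso Beans 1kg</h1>'
        '<span class="price">11.50</span>'
    ),
}

# Regular expressions for the predefined information (assumed markup).
NAME_RE = re.compile(r'<h1 class="product">(.*?)</h1>')
PRICE_RE = re.compile(r'<span class="price">([\d.]+)</span>')

def parse_page(html):
    """Extract (name, price) from one downloaded page, or None."""
    name = NAME_RE.search(html)
    price = PRICE_RE.search(html)
    if name and price:
        return name.group(1), float(price.group(1))
    return None

def consolidate(pages):
    """Match records by product name and emit one CSV row per product."""
    by_name = {}
    for url, html in pages.items():
        record = parse_page(html)
        if record:
            name, price = record
            by_name.setdefault(name, []).append((url, price))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["product", "min_price", "max_price", "sources"])
    for name, offers in sorted(by_name.items()):
        prices = [p for _, p in offers]
        writer.writerow([name, min(prices), max(prices), len(offers)])
    return out.getvalue()

print(consolidate(downloaded_pages))
```

Here the product name serves as the unique feature for matching across shops; the consolidated result is written as comma-delimited CSV, one of the delivery formats mentioned above.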
In comparison to the manual data extraction process, the automated approach brings with it many advantages:
- Low Error-Rate
Eliminates cumbersome data entry
- Processing time no longer dependent on data volume
Downloading and analyzing the extracted data only takes a couple of hours
- Simplified Repetition Process
If the automated process needs to be repeated, or the requirements are revised or extended, the shorter processing time makes the rerun quicker and more cost-effective
- Low Costs (applicable for medium-sized data volumes)
The short processing time keeps costs down even for medium-sized data volumes or when the process has to be repeated
The list of advantages easily outweighs the implementation costs. After the initial screening and the implementation of the basic functionality, further enhancements and additional data extraction runs are possible with little extra effort and at low cost.