I&T Solution

Reference No.

S-0462

Solution Name

Website Extraction Solution

Solution Description

For businesses of all shapes and sizes, whether start-ups or Fortune 100 companies, collecting data from the web can give market research a broad and up-to-date view of the industry. Acquiring that data manually is a mundane, arduous task, but one that well-designed web crawlers can automate.


To meet this need, we offer a Website Extraction Solution that converts unstructured website data into structured, ready-to-consume data. The solution is built on our self-developed data automation platform, DataCanva, which scrapes website information automatically and continuously, performs various data transformations, and outputs structured data for consumption through files, APIs or webhooks.
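To illustrate what "unstructured to structured" means in practice, here is a minimal sketch using only the Python standard library. It is not DataCanva's implementation; the sample markup, the class names `name` and `price`, and the helper names `ProductParser` and `extract_records` are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page fragment; real sites will use different markup.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Widget</span> <span class="price">12.50</span></li>
  <li><span class="name">Gadget</span> <span class="price">8.00</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects the text of elements whose class is 'name' or 'price'."""

    FIELDS = {"name", "price"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self._field = cls
            if cls == "name":
                # In this assumed markup, a 'name' element starts a new record.
                self.records.append({})

    def handle_data(self, data):
        if self._field and self.records:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def extract_records(html):
    """Parse the HTML and return a list of typed records (price as float)."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return [
        {"name": r.get("name"), "price": float(r["price"])}
        for r in parser.records
        if "price" in r
    ]


if __name__ == "__main__":
    # The structured output could then be written to a file or sent to an API.
    print(json.dumps(extract_records(SAMPLE_HTML), indent=2))
```

The transformation step here is only a type conversion of the price; a production pipeline would also need validation and error handling for pages that do not match the expected layout.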


Our Website Extraction Solution includes a number of proprietary technologies that enable data crawling at scale, even on difficult sites:


a) Anti-ban: Our technology emulates a human visit session to avoid being banned.


b) Auto-queuing: Some sites place visitors in a queue when they are overloaded. Our technology lets the crawlers wait in the virtual waiting room just as a human visitor would.


c) Login: Some sites require valid credentials and session handling before they load more data. Our technology works seamlessly in these scenarios.


d) Deep crawling: Our technology targets not only web pages but also attachments such as Word and PDF files.


e) Natural Language Analysis: Our technology can extract key phrases, key sentences and perform summarisation if needed.


f) Data Change Detection: Our technology extracts only the changes (deltas) in the data, which minimises the crawling workload and allows timely feedback.


g) Rotational Proxy: Our technology leverages a large pool of IP addresses to decrease latency and improve the success rate.


h) Screen capture: Our technology saves the rendered page as a PDF file, giving a historical snapshot of the website for future review.
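As an illustration of item (f), one common way to detect changes between two crawls is to fingerprint each record and compare the fingerprints. The sketch below is not the solution's proprietary method; the function names `fingerprint` and `detect_changes`, and the assumption that every record has a stable key, are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Return a stable SHA-256 hash of a record (a JSON-serialisable dict)."""
    # sort_keys makes the hash independent of dictionary key order.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(previous, current):
    """Compare two crawls, each a dict mapping a record key to a record.

    Returns the keys that were added, removed or changed since the
    previous crawl, so that only the delta needs further processing.
    """
    prev_fp = {key: fingerprint(rec) for key, rec in previous.items()}
    added = sorted(key for key in current if key not in prev_fp)
    removed = sorted(key for key in previous if key not in current)
    changed = sorted(
        key
        for key in current
        if key in prev_fp and fingerprint(current[key]) != prev_fp[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Hypothetical price data from two consecutive crawls.
    old = {"sku-1": {"price": 10}, "sku-2": {"price": 20}}
    new = {"sku-1": {"price": 10}, "sku-2": {"price": 25}, "sku-3": {"price": 5}}
    print(detect_changes(old, new))
```

Storing only the fingerprints from the previous crawl, rather than the full records, keeps the comparison cheap when the data set is large.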

Application Areas

Broadcasting

City Management

Climate and Weather

Commerce and Industry

Development

Education

Employment and Labour

Environment

Finance

Food

Health

Housing

Infrastructure

Law and Security

Population

Recreation and Culture

Social Welfare

Transport

Technologies Used

Artificial Intelligence (AI)

Cloud Computing

Data Analytics

Deep Learning

Machine Learning

Natural Language Processing

Predictive Analytics

Use Case

The Website Extraction Solution is suitable for the following use cases:


a) Market trend analysis

b) Price monitoring (e.g. on major e-commerce websites)

c) Research and development

d) Competitor analysis

e) News/alerts monitoring (e.g. for compliance monitoring)

f) Profile analysis (i.e. retrieving data to enrich user or company profiles)

If any government department would like to conduct a PoC trial or technology testing of the I&T solution, please contact Smart LAB.