Innovation and Technology Solutions

Solution Number

S-0462

Solution Name

Intelligent Website Data Extraction Solution

Solution Description

For businesses of all shapes and sizes, whether start-ups or Fortune 100 companies, scraping the web for data to fuel market research offers a broad and insightful perspective on your industry. Manually acquiring data for market research is a mundane, arduous task, but fortunately one that is easily automated by intelligently designed web crawlers.


In this connection, we offer a Website Extraction Solution that converts unstructured website data into structured, ready-to-consume data. The solution is built on our self-built data automation platform, DataCanva, which scrapes website information automatically and continuously, performs various data transformations, and outputs structured data ready for consumption through files, APIs, or webhooks.
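
To illustrate what converting unstructured website data into structured data can look like, here is a minimal sketch using only the Python standard library. It does not show how DataCanva works internally; the HTML snippet, the class names (product, name, price), ProductParser and extract_products are all hypothetical and chosen purely for illustration.

```python
import json
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects text from elements with class 'name' or 'price' (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.records = []    # structured output, one dict per product
        self._field = None   # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.records.append({})      # a new product block starts here
        elif css_class in ("name", "price") and self.records:
            self._field = css_class      # the next text node belongs to this field

    def handle_data(self, data):
        if self._field:
            self.records[-1][self._field] = data.strip()
            self._field = None


def extract_products(html):
    """Return a list of dicts extracted from the (assumed) product markup."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical page fragment standing in for a crawled web page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">12.50</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">8.00</span></li>
</ul>
"""

if __name__ == "__main__":
    # Emit the structured records as JSON, e.g. for a file or API response.
    print(json.dumps(extract_products(SAMPLE_HTML), indent=2))
```

The resulting list of dictionaries could then be written to a file, served through an API, or posted to a webhook, which are the three output channels named above.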


Our Website Extraction Solution has a number of proprietary technologies to enable data crawling at scale even on difficult sites:


a) Anti-ban: Our technology uses strategies that emulate a human visit session to avoid being banned.


b) Auto-queuing: Some sites implement an automatic queuing feature when they are overloaded; our technology enables the crawlers to queue up in the virtual waiting room just like a human visitor.


c) Login: Some sites require valid credentials and session-related mechanics in order to load more data; our technology works seamlessly in these scenarios.


d) Deep crawling: Our technology targets not only web pages but also attachments such as Word and PDF files.


e) Natural Language Analysis: Our technology can extract key phrases, key sentences and perform summarisation if needed.


f) Data Change Detection: Our technology extracts only the changes (deltas) in the data, minimizing the crawling workload and allowing timely feedback.


g) Rotational Proxy: Our technology leverages a large pool of IP addresses to decrease latency and improve the success rate.


h) Screen capture: Our technology saves the screen as a PDF file, providing a historical snapshot of the website for future review.
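
As an illustration of the data change detection idea in item f), the sketch below compares two crawl snapshots by hashing each record and reporting only what was added, removed, or changed. This is one common way to compute deltas and is not confirmed to be the method the solution uses; fingerprint, detect_changes, and the sample snapshots are hypothetical.

```python
import hashlib
import json


def fingerprint(record):
    """Return a stable SHA-256 hash of a record's content."""
    # sort_keys makes the hash independent of dictionary key order.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(previous, current):
    """Compare two snapshots keyed by record id and return the delta."""
    prev_hashes = {key: fingerprint(value) for key, value in previous.items()}
    curr_hashes = {key: fingerprint(value) for key, value in current.items()}
    return {
        "added": sorted(curr_hashes.keys() - prev_hashes.keys()),
        "removed": sorted(prev_hashes.keys() - curr_hashes.keys()),
        "changed": sorted(
            key
            for key in curr_hashes.keys() & prev_hashes.keys()
            if curr_hashes[key] != prev_hashes[key]
        ),
    }


# Hypothetical snapshots from two consecutive crawls of the same site.
YESTERDAY = {
    "sku-1": {"name": "Widget", "price": "12.50"},
    "sku-2": {"name": "Gadget", "price": "8.00"},
}
TODAY = {
    "sku-1": {"name": "Widget", "price": "11.00"},   # price changed
    "sku-3": {"name": "Gizmo", "price": "5.25"},     # newly listed
}

if __name__ == "__main__":
    print(detect_changes(YESTERDAY, TODAY))
```

Only the keys reported in the delta would need to be reprocessed or forwarded downstream, which is where the workload saving comes from.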

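Application Areas

Broadcasting

City Management

Meteorology

Commerce and Industry

Development

Education

Employment and Labour

Environment

Finance

Food

Health

Housing

Infrastructure

Law and Security

Population

Recreation and Culture

Social Welfare

Transport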

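Technologies Used

Artificial Intelligence

Cloud Computing

Data Analytics

Deep Learning

Machine Learning

Natural Language Processing

Predictive Analytics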

Use Cases

The Website Extraction Solution is suitable for the following use cases:


a) Market trend analysis

b) Price monitoring (e.g. on major e-commerce websites)

c) Research and development

d) Competitor analysis

e) News/alerts monitoring (e.g. for compliance monitoring)

f) Profile analysis (i.e. retrieving data to enrich user or company profiles)

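Government departments that wish to conduct a proof-of-concept (PoC) trial or technical testing of an I&T solution should contact Smart LAB.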