劉任昌079影片

網路爬蟲

網路爬蟲

英語：web crawler，也叫網路蜘蛛（spider），是一種用來自動瀏覽全球資訊網的網路機器人。其目的一般為編纂網路索引。

網路搜尋引擎等站點通過爬蟲軟體更新自身的網站內容或其對其他網站的索引。網路爬蟲可以將自己所存取的頁面儲存下來，以便搜尋引擎事後生成索引供使用者搜尋。

爬蟲存取網站的過程會消耗目標系統資源。不少網路系統並不默許爬蟲工作。因此在存取大量頁面時，爬蟲需要考慮到規劃、負載，還需要講「禮貌」。不願意被爬蟲存取、被爬蟲主人知曉的公開站點可以使用robots.txt檔案之類的方法避免存取。這個檔案可以要求機器人只對網站的一部分進行索引，或完全不作處理。

網際網路上的頁面極多，即使是最大的爬蟲系統也無法做出完整的索引。因此在公元2000年之前的全球資訊網出現初期，搜尋引擎經常找不到多少相關結果。現在的搜尋引擎在這方面已經進步很多，能夠即刻給出高品質結果。

爬蟲還可以驗證超連結和HTML代碼，用於網路抓取（參見資料驅動編程）。

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically operated by search engines for the purpose of Web indexing (web spidering).[1] Web search engines and some other websites use Web crawling or spidering software to update their web content or indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently. Crawlers consume resources on visited systems and often visit sites without approval. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For example, including a robots.txt file can request bots to index only parts of a website, or nothing at all. The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today, relevant results are given almost instantly. Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping and data-driven programming.

以上轉貼自維基百科。

網路爬蟲