How to Make Your AJAX Application's Content Crawlable by Search Engines

This document outlines the steps needed to make your AJAX application crawlable. Once you understand each step, making your application crawlable should not take long. You do need to understand every step involved, however, so we recommend reading this guide in its entirety.

Overview of Solution

Briefly, the solution works as follows: the crawler finds a pretty AJAX URL (that is, a URL containing a #! hash fragment). It then requests the content for this URL from your server in a slightly modified form, with the hash fragment moved into an _escaped_fragment_ query parameter. Your web server returns the content in the form of an HTML snapshot, which is then processed by the crawler. The search results will show the original URL.

Step-by-step guide

1. Indicate to the crawler that your site supports the AJAX crawling scheme

The first step to getting your AJAX site indexed is to indicate to the crawler that your site supports the AJAX crawling scheme. The way to do this is to use a special token in your hash fragments (that is, everything after the # sign in a URL): hash fragments have to begin with an exclamation mark. For example, if your AJAX app contains a URL like this:

www.example.com/ajax.html#key=value
it should now become this:

www.example.com/ajax.html#!key=value
When your site adopts the scheme, it will be considered "AJAX crawlable." This means that the crawler will see the content of your app if your site supplies HTML snapshots.

2. Set up your server to handle requests for URLs that contain _escaped_fragment_

Suppose you would like to get www.example.com/ajax.html#!key=value indexed. Your part of the agreement is to provide the crawler with an HTML snapshot of this URL, so that the crawler sees the content. How will your server know when to return an HTML snapshot instead of a regular page? The answer is the URL that is requested by the crawler: the crawler will modify each AJAX URL such as

www.example.com/ajax.html#!key=value
to temporarily become

www.example.com/ajax.html?_escaped_fragment_=key=value
You may wonder why this is necessary. There are two very important reasons:

Hash fragments are never (by specification) sent to the server as part of an HTTP request. In other words, the crawler needs some way to let your server know that it wants the content for the URL www.example.com/ajax.html#!key=value (as opposed to simply www.example.com/ajax.html).
Your server, on the other hand, needs to know that it has to return an HTML snapshot, rather than the normal page sent to the browser. Remember: an HTML snapshot is all the content that appears on the page after the JavaScript has been executed. Your server's end of the agreement is to return the HTML snapshot for www.example.com/ajax.html#!key=value (that is, the original URL!) to the crawler.
Note: The crawler escapes certain characters in the fragment during the transformation. To retrieve the original fragment, make sure to unescape all %XX characters in the fragment. More specifically, %26 should become &, %20 should become a space, %23 should become #, %25 should become %, and so on.
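
To make the mapping concrete, here is a minimal Java sketch of the transformation as seen from the crawler's side. The class and method names are hypothetical. The exact set of characters escaped here (control characters, space, #, %, &, + and non-ASCII bytes) is an assumption based on the note above, not an authoritative list. The handling of a URL that already has a query string is likewise an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: mimics how a crawler might turn a pretty URL into an ugly one.
class PrettyToUglyUrl {
    static final String TOKEN = "_escaped_fragment_=";

    // Percent-escape the bytes assumed to be special inside the fragment value.
    static String escapeFragment(String fragment) {
        StringBuilder out = new StringBuilder();
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean special = c <= 0x20 || c == '#' || c == '%'
                    || c == '&' || c == '+' || c >= 0x7F;
            if (special) {
                out.append(String.format("%%%02X", c)); // e.g. '&' -> %26
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    // Map "...#!fragment" to "...?_escaped_fragment_=fragment".
    static String toUglyUrl(String prettyUrl) {
        int i = prettyUrl.indexOf("#!");
        if (i < 0) {
            return prettyUrl; // no #! token: the URL is left unchanged
        }
        String base = prettyUrl.substring(0, i);
        String fragment = prettyUrl.substring(i + 2);
        // Assumption: an existing query string is kept and the token is appended with '&'.
        String separator = base.contains("?") ? "&" : "?";
        return base + separator + TOKEN + escapeFragment(fragment);
    }

    public static void main(String[] args) {
        // prints www.example.com/ajax.html?_escaped_fragment_=key=value
        System.out.println(toUglyUrl("www.example.com/ajax.html#!key=value"));
    }
}
```

Because & and % inside the fragment are escaped, the server can treat everything after _escaped_fragment_= as the fragment and recover it with standard URL decoding.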

Now that you have your original URL back and you know what content the crawler is requesting, you need to produce an HTML snapshot. How do you do that? There are various ways; here are some of them:

If a lot of your content is produced with JavaScript, you may want to use a headless browser such as HtmlUnit to obtain the HTML snapshot. Alternatively, you can use a different tool such as crawljax or watij.com.
If much of your content is produced with a server-side technology such as PHP or ASP.NET, you can use your existing code and only replace the JavaScript portions of your web page with static or server-side created HTML.
You can create a static version of your pages offline, as is the current practice. For example, many applications draw content from a database that is then rendered by the browser. Instead, you may create a separate HTML page for each AJAX URL.
It's highly recommended that you try out your HTML snapshot mechanism. It's important to make sure that the headless browser indeed renders the content of your application's state correctly. Surely you'll want to know what the crawler will see, right? To do this, you can write a small test application and see the output, or you can use a tool such as Fetch as Googlebot.

To summarize, make sure the following happens on your server:

A request URL of the form www.example.com/ajax.html?_escaped_fragment_=key=value is mapped back to its original form: www.example.com/ajax.html#!key=value.
The token is URL unescaped. The easiest way to do this is to use standard URL decoding. For example, in Java you would do this:
mydecodedfragment = URLDecoder.decode(myencodedfragment, "UTF-8");
An HTML snapshot is returned, ideally along with a prominent link at the top of the page, letting end users know that they have reached the _escaped_fragment_ URL in error. (Remember that _escaped_fragment_ URLs are meant to be used only by crawlers.) For all requests that do not have an _escaped_fragment_, the server will return content as before.
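
The summary above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the class and method names are hypothetical, the query string is assumed to be available as a raw (still encoded) string, and the _escaped_fragment_ parameter is assumed to be the last parameter in the query. Producing the snapshot itself is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: recognizes crawler requests and recovers the original URL.
class EscapedFragmentHandler {
    static final String PARAM = "_escaped_fragment_=";

    // Index where the parameter starts in the raw query, or -1 if absent.
    static int tokenStart(String rawQuery) {
        if (rawQuery == null) {
            return -1;
        }
        if (rawQuery.startsWith(PARAM)) {
            return 0;
        }
        int i = rawQuery.indexOf("&" + PARAM);
        return i < 0 ? -1 : i + 1;
    }

    // Returns the unescaped fragment, or null for a normal (non-crawler) request.
    static String extractFragment(String rawQuery) {
        int start = tokenStart(rawQuery);
        if (start < 0) {
            return null;
        }
        String encoded = rawQuery.substring(start + PARAM.length());
        try {
            return URLDecoder.decode(encoded, "UTF-8"); // %26 -> &, %20 -> space, ...
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Maps the requested URL back to its original #! form, or returns null
    // when the server should simply serve the regular page.
    static String toPrettyUrl(String urlWithoutQuery, String rawQuery) {
        String fragment = extractFragment(rawQuery);
        if (fragment == null) {
            return null;
        }
        int start = tokenStart(rawQuery);
        String otherParams = start > 0 ? "?" + rawQuery.substring(0, start - 1) : "";
        String hash = fragment.isEmpty() ? "" : "#!" + fragment;
        return urlWithoutQuery + otherParams + hash;
    }

    public static void main(String[] args) {
        // prints www.example.com/ajax.html#!key=value
        System.out.println(toPrettyUrl("www.example.com/ajax.html",
                "_escaped_fragment_=key=value"));
    }
}
```

A request handler would call toPrettyUrl first: a null result means the regular page is served as before, and any other result identifies the application state whose HTML snapshot should be returned.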

3. Handle pages without hash fragments

Some of your pages may not have hash fragments. For example, you might want your home page to be www.example.com, rather than www.example.com#!home. For this reason, we have a special provision for pages without hash fragments.

Note: Make sure you use this option only for pages that contain dynamic, AJAX-created content. For pages that have only static content, it would not give extra information to the crawler, but it would put extra load on your servers and Google's.

In order to make pages without hash fragments crawlable, you include a special meta tag in the head of the HTML of your page. The meta tag takes the following form:
<meta name="fragment" content="!">
This indicates to the crawler that it should crawl the ugly version of this URL (that is, the version containing _escaped_fragment_). As per the above agreement, the crawler will temporarily map the pretty URL to the corresponding ugly URL. In other words, if you place <meta name="fragment" content="!"> into the page www.example.com, the crawler will temporarily map this URL to www.example.com?_escaped_fragment_= and will request this from your server. Your server should then return the HTML snapshot corresponding to www.example.com. Please note that one important restriction applies to this meta tag: the only valid content is "!". In other words, the meta tag will always take the exact form <meta name="fragment" content="!">, which indicates an empty hash fragment but a page with AJAX content.
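
The following Java sketch illustrates this provision from the crawler's point of view. It is only an illustration: the class and method names are hypothetical, and the regular expression is a simplification that assumes the attributes appear in the order shown above. A real implementation would more likely use an HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper: decides which URL a crawler would request for a
// page that has no hash fragment.
class FragmentMetaTag {
    // Simplified match for <meta name="fragment" content="!">.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=[\"']fragment[\"']\\s+content=[\"']![\"']\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // True when the page opts in to the scheme through the meta tag.
    static boolean optsIn(String html) {
        return META.matcher(html).find();
    }

    // Returns the ugly URL for an opted-in page, or the URL unchanged otherwise.
    static String crawlUrlFor(String url, String html) {
        if (url.contains("#") || !optsIn(html)) {
            return url;
        }
        // Assumption: an existing query string is kept and the token is appended with '&'.
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "_escaped_fragment_=";
    }

    public static void main(String[] args) {
        String html = "<html><head><meta name=\"fragment\" content=\"!\"></head></html>";
        // prints www.example.com?_escaped_fragment_=
        System.out.println(crawlUrlFor("www.example.com", html));
    }
}
```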

4. Consider updating your Sitemap to list the new AJAX URLs

Crawlers use Sitemaps to complement their discovery crawl. Your Sitemap should include the version of your URLs that you'd prefer to have displayed in search results, so in most cases it would be http://example.com/ajax.html#!key=value. Do not include links such as http://example.com/ajax.html?_escaped_fragment_=key=value in the Sitemap. Googlebot does not follow links that contain _escaped_fragment_! If you have an entry page to your site, such as your homepage, that you would like displayed in search results without the #!, then add this URL to the Sitemap as is. For instance, if you want this version displayed in search results:

http://example.com/
then include

http://example.com/
in your Sitemap and make sure that <meta name="fragment" content="!"> is included in the head of the HTML document. For more information, check out our additional articles on Sitemaps.
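
As a rough illustration of this rule, the Java sketch below builds a minimal Sitemap from a list of URLs and skips any that contain _escaped_fragment_. The class and method names are hypothetical, and only the required <loc> element of the standard Sitemap format is emitted.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: writes pretty URLs to a Sitemap and drops ugly ones.
class SitemapSketch {
    // Escape the characters that are special in XML text content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String buildSitemap(List<String> urls) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String url : urls) {
            if (url.contains("_escaped_fragment_")) {
                continue; // ugly URLs are meant for crawler requests only
            }
            xml.append("  <url><loc>").append(escapeXml(url)).append("</loc></url>\n");
        }
        xml.append("</urlset>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildSitemap(Arrays.asList(
                "http://example.com/",
                "http://example.com/ajax.html#!key=value",
                "http://example.com/ajax.html?_escaped_fragment_=key=value")));
    }
}
```

Run as written, the output lists the first two URLs and omits the third.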

5. Optionally, but importantly, test the crawlability of your app: see what the crawler sees with "Fetch as Googlebot".

Google provides a tool, Fetch as Googlebot, that gives you an idea of what the crawler sees. You should use this tool to check whether your implementation is correct and whether the bot can now see all the content you want a user to see. It is also important to use this tool to ensure that your site is not cloaking.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies.

Last updated June 18, 2014.

References:

https://developers.google.com/webmasters/ajax-crawling/docs/getting-started

https://developers.google.com/webmasters/ajax-crawling/docs/learn-more

Time: 2024-10-11 09:58:25
