币小站 Log 2: Is my "币小站" illegal? Where is the legal boundary for crawlers?
## Case studies
I found a project on GitHub that collects court cases involving crawlers (https://github.com/HiddenStrawberry/Crawler_Illegal_Cases_In_China); its last commit was 10 days ago. After reading through it, here is what I found.
## Where exactly is the legal boundary for crawlers?
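Software has become really important in real life, and this is an age where data is king. The law cannot possibly cover every last detail.
But the absence of a legal rule doesn't mean software can do whatever it likes. Some rules from the real world should be followed everywhere, whether in the digital world of software or the virtual world of games.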
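- 1. Don't do anything that damages someone else's property rights.
For example: I see you have a lot of data and I'd like to use some of it; is that OK? If you haven't objected, I think it is. It's like my wanting to look at the painting on your courtyard wall: you can't sue me for that. But even though your wall faces the street, I can't knock it down to get a look. Applied to crawlers: if crawling someone's data disrupts the normal operation of their servers, or if you hack into their machine and delete their data, that is definitely off limits. So when writing a crawler, don't chase efficiency too hard; slow is fine. Show some respect while you take someone's data, and don't spin up hundreds of threads and kill their server.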
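- 2. Don't crawl data that the owner doesn't let you see.
For example, the data is clearly for VIP members only, yet you insist on crawling it out so that everyone can see it. In terms of the wall painting: once the owner has covered it with a cloth, you can't tear the cloth off. Looking at a girl on the street is allowed; tearing off her clothes to look will likely get you a criminal sentence.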
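- 3. Never crawl private data!
公信宝, once such a big name in crypto circles, ended up in custody over exactly this. There are probably two specific reasons:
1. Helping illegal P2P lending platforms collect users' credit information; in fact, it seems they had only written a plugin that crawls credit data.
2. Illegally crawling users' private data.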
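The advice in point 1 (go slowly, don't open hundreds of threads) can be sketched as a small throttle that enforces a minimum gap between requests to the same host. This is only an illustration: the class name `PoliteThrottle` and the 2-second interval in the usage note are my own assumptions, not values from any law or from the sites discussed here.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforce a minimum interval between requests to the same host.

    Hypothetical helper for illustration; the interval is chosen by the caller.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request `url`; return the seconds waited."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            # Wait out whatever is left of the minimum interval.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = now + delay
        return delay
```

A single-threaded crawler would call `throttle.wait(url)` right before each request, for example with `throttle = PoliteThrottle(2.0)`.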
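## Where are the gray areas for crawlers?
Although everything above sounds clear-cut, it is actually quite vague. For example:
- How much load counts as not affecting someone else's server?
- How does copyright apply to publicly posted articles?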
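## What tacit agreement have crawlers and crawled sites settled on?
Despite the many gray areas, there is one convention that isn't written into law but that everyone recognizes and follows: robots.txt.
It states which crawlers the site owner allows to crawl which data, and what is off limits.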
Take a look at csdn's robots.txt:
User-agent: *
Disallow: /scripts
Disallow: /public
Disallow: /css/
Disallow: /images/
Disallow: /content/
Disallow: /ui/
Disallow: /js/
Disallow: /scripts/
Disallow: /article_preview.html*
Disallow: /tag/
Disallow: /*?*
Disallow: /link/
Sitemap: http://www.csdn.net/article/sitemap.txt
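It explicitly says not to crawl its static resource directories or the uncategorized article preview pages. Anything else that is not restricted you can, in theory, crawl.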
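As a rough sketch of how a crawler could check these rules before fetching a page, Python's standard `urllib.robotparser` can parse the text directly. The helper name `may_crawl`, the user agent `my-news-crawler`, and the example paths are assumptions made up for illustration. As far as I know, this parser does plain prefix matching and does not expand `*` wildcards, so lines such as `Disallow: /*?*` are not really enforced by this sketch.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the csdn robots.txt quoted above.
CSDN_ROBOTS = """\
User-agent: *
Disallow: /scripts
Disallow: /public
Disallow: /css/
Disallow: /images/
Disallow: /content/
Disallow: /ui/
Disallow: /js/
Disallow: /scripts/
Disallow: /article_preview.html*
Disallow: /tag/
Disallow: /*?*
Disallow: /link/
"""

# Parse from a string so the example runs without any network access.
csdn_rules = RobotFileParser()
csdn_rules.parse(CSDN_ROBOTS.splitlines())


def may_crawl(url, user_agent="my-news-crawler"):
    """Return True if the parsed rules let `user_agent` fetch `url`.

    `my-news-crawler` is a hypothetical crawler name.
    """
    return csdn_rules.can_fetch(user_agent, url)


# Hypothetical URLs, used only to exercise the rules.
print(may_crawl("http://www.csdn.net/css/site.css"))
print(may_crawl("http://www.csdn.net/nav/ai"))
```

In a real crawler you would call `set_url(...)` and `read()` on the parser to download the site's current robots.txt instead of hard-coding a copy.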
Now let's look at the forefathers of all crawlers, the search engines.
First, Baidu's:
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
User-agent: *
Disallow: /
Let's focus on one group:
User-agent: *
Disallow: /
If you are not one of the search engines it names, you may not crawl a single piece of data!
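To make the effect of that catch-all group concrete, here is a small sketch that parses a shortened excerpt of the Baidu rules above (only the `Baiduspider` group and the final `*` group) with the standard `urllib.robotparser`. The function name `allowed_on_baidu`, the crawler name `my-news-crawler`, and the path `/about` are hypothetical and exist only for this example.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the Baidu rules quoted above: one named group
# plus the catch-all group at the end of the file.
BAIDU_EXCERPT = """\
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: *
Disallow: /
"""

baidu_rules = RobotFileParser()
baidu_rules.parse(BAIDU_EXCERPT.splitlines())


def allowed_on_baidu(user_agent, path):
    """Check a path on www.baidu.com against the excerpt for `user_agent`."""
    return baidu_rules.can_fetch(user_agent, "https://www.baidu.com" + path)


# A named search engine falls under its own group ...
print(allowed_on_baidu("Baiduspider", "/about"))
# ... while an unnamed crawler falls under "User-agent: *" and is blocked.
print(allowed_on_baidu("my-news-crawler", "/about"))
```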
Now look at Google's:
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=&
Allow: /?hl=&gws_rd=ssl$
Disallow: /?hl=&&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
Disallow: /u/
Disallow: /preferences
Disallow: /setprefs
Disallow: /default
Disallow: /m?
Disallow: /m/
Allow: /m/finance
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallow: /xhtml/search?
Disallow: /xml?
Disallow: /imode?
Disallow: /imode/?
Disallow: /imode/search?
Disallow: /jsky?
Disallow: /jsky/?
Disallow: /jsky/search?
Disallow: /pda?
Disallow: /pda/?
Disallow: /pda/search?
Disallow: /sprint_xhtml
Disallow: /sprint_wml
Disallow: /pqa
Disallow: /palm
Disallow: /gwt/
Disallow: /purchases
Disallow: /local?
Disallow: /localurl
Disallow: /shihui?
Disallow: /shihui/
Disallow: /products?
Disallow: /product
Disallow: /products_
Disallow: /products;
Disallow: /print
Disallow: /books/
Disallow: /bkshp?q=
Disallow: /books?q=
Disallow: /books?output=
Disallow: /books?pg=
Disallow: /books?jtp=
Disallow: /books?jscmd=
Disallow: /books?buy=
Disallow: /books?zoom=
Allow: /books?q=related:
Allow: /books?q=editions:
Allow: /books?q=subject:
Allow: /books/about
Allow: /booksrightsholders
Allow: /books?zoom=1
Allow: /books?zoom=5
Allow: /books/content?zoom=1
Allow: /books/content?zoom=5
Disallow: /ebooks/
Disallow: /ebooks?q=
Disallow: /ebooks?output=
Disallow: /ebooks?pg=
Disallow: /ebooks?jscmd=
Disallow: /ebooks?buy=
Disallow: /ebooks?zoom=
Allow: /ebooks?q=related:
Allow: /ebooks?q=editions:
Allow: /ebooks?q=subject:
Allow: /ebooks?zoom=1
Allow: /ebooks?zoom=5
Disallow: /patents?
Disallow: /patents/download/
Disallow: /patents/pdf/
Disallow: /patents/related/
Disallow: /scholar
Disallow: /citations?
Allow: /citations?user=
Disallow: /citations?cstart=
Allow: /citations?view_op=new_profile
Allow: /citations?view_op=top_venues
Allow: /scholarshare
Disallow: /s?
Allow: /maps?output=classic
Allow: /maps?*file=
Allow: /maps/d/
Disallow: /maps?
Disallow: /mapstt?
Disallow: /mapslt?
Disallow: /maps/stk/
Disallow: /maps/br?
Disallow: /mapabcpoi?
Disallow: /maphp?
Disallow: /mapprint?
Disallow: /maps/api/js/
Allow: /maps/api/js
Disallow: /maps/api/place/js/
Disallow: /maps/api/staticmap
Disallow: /maps/api/streetview
Disallow: /maps//sw/manifest.json
Disallow: /mld?
Disallow: /staticmap?
Disallow: /maps/preview
Disallow: /maps/place
Disallow: /maps/timeline/
Disallow: /help/maps/streetview/partners/welcome/
Disallow: /help/maps/indoormaps/partners/
Disallow: /lochp?
Disallow: /center
Disallow: /ie?
Disallow: /blogsearch/
Disallow: /blogsearch_feeds
Disallow: /advanced_blog_search
Disallow: /uds/
Disallow: /chart?
Disallow: /transit?
Allow: /calendar$
Allow: /calendar/about/
Disallow: /calendar/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/
Disallow: /coop/directory
Disallow: /coop/manage
Disallow: /trends?
Disallow: /trends/music?
Disallow: /trends/hottrends?
Disallow: /trends/viz?
Disallow: /trends/embed.js?
Disallow: /trends/fetchComponent?
Disallow: /trends/beta
Disallow: /trends/topics
Disallow: /musica
Disallow: /musicad
Disallow: /musicas
Disallow: /musicl
Disallow: /musics
Disallow: /musicsearch
Disallow: /musicsp
Disallow: /musiclp
Disallow: /urchin_test/
Disallow: /movies?
Disallow: /wapsearch?
Allow: /safebrowsing/diagnostic
Allow: /safebrowsing/report_badware/
Allow: /safebrowsing/report_error/
Allow: /safebrowsing/report_phish/
Disallow: /reviews/search?
Disallow: /orkut/albums
Disallow: /cbk
Disallow: /recharge/dashboard/car
Disallow: /recharge/dashboard/static/
Disallow: /profiles/me
Allow: /profiles
Disallow: /s2/profiles/me
Allow: /s2/profiles
Allow: /s2/oz
Allow: /s2/photos
Allow: /s2/search/social
Allow: /s2/static
Disallow: /s2
Disallow: /transconsole/portal/
Disallow: /gcc/
Disallow: /aclk
Disallow: /cse?
Disallow: /cse/home
Disallow: /cse/panel
Disallow: /cse/manage
Disallow: /tbproxy/
Disallow: /imesync/
Disallow: /shenghuo/search?
Disallow: /support/forum/search?
Disallow: /reviews/polls/
Disallow: /hosted/images/
Disallow: /ppob/?
Disallow: /ppob?
Disallow: /accounts/ClientLogin
Disallow: /accounts/ClientAuth
Disallow: /accounts/o8
Allow: /accounts/o8/id
Disallow: /topicsearch?q=
Disallow: /xfx7/
Disallow: /squared/api
Disallow: /squared/search
Disallow: /squared/table
Disallow: /qnasearch?
Disallow: /app/updates
Disallow: /sidewiki/entry/
Disallow: /quality_form?
Disallow: /labs/popgadget/search
Disallow: /buzz/post
Disallow: /compressiontest/
Disallow: /analytics/feeds/
Disallow: /analytics/partners/comments/
Disallow: /analytics/portal/
Disallow: /analytics/uploads/
Allow: /alerts/manage
Allow: /alerts/remove
Disallow: /alerts/
Allow: /alerts/$
Disallow: /ads/search?
Disallow: /ads/plan/action_plan?
Disallow: /ads/plan/api/
Disallow: /ads/hotels/partners
Disallow: /phone/compare/?
Disallow: /travel/clk
Disallow: /travel/hotelier/terms/
Disallow: /hotelfinder/rpc
Disallow: /hotels/rpc
Disallow: /commercesearch/services/
Disallow: /evaluation/
Disallow: /chrome/browser/mobile/tour
Disallow: /compare//apply
Disallow: /forms/perks/
Disallow: /shopping/suppliers/search
Disallow: /ct/
Disallow: /edu/cs4hs/
Disallow: /trustedstores/s/
Disallow: /trustedstores/tm2
Disallow: /trustedstores/verify
Disallow: /adwords/proposal
Disallow: /shopping/product/
Disallow: /shopping/seller
Disallow: /shopping/ratings/account/metrics
Disallow: /shopping/reviewer
Disallow: /about/careers/applications/
Disallow: /landing/signout.html
Disallow: /webmasters/sitemaps/ping?
Disallow: /ping?
Disallow: /gallery/
Disallow: /landing/now/ontap/
Allow: /searchhistory/
Allow: /maps/reserve
Allow: /maps/reserve/partners
Disallow: /maps/reserve/api/
Disallow: /maps/reserve/search
Disallow: /maps/reserve/bookings
Disallow: /maps/reserve/settings
Disallow: /maps/reserve/manage
Disallow: /maps/reserve/payment
Disallow: /maps/reserve/receipt
Disallow: /maps/reserve/sellersignup
Disallow: /maps/reserve/payments
Disallow: /maps/reserve/feedback
Disallow: /maps/reserve/terms
Disallow: /maps/reserve/m/
Disallow: /maps/reserve/b/
Disallow: /maps/reserve/partner-dashboard
Disallow: /about/views/
Disallow: /intl/*/about/views/
Disallow: /local/dining/
Disallow: /local/place/products/
Disallow: /local/place/reviews/
Disallow: /local/place/rap/
Disallow: /local/tab/
Allow: /finance
Allow: /js/
# AdsBot
User-agent: AdsBot-Google
Disallow: /maps/api/js/
Allow: /maps/api/js
Disallow: /maps/api/place/js/
Disallow: /maps/api/staticmap
Disallow: /maps/api/streetview
# Certain social media sites are whitelisted to allow crawlers to access page markup when links to google.com/imgres* are shared. To learn more, please contact [email protected]
User-agent: Twitterbot
Allow: /imgres
User-agent: facebookexternalhit
Allow: /imgres
Sitemap: https://www.google.com/sitemap.xml
Google's file is fairly long, but we only need to focus on one line:
User-agent: *
Because this group applies to every crawler, anything it allows, we can crawl too!
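Google's file mixes `Allow` and `Disallow` lines for overlapping prefixes, such as `Disallow: /search` next to `Allow: /search/about`. Under the Robots Exclusion Protocol (RFC 9309), the most specific matching rule, meaning the longest one, decides, and a tie goes to `Allow`. The sketch below applies that idea to a handful of rules copied from the listing above. It is a simplification I wrote for illustration: it handles plain prefixes only and ignores the `*` and `$` patterns, and the function name `is_allowed` is my own.

```python
# A few (directive, path prefix) pairs copied from the Google rules above.
GOOGLE_EXCERPT = [
    ("disallow", "/search"),
    ("allow", "/search/about"),
    ("disallow", "/m/"),
    ("allow", "/m/finance"),
    ("allow", "/finance"),
]


def is_allowed(path, rules=GOOGLE_EXCERPT):
    """Return True if `path` may be crawled under longest-prefix matching.

    Simplified: plain prefixes only, no `*` or `$` handling.
    """
    best_len = -1
    best_allow = True  # no matching rule at all means the path is allowed
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        allow = directive == "allow"
        # A longer prefix is more specific; on equal length, prefer allow.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            best_allow = allow
    return best_allow


for path in ["/search", "/search/about", "/m/finance", "/m/other"]:
    print(path, is_allowed(path))
```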
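## Finally, is the site I wrote illegal?
- 1. I don't bring down the other side's servers or put any meaningful load on them
My crawler runs once every 10 minutes and fetches 3 to 5 news items per run, and it doesn't load the sites' images, JS, and the like.
- 2. These sites' robots.txt files don't restrict what I crawl
- 3. I don't harm these sites' interests
Although I do obtain these sites' news content, I don't display it directly; readers have to open the article at its original location.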
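The schedule described in point 1 can be turned into a quick back-of-the-envelope check. The sketch below is not the site's actual code: the names `pick_batch` and `requests_per_hour` are hypothetical, and skipping already-seen URLs is my assumption about how "new" items would be chosen.

```python
CRAWL_INTERVAL_SECONDS = 10 * 60  # one run every 10 minutes, as described above
MAX_ITEMS_PER_RUN = 5             # upper end of "3 to 5 news items" per run


def pick_batch(candidate_urls, seen, limit=MAX_ITEMS_PER_RUN):
    """Return at most `limit` URLs not fetched before, and mark them as seen."""
    batch = []
    for url in candidate_urls:
        if len(batch) >= limit:
            break
        if url in seen:
            continue
        seen.add(url)
        batch.append(url)
    return batch


def requests_per_hour(interval_seconds=CRAWL_INTERVAL_SECONDS,
                      per_run=MAX_ITEMS_PER_RUN):
    """Upper bound on article requests sent to one site per hour."""
    return (3600 // interval_seconds) * per_run


print(requests_per_hour())  # 6 runs per hour * 5 items = 30
```

With these numbers a site sees at most 30 article requests per hour from the crawler, not counting the request for the listing page itself.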
Original article: https://blog.51cto.com/14633800/2456580