Quantifying Hacker News with 50 days of data

Quantifying Hacker News

I thought it would be fun to analyze the activity on one of my favorite sources of interesting links and information, Hacker News. My data source is a script I set up some time in August that downloads HN (the Front page and the New stories page) every minute. We will visualize the stories as they get upvoted during the day, and figure out which domains and users are most popular, which topics are most popular, and the best time to post a story. I'm making all my data and code (Python data collection scripts + IPython Notebook for analysis) available in case you'd like to carry out a similar analysis.

Data collection protocol

I set up a very simple Python script that scrapes the HN front page and the new stories page every minute. A single day of data begins at 4am (PST) and ends at 4am the next day. The raw .html files are saved as gzipped pickles, and one day occupies roughly 10 MB in this format. I had to bring down my machine for a few days a few times, so there are some gaps in the data, but in the end we get 47 days of data from the period between August 22 and October 30.
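To make the storage scheme concrete, here is a minimal sketch of how snapshots could be bucketed into 4am-to-4am days and written as gzipped pickles. This is not the original collection script: the function names, the tuple layout of a snapshot, and the fixed UTC-8 offset (which ignores daylight saving time) are all assumptions made for illustration, and the code targets Python 3.

```python
import gzip
import pickle
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset stands in for PST; DST is ignored.
PST = timezone(timedelta(hours=-8))
DAY_START_HOUR = 4  # a "day" of data runs from 4am to 4am the next day


def day_key(ts):
    """Return the YYYY-MM-DD label of the 4am-to-4am day containing ts.

    ts must be a timezone-aware datetime.
    """
    local = ts.astimezone(PST)
    # Shifting back by 4 hours maps 04:00..03:59 onto one calendar date.
    return (local - timedelta(hours=DAY_START_HOUR)).strftime("%Y-%m-%d")


def save_day(path, snapshots):
    """Write one day's snapshots to disk as a gzipped pickle.

    snapshots is a hypothetical list of
    (timestamp, front_page_html, new_page_html) tuples.
    """
    with gzip.open(path, "wb") as f:
        pickle.dump(snapshots, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_day(path):
    """Read back a list of snapshots written by save_day."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

With this layout, a snapshot taken at 3:59am on August 23 still belongs to the August 22 file, while one taken a minute later starts the August 23 file.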

Raw HTML data parsing

The parsing Python script uses BeautifulSoup to convert the raw HTML into more structured JSON. This script was, by the way, by no means simple to write: HN is based on unstructured tables and I had to discover many strange edge cases in its behavior along the way. I ended up with a 100-line ugliest-parsing-function-ever (really, I'm not proud of it), but it works and outputs something like the following for a single story at a specific snapshot:

{
    'domain': u'play.google.com', 'title': u'Nexus 5',
    'url': u'https://play.google.com/store/devices/details?id=nexus_5_black_16gb',
    'num_comments': 42, 'rank': 1, 'points': 65,
    'user': u'sonier', 'minutes_ago': 39, 'id': u'6648519'
}
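Two of the fields in this record are derived values rather than text copied straight from the page. The sketch below shows one plausible way to compute them; it is not the original parsing function. It assumes that story ages appear on the page as strings such as "39 minutes ago" or "2 hours ago", and that a leading "www." is dropped from domains. Both helper names are hypothetical, and the code is written for Python 3 (the record above was printed by Python 2).

```python
import re
from urllib.parse import urlparse

# Minutes per unit for the relative-age strings assumed above.
_UNIT_MINUTES = {"minute": 1, "hour": 60, "day": 24 * 60}


def age_to_minutes(text):
    """Convert a relative age such as '2 hours ago' into minutes."""
    match = re.match(r"\s*(\d+)\s+(minute|hour|day)s?\s+ago", text)
    if match is None:
        raise ValueError("unrecognized age string: %r" % text)
    return int(match.group(1)) * _UNIT_MINUTES[match.group(2)]


def url_to_domain(url):
    """Extract the host name from a story URL, dropping a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host
```

Applied to the sample record, `url_to_domain` gives `play.google.com` for the story URL, and `age_to_minutes("39 minutes ago")` gives 39.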

We get 60 such entries every minute (30 for front page and 30 for new page) and these are again all saved to disk. We are now ready to bring out the IPython Notebook and get to the juicy analysis!
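As an illustration of what these per-minute records make possible, the following sketch groups them by story id to recover how each story's points evolve over time. The input layout (a sequence of `(minute_index, records)` pairs) and the function name are assumptions for this example, not the format used in the notebook.

```python
from collections import defaultdict


def points_over_time(snapshots):
    """Build a {story_id: [(minute, points), ...]} mapping.

    snapshots is a hypothetical iterable of (minute_index, records) pairs,
    where records is the list of story dicts parsed from one snapshot.
    """
    series = defaultdict(list)
    for minute, records in snapshots:
        seen = set()
        for rec in records:
            # A story can show up on both the front page and the new page
            # in the same minute; keep only its first occurrence.
            if rec["id"] in seen:
                continue
            seen.add(rec["id"])
            series[rec["id"]].append((minute, rec["points"]))
    return dict(series)
```

Each resulting list can be plotted directly as a points-versus-time curve for one story.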

The Analysis

Head over to the IPython Notebook rendered as HTML for the analysis:

Note: I had the entire dataset and the .ipynb IPython Notebook source available for download, but recently took them down to save space on my host (sorry).

from: http://karpathy.github.io/2013/11/27/quantifying-hacker-news/

Time: 2024-11-05 20:36:34
