Running Data Analytics

How to install and run the analytics backend locally:

We have had some trouble getting people up and running locally with the analytics backend, so I wrote up a quick installation guide. If you run into any undocumented issues, please document them here. These instructions were performed on a clean install of Ubuntu 14.04.

Clone the analytics repositories

1. Navigate to the folder you want to install into and git clone these repositories:

  • edx/edx-analytics-data-api
  • edx/edx-analytics-pipeline
  • edx/edx-analytics-data-api-client
  • edx/edx-analytics-dashboard
cd {PATH_TO_EDX_FOLDER}/analytics

git clone https://github.com/edx/edx-analytics-pipeline.git
git clone https://github.com/edx/edx-analytics-data-api.git
git clone https://github.com/edx/edx-analytics-data-api-client.git
git clone https://github.com/edx/edx-analytics-dashboard.git

2. Create virtual environments in which to run the repositories

  • It is best to create a separate virtual environment for each repository; otherwise, you may run into conflicts between their dependencies.
mkdir ~/.venvs
cd ~/.venvs
virtualenv edx-analytics-pipeline
virtualenv edx-analytics-data-api
virtualenv edx-analytics-data-api-client
virtualenv edx-analytics-dashboard
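
As an optional sanity check, you can activate one of the new environments and confirm that python resolves inside it before moving on:

source ~/.venvs/edx-analytics-pipeline/bin/activate
which python   # should print a path under ~/.venvs/edx-analytics-pipeline/bin/
deactivate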

Install the dependencies

  • You will need to activate and deactivate each virtualenv in turn
  • Once you think the dependencies are installed, check them by running the repository's unit tests.
  • If the unit tests complete successfully, you will see output of the form "Ran X tests in Ys \n\n OK"

1. Installing edx-analytics-pipeline:

cd {PATH_TO_EDX_FOLDER}/analytics
cd edx-analytics-pipeline/
source ~/.venvs/edx-analytics-pipeline/bin/activate
make requirements
make test

If this raises a NoAuthHandlerFound error from boto, run:

export AWS_ACCESS_KEY_ID="TESTACCESSKEY"
export AWS_SECRET_ACCESS_KEY="TESTSECRET"
make test

In production we would supply real AWS credentials to boto, but the test suite only requires that some credentials are set; it does not care whether they are valid.
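
If you prefer not to export these variables in every new shell, boto can also read credentials from a ~/.boto file. A minimal sketch with the same dummy values (these are placeholders, not real keys):

cat > ~/.boto <<'EOF'
[Credentials]
aws_access_key_id = TESTACCESSKEY
aws_secret_access_key = TESTSECRET
EOF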

deactivate

2. Installing edx-analytics-data-api:

cd ../edx-analytics-data-api
source ~/.venvs/edx-analytics-data-api/bin/activate
make develop
./manage.py migrate --noinput
./manage.py migrate --noinput --database=analytics
./manage.py set_api_key edx edx
make validate
deactivate

3. Installing edx-analytics-data-api-client:

cd ../edx-analytics-data-api-client/
source ~/.venvs/edx-analytics-data-api-client/bin/activate
pip install -r requirements.txt
make test
deactivate

4. Installing edx-analytics-dashboard:

cd ../edx-analytics-dashboard/
source ~/.venvs/edx-analytics-dashboard/bin/activate
sudo apt-get update
sudo apt-get install gettext
sudo apt-get install npm
sudo apt-get install openjdk-7-jre
sudo apt-get install openjdk-7-jdk
sudo apt-get install libxml2-dev libxslt-dev python-dev zlib1g-dev
make develop
make validate

If this raises an OfflineGenerationError for missing compression keys, run:

./manage.py compress --settings=analytics_dashboard.settings.test
make validate
deactivate
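
Although the remaining steps focus on the pipeline and the data API, you can also do a quick smoke test of the dashboard dev server at this point. This is a sketch, assuming the repository provides an analytics_dashboard.settings.local module alongside the analytics_dashboard.settings.test module used above; the port 9000 is arbitrary:

source ~/.venvs/edx-analytics-dashboard/bin/activate
./manage.py runserver 9000 --settings=analytics_dashboard.settings.local
# stop the server with Ctrl+C when done, then:
deactivate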

Run a pipeline task locally and verify its completion

1. Install MySQL locally and create a credentials file for the pipeline

sudo apt-get install mysql-server
mysql -u root -p

CREATE USER 'analytics'@'localhost' IDENTIFIED BY 'edx';
GRANT ALL PRIVILEGES ON *.* TO 'analytics'@'localhost';
FLUSH PRIVILEGES;
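
If the analytics database referenced by the credentials file below does not already exist, create it now while you are still in the mysql shell (the pipeline writes its result tables into it but may not create the database itself), then leave the shell:

CREATE DATABASE IF NOT EXISTS analytics;
exit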

cd {PATH_TO_EDX_FOLDER}/analytics
vi mysql_creds
***BEGIN mysql_creds FILE***
{
"host": "127.0.0.1",
"port": "3306",
"username": "analytics",
"password": "edx",
"database": "analytics"
}
***END mysql_creds FILE***
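To confirm these credentials work before running the pipeline, you can try them directly from the command line:

mysql -h 127.0.0.1 -P 3306 -u analytics -pedx -e "SELECT 1;"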
cd edx-analytics-pipeline
vi override.cfg
***BEGIN override.cfg***
[database-export]
database = analytics
credentials = {PATH_TO_EDX_FOLDER}/analytics/mysql_creds

[database-import]
database = edxprod
destination = s3://<bucket for intermediate hadoop products>/intermediate/database-import
credentials = s3://<secrets bucket>/edxapp_prod_ro_mysql_creds

[event-logs]
expand_interval = 2 days
pattern = .*tracking.log-(?P<date>[0-9]+).*
source = s3://<bucket to where all tracking logs are synched>/tracking/

[hive]
warehouse_path = s3://<bucket for intermediate hadoop products>/warehouse/hive/

[manifest]
path = s3://<bucket for intermediate hadoop products>/user-activity-file-manifests/manifest
lib_jar = s3://<secrets bucket>/oddjob-1.0.1-standalone-modified.jar
input_format = oddjob.ManifestTextInputFormat

[enrollments]
blacklist_date = 2001-01-01
blacklist_path = /tmp/blacklist

[answer-distribution]
valid_response_types = customresponse,choiceresponse,optionresponse,multiplechoiceresponse,numericalresponse,stringresponse,formularesponse
***END override.cfg***
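
A quick way to confirm that override.cfg at least parses before launching a task (the bucket placeholders above still need real values, or local paths if you are not using S3) is to load it with the Python 2 stdlib parser:

python -c "import ConfigParser; c = ConfigParser.ConfigParser(); c.read('override.cfg'); print(c.sections())"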

2. Acquire a log file (or create a dummy one)

mkdir /tmp/log_files
cd /tmp/log_files

At this point, you can either acquire a log file from S3 or from another developer, or use the dummy file below (include the empty line at the end). Either way, place it in /tmp/log_files.

vi tracking.log-20150101-1234567890
*** BEGIN DUMMY LOG FILE ***
{"username": "test_user", "host": "class.stanford.edu", "event_source": "server", "event_type": "problem_check", "context": {"course_id": "edX/DemoX/DemoCourse", "course_user_tags": {}, "user_id": 555555, "org_id": "Education", "module": {"display_name": "Quiz - Reasoning"}}, "time": "2014-06-23T16:17:16.856434+00:00", "ip": "0.0.0.0", "event": {"submission": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"input_type": "checkboxgroup", "question": "Choose as many as you like.", "response_type": "choiceresponse", "answer": ["Reasoning is the essence of what mathematics is", "Reasoning is useful for working in most jobs", "Reasoning allows people to connect ideas and make mathematical breakthroughs"], "variant": "", "correct": false}}, "success": "incorrect", "grade": 0, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "state": {"student_answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_2"]}, "seed": 1, "done": true, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "input_state": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {}}}, "answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_0", "choice_1", "choice_2"]}, "attempts": 2, "max_grade": 1, "problem_id": "i4x://edX/DemoX-S/problem/a58470ee54cc49ecb2bb7c1b1c0ab43a"}, "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0", "page": "x_module"}
{"username": "test_user_alt", "host": "class.stanford.edu", "event_source": "server", "event_type": "problem_check", "context": {"course_id": "edX/DemoX/DemoCourse", "course_user_tags": {}, "user_id": 555556, "org_id": "Education", "module": {"display_name": "Quiz - Reasoning"}}, "time": "2014-06-22T16:17:16.856434+00:00", "ip": "0.0.0.0", "event": {"submission": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"input_type": "checkboxgroup", "question": "Choose as many as you like.", "response_type": "choiceresponse", "answer": ["Reasoning is the essence of what mathematics is", "Reasoning is useful for working in most jobs", "Reasoning allows people to connect ideas and make mathematical breakthroughs"], "variant": "", "correct": false}}, "success": "incorrect", "grade": 0, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "state": {"student_answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_2"]}, "seed": 1, "done": true, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "input_state": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {}}}, "answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_4", "choice_5", "choice_6"]}, "attempts": 2, "max_grade": 1, "problem_id": "i4x://edX/DemoX-S/problem/a58470ee54cc49ecb2bb7c1b1c0ab43a"}, "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0", "page": "x_module"}
{"username": "test_user", "host": "class.stanford.edu", "event_source": "server", "event_type": "problem_check", "context": {"course_id": "edX/DemoX/DemoCourse", "course_user_tags": {}, "user_id": 555555, "org_id": "Education", "module": {"display_name": "Quiz - Reasoning"}}, "time": "2014-06-22T16:17:16.856434+00:00", "ip": "0.0.0.0", "event": {"submission": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"input_type": "checkboxgroup", "question": "Choose as many as you like.", "response_type": "choiceresponse", "answer": ["Reasoning is the essence of what mathematics is", "Reasoning is useful for working in most jobs", "Reasoning allows people to connect ideas and make mathematical breakthroughs"], "variant": "", "correct": false}}, "success": "incorrect", "grade": 0, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "state": {"student_answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_2"]}, "seed": 1, "done": true, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "input_state": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {}}}, "answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_4", "choice_5", "choice_6"]}, "attempts": 2, "max_grade": 1, "problem_id": "i4x://edX/DemoX-S/problem/a58470ee54cc49ecb2bb7c1b1c0ab43a"}, "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0", "page": "x_module"}
*** END DUMMY LOG FILE ***
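
Each line of a tracking log must be a self-contained JSON document, so if you hand-edited the file it is worth checking that every non-empty line parses (the file name matches the one created above):

python -c "import json; [json.loads(l) for l in open('/tmp/log_files/tracking.log-20150101-1234567890') if l.strip()]" && echo "log file parses"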

3. Run the API locally and query for results of the pipeline's aggregation

cd {PATH_TO_EDX_FOLDER}/analytics/edx-analytics-pipeline
source ~/.venvs/edx-analytics-pipeline/bin/activate
launch-task AnswerDistributionToMySQLTaskWorkflow --local-scheduler --remote-log-level DEBUG --include "*tracking.log*" --src /tmp/log_files --dest /tmp/answer_dist --mapreduce-engine local --name test_task
mysql -u root -p

USE analytics;
SELECT COUNT(*) FROM answer_distribution;

If the pipeline task ran successfully (and you used the dummy file above), this should be the output:

+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)
exit
deactivate
cd ../edx-analytics-data-api
source ~/.venvs/edx-analytics-data-api/bin/activate
./manage.py runserver --settings=analyticsdataserver.settings.local_mysql

Verify that the data API can connect to the database

1. Navigate to 127.0.0.1:8000 in your web browser:

  • If the page does not display and you see ImproperlyConfigured: Error loading MySQLdb module in the logs, run: 'pip install mysql-python'
  • If the page returns a 401 (unauthorized) error, you need to rerun: './manage.py set_api_key edx edx'

2. Click on the answer_distribution query modal and enter 'i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1' into the box (or a different module_id from your logs if you didn't use the dummy log file from above)

3. Click to request the data from the API, and the results should match the log file from above (or whichever you used)
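
If you prefer the command line, you can issue the equivalent request with curl, authenticating with the API key created earlier via set_api_key. The /api/v0/problems/<module_id>/answer_distribution/ path below is my assumption about the endpoint behind the modal; adjust it if your API exposes a different version or route:

curl -H "Authorization: Token edx" "http://127.0.0.1:8000/api/v0/problems/i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1/answer_distribution/"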


近年来人工智能在语音识别.自然语言处理.计算机视觉等诸多领域取得了巨大成功,但人工智能要规模化商用仍然存在一些问题和挑战.例如面包店要自动识别面包种类及数量实现无人结算,如调用通用图像识别服务,虽然易用,但识别准确率低,无法准确区分面包种类:如定制化模型,虽然可以提升识别的准确率,但对于大多数企业或个人来说,聘请专家定制模型费用高.周期长,后续更新难而自行开发由于专业要求高,难以实现.ModelArts自动学习是一个帮助人们实现特定AI应用的低门槛.高灵活.零代码的定制化模型开发工具,目的是让每