Running Data Analytics

How to install and run the analytics backend locally:

We have had some trouble getting people up and running locally with the analytics backend, so I wrote up a quick installation guide. If you run into any undocumented issues, please document them here. These instructions were performed on a clean install of Ubuntu 14.04.

Clone the analytics repositories

1. Navigate to the folder you want to install into and git clone these repositories:

  • edx/edx-analytics-data-api
  • edx/edx-analytics-pipeline
  • edx/edx-analytics-data-api-client
  • edx/edx-analytics-dashboard
cd {PATH_TO_EDX_FOLDER}/analytics

git clone https://github.com/edx/edx-analytics-pipeline.git
git clone https://github.com/edx/edx-analytics-data-api.git
git clone https://github.com/edx/edx-analytics-data-api-client.git
git clone https://github.com/edx/edx-analytics-dashboard.git

2. Create virtual environments in which to run the repositories

  • It is best to create a separate virtual environment for each repository; otherwise, you may run into conflicts between their dependencies.
mkdir ~/.venvs
cd ~/.venvs
virtualenv edx-analytics-pipeline
virtualenv edx-analytics-data-api
virtualenv edx-analytics-data-api-client
virtualenv edx-analytics-dashboard
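
As an optional sanity check, you can activate one of the new environments and confirm that python resolves inside it before moving on:

source ~/.venvs/edx-analytics-pipeline/bin/activate
which python   # should print a path under ~/.venvs/edx-analytics-pipeline/bin/
deactivate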

Install the dependencies

  • You will need to activate and deactivate each virtualenv in turn
  • Once you think the dependencies are installed, check them by running the repository's unit tests.
  • If the unit tests complete successfully, you will see output of the form "Ran X tests in Ys \n\n OK"

1. Installing edx-analytics-pipeline:

cd {PATH_TO_EDX_FOLDER}/analytics
cd edx-analytics-pipeline/
source ~/.venvs/edx-analytics-pipeline/bin/activate
make requirements
make test

If this raises a NoAuthHandlerFound error from boto, run:

export AWS_ACCESS_KEY_ID="TESTACCESSKEY"
export AWS_SECRET_ACCESS_KEY="TESTSECRET"
make test

In production we would supply real AWS credentials to boto, but the test suite only requires that some credentials are set; it does not care whether they are valid.
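
If you prefer not to export these variables in every new shell, boto can also read credentials from a ~/.boto file. A minimal sketch with the same dummy values (these are placeholders, not real keys):

cat > ~/.boto <<'EOF'
[Credentials]
aws_access_key_id = TESTACCESSKEY
aws_secret_access_key = TESTSECRET
EOF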

deactivate

2. Installing edx-analytics-data-api:

cd ../edx-analytics-data-api
source ~/.venvs/edx-analytics-data-api/bin/activate
make develop
./manage.py migrate --noinput
./manage.py migrate --noinput --database=analytics
./manage.py set_api_key edx edx
make validate
deactivate

3. Installing edx-analytics-data-api-client:

cd ../edx-analytics-data-api-client/
source ~/.venvs/edx-analytics-data-api-client/bin/activate
pip install -r requirements.txt
make test
deactivate

4. Installing edx-analytics-dashboard:

cd ../edx-analytics-dashboard/
source ~/.venvs/edx-analytics-dashboard/bin/activate
sudo apt-get update
sudo apt-get install gettext
sudo apt-get install npm
sudo apt-get install openjdk-7-jre
sudo apt-get install openjdk-7-jdk
sudo apt-get install libxml2-dev libxslt-dev python-dev zlib1g-dev
make develop
make validate

If this raises an OfflineGenerationError for missing compression keys, run:

./manage.py compress --settings=analytics_dashboard.settings.test
make validate
deactivate
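
Although the remaining steps focus on the pipeline and the data API, you can also do a quick smoke test of the dashboard dev server at this point. This is a sketch, assuming the repository provides an analytics_dashboard.settings.local module alongside the analytics_dashboard.settings.test module used above; the port 9000 is arbitrary:

source ~/.venvs/edx-analytics-dashboard/bin/activate
./manage.py runserver 9000 --settings=analytics_dashboard.settings.local
# stop the server with Ctrl+C when done, then:
deactivate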

Run a pipeline task locally and verify its completion

1. Install MySQL locally and create a credentials file for the pipeline

sudo apt-get install mysql-server
mysql -u root -p

CREATE USER 'analytics'@'localhost' IDENTIFIED BY 'edx';
GRANT ALL PRIVILEGES ON *.* TO 'analytics'@'localhost';
FLUSH PRIVILEGES;
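
If the analytics database referenced by the credentials file below does not already exist, create it now while you are still in the mysql shell (the pipeline writes its result tables into it but may not create the database itself), then leave the shell:

CREATE DATABASE IF NOT EXISTS analytics;
exit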

cd {PATH_TO_EDX_FOLDER}/analytics
vi mysql_creds
***BEGIN mysql_creds FILE***
{
"host": "127.0.0.1",
"port": "3306",
"username": "analytics",
"password": "edx",
"database": "analytics"
}
***END mysql_creds FILE***
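To confirm these credentials work before running the pipeline, you can try them directly from the command line:

mysql -h 127.0.0.1 -P 3306 -u analytics -pedx -e "SELECT 1;"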
cd edx-analytics-pipeline
vi override.cfg
***BEGIN override.cfg***
[database-export]
database = analytics
credentials = {PATH_TO_EDX_FOLDER}/analytics/mysql_creds

[database-import]
database = edxprod
destination = s3://<bucket for intermediate hadoop products>/intermediate/database-import
credentials = s3://<secrets bucket>/edxapp_prod_ro_mysql_creds

[event-logs]
expand_interval = 2 days
pattern = .*tracking.log-(?P<date>[0-9]+).*
source = s3://<bucket to where all tracking logs are synched>/tracking/

[hive]
warehouse_path = s3://<bucket for intermediate hadoop products>/warehouse/hive/

[manifest]
path = s3://<bucket for intermediate hadoop products>/user-activity-file-manifests/manifest
lib_jar = s3://<secrets bucket>/oddjob-1.0.1-standalone-modified.jar
input_format = oddjob.ManifestTextInputFormat

[enrollments]
blacklist_date = 2001-01-01
blacklist_path = /tmp/blacklist

[answer-distribution]
valid_response_types = customresponse,choiceresponse,optionresponse,multiplechoiceresponse,numericalresponse,stringresponse,formularesponse
***END override.cfg***
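
A quick way to confirm that override.cfg at least parses before launching a task (the bucket placeholders above still need real values, or local paths if you are not using S3) is to load it with the Python 2 stdlib parser:

python -c "import ConfigParser; c = ConfigParser.ConfigParser(); c.read('override.cfg'); print(c.sections())"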

2. Acquire a log file (or create a dummy one)

mkdir /tmp/log_files
cd /tmp/log_files

At this point, you can either acquire a log file from S3 or from another developer, or use the dummy file below (include the empty line at the end). Either way, place it in /tmp/log_files.

vi tracking.log-20150101-1234567890
*** BEGIN DUMMY LOG FILE ***
{"username": "test_user", "host": "class.stanford.edu", "event_source": "server", "event_type": "problem_check", "context": {"course_id": "edX/DemoX/DemoCourse", "course_user_tags": {}, "user_id": 555555, "org_id": "Education", "module": {"display_name": "Quiz - Reasoning"}}, "time": "2014-06-23T16:17:16.856434+00:00", "ip": "0.0.0.0", "event": {"submission": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"input_type": "checkboxgroup", "question": "Choose as many as you like.", "response_type": "choiceresponse", "answer": ["Reasoning is the essence of what mathematics is", "Reasoning is useful for working in most jobs", "Reasoning allows people to connect ideas and make mathematical breakthroughs"], "variant": "", "correct": false}}, "success": "incorrect", "grade": 0, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "state": {"student_answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_2"]}, "seed": 1, "done": true, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "input_state": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {}}}, "answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_0", "choice_1", "choice_2"]}, "attempts": 2, "max_grade": 1, "problem_id": "i4x://edX/DemoX-S/problem/a58470ee54cc49ecb2bb7c1b1c0ab43a"}, "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0", "page": "x_module"}
{"username": "test_user_alt", "host": "class.stanford.edu", "event_source": "server", "event_type": "problem_check", "context": {"course_id": "edX/DemoX/DemoCourse", "course_user_tags": {}, "user_id": 555556, "org_id": "Education", "module": {"display_name": "Quiz - Reasoning"}}, "time": "2014-06-22T16:17:16.856434+00:00", "ip": "0.0.0.0", "event": {"submission": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"input_type": "checkboxgroup", "question": "Choose as many as you like.", "response_type": "choiceresponse", "answer": ["Reasoning is the essence of what mathematics is", "Reasoning is useful for working in most jobs", "Reasoning allows people to connect ideas and make mathematical breakthroughs"], "variant": "", "correct": false}}, "success": "incorrect", "grade": 0, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "state": {"student_answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_2"]}, "seed": 1, "done": true, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "input_state": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {}}}, "answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_4", "choice_5", "choice_6"]}, "attempts": 2, "max_grade": 1, "problem_id": "i4x://edX/DemoX-S/problem/a58470ee54cc49ecb2bb7c1b1c0ab43a"}, "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0", "page": "x_module"}
{"username": "test_user", "host": "class.stanford.edu", "event_source": "server", "event_type": "problem_check", "context": {"course_id": "edX/DemoX/DemoCourse", "course_user_tags": {}, "user_id": 555555, "org_id": "Education", "module": {"display_name": "Quiz - Reasoning"}}, "time": "2014-06-22T16:17:16.856434+00:00", "ip": "0.0.0.0", "event": {"submission": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"input_type": "checkboxgroup", "question": "Choose as many as you like.", "response_type": "choiceresponse", "answer": ["Reasoning is the essence of what mathematics is", "Reasoning is useful for working in most jobs", "Reasoning allows people to connect ideas and make mathematical breakthroughs"], "variant": "", "correct": false}}, "success": "incorrect", "grade": 0, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "state": {"student_answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_2"]}, "seed": 1, "done": true, "correct_map": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {"hint": "", "hintmode": null, "correctness": "incorrect", "npoints": null, "msg": "", "queuestate": null}}, "input_state": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": {}}}, "answers": {"i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1": ["choice_4", "choice_5", "choice_6"]}, "attempts": 2, "max_grade": 1, "problem_id": "i4x://edX/DemoX-S/problem/a58470ee54cc49ecb2bb7c1b1c0ab43a"}, "agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0", "page": "x_module"}
*** END DUMMY LOG FILE ***
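
Each line of a tracking log must be a self-contained JSON document, so if you hand-edited the file it is worth checking that every non-empty line parses (the file name matches the one created above):

python -c "import json; [json.loads(l) for l in open('/tmp/log_files/tracking.log-20150101-1234567890') if l.strip()]" && echo "log file parses"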

3. Run the API locally and query for results of the pipeline's aggregation

cd {PATH_TO_EDX_FOLDER}/analytics/edx-analytics-pipeline
source ~/.venvs/edx-analytics-pipeline/bin/activate
launch-task AnswerDistributionToMySQLTaskWorkflow --local-scheduler --remote-log-level DEBUG --include "*tracking.log*" --src /tmp/log_files --dest /tmp/answer_dist --mapreduce-engine local --name test_task
mysql -u root -p

USE analytics;
SELECT COUNT(*) FROM answer_distribution;

If the pipeline task ran successfully (and you used the dummy file above), this should be the output:

+----------+
| COUNT(*) |
+----------+
|        2 |
+----------+
1 row in set (0.00 sec)
exit
deactivate
cd ../edx-analytics-data-api
source ~/.venvs/edx-analytics-data-api/bin/activate
./manage.py runserver --settings=analyticsdataserver.settings.local_mysql

Verify that the data API can connect to the database

1. Navigate to 127.0.0.1:8000 in your web browser:

  • If the page does not display and you see ImproperlyConfigured: Error loading MySQLdb module in the logs, run: 'pip install mysql-python'
  • If the page returns a 401 (unauthorized) error, you need to rerun: './manage.py set_api_key edx edx'

2. Click on the answer_distribution query modal and enter 'i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1' into the box (or a different module_id from your logs if you didn't use the dummy log file from above)

3. Click to request the data from the API, and the results should match the log file from above (or whichever you used)
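
If you prefer the command line, you can issue the equivalent request with curl, authenticating with the API key created earlier via set_api_key. The /api/v0/problems/<module_id>/answer_distribution/ path below is my assumption about the endpoint behind the modal; adjust it if your API exposes a different version or route:

curl -H "Authorization: Token edx" "http://127.0.0.1:8000/api/v0/problems/i4x-edX-DemoX-S-problem-a58470ee54cc49ecb2bb7c1b1c0ab43a_2_1/answer_distribution/"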


近年来人工智能在语音识别.自然语言处理.计算机视觉等诸多领域取得了巨大成功,但人工智能要规模化商用仍然存在一些问题和挑战.例如面包店要自动识别面包种类及数量实现无人结算,如调用通用图像识别服务,虽然易用,但识别准确率低,无法准确区分面包种类:如定制化模型,虽然可以提升识别的准确率,但对于大多数企业或个人来说,聘请专家定制模型费用高.周期长,后续更新难而自行开发由于专业要求高,难以实现.ModelArts自动学习是一个帮助人们实现特定AI应用的低门槛.高灵活.零代码的定制化模型开发工具,目的是让每