Dataquest User Data Analysis

Thinking Through Analytics Data

This article walks through a data analysis from start to finish. We will explore anonymized analytics data from users of the Dataquest website to see how they learn. The data comes from two main sources:

  • our database
  • an external analytics provider, keen.io

A Quick Look At Dataquest

First we need to understand how the Dataquest site is structured. At any moment you are inside a mission, which is built around a dataset and a set of skills to teach. Each mission contains multiple screens, and a table of contents on the right lets you jump to any of them. Screens are either code screens or text screens; code screens usually ask you to write an answer, which the exercise then checks for correctness. The language used throughout is Python 3.

Looking At Student Data

The first dataset comes from our database and contains, for each student:

  • Progress data: whether a screen was completed successfully, plus the code the student wrote. Finishing a screen produces a new record (completion status and your code). Each progress record is uniquely identified by a pk value.
  • Attempt data: every code attempt a student made. Each progress record has one or more associated attempt records, and each attempt also has its own unique pk. An attempt's screen_progress field holds the pk of its progress record, acting as the foreign key that links the two.

To keep the analysis simple, we extracted the database records of 50 students:
# The attempts are stored in the attempts variable, and progress is stored in the progress variable.

# Here's how one progress record looks.
print("Progress Record:")
# Pretty print is a custom function we made to output json data in a nicer way.
pretty_print(progress[0])
print("\n")

# Here's how one attempt record looks.
print("Attempt Record:")
pretty_print(attempts[0])
'''
Progress Record:
{
    "fields": {
        "attempts": 0,
        "complete": true,
        "created": "2015-04-07T21:21:57.316Z",
        "last_code": "# We'll be coding in python.\n# Python is a great general purpose language, and is used in a lot of data science and machine learning applications.\n# If you don't know python, that's okay -- important concepts will be introduced as we go along.\n# In python, any line that starts with a # is called a comment, and is used to put in notes and messages.\n# It isn't part of the code, and isn't executed.",
        "last_context": null,
        "last_correct_code": "# We'll be coding in python.\n# Python is a great general purpose language, and is used in a lot of data science and machine learning applications.\n# If you don't know python, that's okay -- important concepts will be introduced as we go along.\n# In python, any line that starts with a # is called a comment, and is used to put in notes and messages.\n# It isn't part of the code, and isn't executed.",
        "last_output": "{\"check\":true,\"output\":\"\",\"hint\":\"\",\"vars\":{},\"code\":\"# We'll be coding in python.\\n# Python is a great general purpose language, and is used in a lot of data science and machine learning applications.\\n# If you don't know python, that's okay -- important concepts will be introduced as we go along.\\n# In python, any line that starts with a # is called a comment, and is used to put in notes and messages.\\n# It isn't part of the code, and isn't executed.\"}",
        "screen": 1,
        "updated": "2015-04-07T21:25:07.799Z",
        "user": 48309
    },
    "model": "missions.screenprogress",
    "pk": 299076
}

Attempt Record:
{
    "fields": {
        "code": "# We'll be coding in python.\n# Python is a great general purpose language, and is used in a lot of data science and machine learning applications.\n# If you don't know python, that's okay -- important concepts will be introduced as we go along.\n# In python, any line that starts with a # is called a comment, and is used to put in notes and messages.\n# It isn't part of the code, and isn't executed.",
        "correct": true,
        "created": "2015-03-01T16:33:56.537Z",
        "screen_progress": 231467,
        "updated": "2015-03-01T16:33:56.537Z"
    },
    "model": "missions.screenattempt",
    "pk": 62474
}
'''

Lists

  • A list is a data structure; any element in the list can be retrieved by its index.
arnold_movies = ["Terminator", "Predator", "Total Recall", "Conan the Barbarian"]

# We can add items to a list using .append.
# If for some reason, we want to acknowledge Kindergarten Cop as a movie, we can do this...
arnold_movies.append("Kindergarten Cop")

# We can get the second item in a list
print(arnold_movies[1])
bad_arnold_movies = ["Junior", "Batman & Robin", "Kindergarten Cop"]
'''
Predator
'''

Dictionaries

  • A dictionary is another data structure; its indexes are unique keys, which can be strings or numbers.
# We can also nest dictionaries.
cities = {
        "Boston": {
            "weather": "ridiculously bad"
        },
        "San Francisco": {
            "weather" : "okay, better than Boston at least"
        },
        "San Diego": {
            "weather" : "why haven't you moved here yet?"
        }
    }
weird_presidential_facts = [
    {"name": "Benjamin Harrison", "oddity": "Afraid of electricity."},
    {"name": "Theodore Roosevelt", "oddity": "Had a pet badger named Josiah who bit people."},
    {"name": "Andrew Jackson", "oddity": "Taught his parrot to curse."}
]
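Items in a list of dictionaries are reached by indexing the list first, then the key. A quick sketch using the list above:

```python
weird_presidential_facts = [
    {"name": "Benjamin Harrison", "oddity": "Afraid of electricity."},
    {"name": "Theodore Roosevelt", "oddity": "Had a pet badger named Josiah who bit people."},
    {"name": "Andrew Jackson", "oddity": "Taught his parrot to curse."},
]

# Index into the list first, then into the dictionary by key.
print(weird_presidential_facts[2]["name"])  # Andrew Jackson
```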

The Structure Of The Data

Both progress and attempts are stored as dictionary-formatted data.

Progress record

  • pk – the id of the record in the database
  • fields
    • attempts – a count of how many attempts the student made on the screen.
    • complete – whether the student successfully passed the screen (True if they have / False if not).
    • created – what time the student first saw the screen.
    • last_code – the text of the last code the student wrote.
    • last_correct_code – the last code the student wrote that was correct. Null if they don't have anything correct.
    • screen – the id of the screen this progress is associated with.
    • user – the id of the user this progress is associated with.

Attempt record

  • pk – the id of the record in the database
  • fields
    • code – the code that was submitted for this attempt.
    • correct – whether or not the student got the answer right.
    • screen_progress – the id of the progress record this attempt is associated with.
# This gets the fields attribute from the first attempt, and prints it
# As you can see, fields is another dictionary
# The keys for fields are listed above
pretty_print(attempts[0]["fields"])
print("\n")

# This gets the "correct" attribute from "fields" in the first attempt record
print(attempts[0]["fields"]["correct"])
'''
{
    "code": "# We'll be coding in python.\n# Python is a great general purpose language, and is used in a lot of data science and machine learning applications.\n# If you don't know python, that's okay -- important concepts will be introduced as we go along.\n# In python, any line that starts with a # is called a comment, and is used to put in notes and messages.\n# It isn't part of the code, and isn't executed.",
    "correct": true,
    "created": "2015-03-01T16:33:56.537Z",
    "screen_progress": 231467,
    "updated": "2015-03-01T16:33:56.537Z"
}

True
'''

Exploring The Data

With the detailed data in hand, we can compute a few things to get to know it better:

  • The number of attempts.
  • The number of progress records.
  • The number of attempts each student makes per screen (# of attempts / # of progress records).

# Number of screens students have seen
progress_count = len(progress)
print(progress_count)

# Number of attempts
attempt_count = len(attempts)
print(attempt_count)
'''
2134
3995
'''
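The third number on the list, the average attempts per screen, follows directly from the two counts. A minimal sketch, hardcoding the counts printed above:

```python
# Counts taken from the output above, hardcoded here for illustration.
progress_count = 2134
attempt_count = 3995

# Average number of attempts each student makes per screen.
attempts_per_screen = attempt_count / progress_count
print(round(attempts_per_screen, 2))  # 1.87
```

So students average a little under two attempts for every screen they see.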

Getting To User Level Data

  • We want to understand how users interact with the site, such as how many missions they pass. First we gather all the user ids; then we can group by id and count:
# A list to put the user ids
all_user_ids = []

# A for loop lets us repeat code.
# In this case, we're pulling each record from the progress list, in order,
# and doing a manipulation.
for record in progress:
    user_id = record["fields"]["user"]
    all_user_ids.append(user_id)

# This pulls out only the unique user ids
all_user_ids = list(set(all_user_ids))
'''
all_user_ids  : list (<class 'list'>)
[51331,
 52100,
 58628,
 54532,
 55945,
 46601,
 50192,
 ...
'''
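The per-user frequencies mentioned above can also be counted in a single pass with `collections.Counter`, which sidesteps the separate deduplication step. A sketch over hypothetical records shaped like the progress list:

```python
from collections import Counter

# Hypothetical records with the same shape as the real progress list.
progress = [
    {"fields": {"user": 51331}},
    {"fields": {"user": 52100}},
    {"fields": {"user": 51331}},
]

# Count how many progress records each user has.
user_counts = Counter(record["fields"]["user"] for record in progress)

all_user_ids = list(user_counts)  # the unique user ids
print(user_counts[51331])         # prints 2
```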

Vectors

  • numpy's asarray converts a list object to an array; calling array directly works just as well.
# We can import python package like this.
import numpy

# The numpy asarray method converts a list to an array.
vector = numpy.asarray([1,2,3,4])

Matrices

  • A matrix is a two-dimensional array, indexed like matrix[1,2].
# If we use the as keyword, we can import something, but rename it to a shorter name
# This makes it easier when we do analysis because we don't have to type the full name
import numpy as np

# If we pass a list of lists to asarray, it converts them to a matrix.
matrix = np.asarray([
        [1,2,3],
        [4,5,6],
        [7,8,9],
        [10,11,12]
    ])
matrix_1_1 = matrix[1,1]
matrix_0_2 = matrix[0,2]
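Rows and columns are both zero-indexed, so the two lookups above pick out the middle of the second row and the end of the first row. A quick check:

```python
import numpy as np

matrix = np.asarray([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

print(matrix[1, 1])  # second row, second column -> 5
print(matrix[0, 2])  # first row, third column -> 3
```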

Pandas Dataframes

  • Dataframes are similar to matrices, but each column can store a different data type, and they come with many built-in functions for data analysis and visualization. The simplest way to create a dataframe is from a list of dictionaries, but those dictionaries cannot be nested: every key has to sit at the same level. So our data needs a small adjustment first. A progress record has the top-level keys pk and fields, with many sub-keys nested inside fields, so we strip away the fields wrapper and lift its sub-keys up to the same level as pk.
# "Flatten" the progress records out.
flat_progress = []
for record in progress:
    # Get the fields dictionary, and use it as the start of our flat record.
    flat_record = record["fields"]
    # Store the pk in the dictionary
    flat_record["pk"] = record["pk"]

    # Add the flat record to flat_progress
    flat_progress.append(flat_record)
flat_attempts = []
for record in attempts:
    flat_record = record["fields"]
    flat_record["pk"] = record["pk"]
    flat_attempts.append(flat_record)
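One detail worth knowing about the loops above: `record["fields"]` is a reference, so adding `pk` to it also modifies the original record. A sketch of the same flattening that copies the dictionary first and leaves the originals untouched (using a hypothetical one-record list):

```python
# A hypothetical progress record with the same shape as the real data.
progress = [{"pk": 299076, "fields": {"user": 48309, "complete": True}}]

flat_progress = []
for record in progress:
    # Copy the fields dictionary instead of aliasing it,
    # so the original record is not modified.
    flat_record = dict(record["fields"])
    flat_record["pk"] = record["pk"]
    flat_progress.append(flat_record)

print("pk" in progress[0]["fields"])  # False -- the original is unchanged
```

For a one-shot analysis the aliasing is harmless, but the copy matters if the raw records are reused later.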

Creating Dataframes

import pandas as pd

progress_frame = pd.DataFrame(flat_progress)
# Print the names of the columns
print(progress_frame.columns)
'''
Index(['attempts', 'complete', 'created', 'last_code', 'last_context', 'last_correct_code', 'last_output', 'pk', 'screen', 'updated', 'user'], dtype='object')
'''
attempt_frame = pd.DataFrame(flat_attempts)

Indexing Dataframes

  • A few built-in dataframe methods now give us quick statistics: the unique users (user_ids), the number of progress records per user (user_id_counts), and how many times each screen was recorded (screen_counts). Note that value_counts() sorts by count, from largest to smallest.
# Get all the unique values from a column.
user_ids = progress_frame["user"].unique()

# Make a table of how many screens each user attempted
user_id_counts = progress_frame["user"].value_counts()
print(user_id_counts)
screen_counts = progress_frame["screen"].value_counts()
'''
46578    177
48108    136
49340    135
54823    131
47451    123
42983    118
52584    108
...
'''

Making Charts

import matplotlib.pyplot as plt

# Plot how many screens each user id has seen.
# The value_counts method sorts everything in descending order.
user_counts = progress_frame["user"].value_counts()

# The range function creates integers from 0 up to (but not including) the specified number.
x_axis = range(len(user_counts))

# Make a bar plot of the range labels against the user counts.
plt.bar(x_axis, user_counts)

# We have to use this to show the plot.
plt.show()
  • The chart shows the number of progress records per user; because user_counts is sorted, the bars run from largest to smallest.

Pandas Filtering

  • Select the progress data for the first screen:
screen_1_frame = progress_frame[progress_frame["screen"] == 1]
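The filtered frame behaves like any other dataframe, so per-screen statistics fall out directly. A sketch on a small hand-made stand-in with the same columns:

```python
import pandas as pd

# A tiny stand-in for progress_frame with the relevant columns.
progress_frame = pd.DataFrame([
    {"screen": 1, "complete": True},
    {"screen": 1, "complete": False},
    {"screen": 2, "complete": True},
])

# Boolean filtering keeps only the rows where the condition is True.
screen_1_frame = progress_frame[progress_frame["screen"] == 1]

print(len(screen_1_frame))                # number of progress records on screen 1
print(screen_1_frame["complete"].mean())  # completion rate on screen 1
```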

Matching Attempts To Progress

We now link each attempt to its corresponding progress record (each user generates one progress record per screen), so we can count how many attempts each screen received in total and how many of them were correct. An attempt connects to its progress record through the screen_progress field (the id of the progress record this attempt is associated with), which matches the progress record's pk.

  • The code below finds the attempts for one progress record (row 1137):
# A boolean Series: attempts whose screen_progress matches the pk of progress row 1137
# (one user's record for one screen).
has_progress_row_id = attempt_frame["screen_progress"] == progress_frame["pk"][1137]
progress_attempts = attempt_frame[has_progress_row_id]
# There were 49 attempts in total: 5 correct and 44 incorrect.
correct_attempts_count = progress_attempts[progress_attempts["correct"] == True].shape[0]
incorrect_attempts_count = progress_attempts[progress_attempts["correct"] == False].shape[0]

Figuring Out Attempt Ratios

  • The DataFrame groupby method splits the frame into a DataFrameGroupBy object keyed on the values of a column, here "screen_progress". Aggregating each group's correct column with np.mean then yields the ratio of correct attempts.
import numpy as np
import matplotlib.pyplot as plt
# Split the data into groups
groups = attempt_frame.groupby("screen_progress")
ratios = []
# Compute ratios for each group
# Loop over each group, and compute the ratio.
for name, group in groups:
    # The ratio we want is the number of correct attempts divided by the total number of attempts.
    # Taking the mean of the correct column will do this.
    # If you take the sum or mean of a boolean column, True values will become 1, and False values 0.
    ratio = np.mean(group["correct"])

    # Add the ratio to the ratios list.
    ratios.append(ratio)

# This code does the same thing as the segment above, but it's simpler.
# We aggregate across each group using the np.mean function.
# This takes the mean of every column in each group, then makes a dataframe with all the means.
# We only care about correctness, so we only select the correct column at the end.
easier_ratios = groups.aggregate(np.mean)["correct"]
print(groups.aggregate(np.mean))
'''
                  correct      pk
screen_progress
231467           1.000000   62474
231470           1.000000   62476
231474           1.000000   62477
231476           1.000000   62479
...
400199           0.333333  340950
400201           1.000000  340952
400202           0.500000  340953
400204           1.000000  340956
400205           1.000000  340958
'''

# We can plot a histogram of the easier_ratios series.
# The kind argument specifies that we want a histogram.
# Histograms show how values are distributed -- in this case, 900 of the screens have only 1 (correct) attempt.
# Many more appear to have had two attempts (a .5 ratio).
easier_ratios.plot(kind="hist")
plt.show()
counts = groups.aggregate(len)["correct"]
print(groups.aggregate(len))
'''
                 code  correct  created  pk  updated
screen_progress
231467              1        1        1   1        1
231470              1        1        1   1        1
231474              1        1        1   1        1
231476              1        1        1   1        1
231481              1        1        1   1        1
231489              1        1        1   1        1
231492              2        2        2   2        2
231493              4        4        4   4        4
...
'''
counts.plot(kind="hist")
plt.show()
'''
easier_ratios : Series (<class 'pandas.core.series.Series'>)
screen_progress
231467             1
231470             1
...
400190             0.500000
400199             0.333333
400201             1.000000
400202             0.500000
400204             1.000000
400205             1.000000

counts : Series (<class 'pandas.core.series.Series'>)
screen_progress
231467             1
231470             1
231474             1
231476             1
...
400190             2
400199             3
400201             1
400202             2
400204             1
400205             1
'''

Who Gives Up?

  • Next we explore who gives up on learning, perhaps because a mission is too hard. If we can identify these users, we can take steps to help them. The first thing we need is the attempts users made before giving up.
  • We start with the progress records that were never completed, then pull out the attempts linked to them. gave_up_ids holds the pk values of those abandoned progress records; the pk joins to the screen_progress field of the attempts.
gave_up = progress_frame[progress_frame["complete"] == False]
gave_up_ids = gave_up["pk"]

Graphing Attempt Counts

  • Now we pull the attempt data for the users who gave up; the pandas isin method does exactly this. groups.aggregate(len) then counts the number of attempts in each screen_progress group:
gave_up_boolean = attempt_frame["screen_progress"].isin(gave_up_ids)
gave_up_attempts = attempt_frame[gave_up_boolean]
groups = gave_up_attempts.groupby("screen_progress")
counts = groups.aggregate(len)["correct"]
counts.plot(kind="hist")
plt.show()

Attempt Count Differential

Most people gave up after a single failed attempt, though there is a long tail: one person tried 15 times before giving up. Now let's see how many submissions the people who did not give up typically made:

gave_up = attempt_frame[attempt_frame["screen_progress"].isin(gave_up_ids)]
groups = gave_up.groupby("screen_progress")
counts = groups.aggregate(len)["correct"]

# We can use the .mean() method on series to compute the mean of all the values.
# This is how many attempts, on average, people who gave up made.
print(counts.mean())

# We can filter our attempts data to find who didn't give up (people that got the right answer).
# To do this, we use the ~ operator.
# It negates a boolean, and swaps True and False.
# This filters for all rows that aren‘t in gave_up_ids.
eventually_correct = attempt_frame[~attempt_frame["screen_progress"].isin(gave_up_ids)]
groups = eventually_correct.groupby("screen_progress")
counts = groups.aggregate(len)["correct"]
print(counts.mean())
'''
2.89473684211
2.4858044164
'''
  • On average, the people who gave up made more submissions than the people who eventually got the answer right.

Another Data Store

To better help the users who give up, we need finer-grained data. Events such as playing a video or clicking a button are not stored in the main database; they live in a dedicated analytics store, collected by the site's front end. We picked a subset of the event types to analyze:

  • started-mission – a mission is started by a student
  • started-screen – a screen in a mission is started
  • show-hint – a click on the “hint” button
  • run-code – a click on the “run” button
  • reset-code – a click on the “reset code” button
  • next-screen – a click on the “next” button
  • get-answer – a click on the “show answer” button

These events are stored as sessions; a session represents the click actions (each click is one event) a single user took over a period of time. We randomly sampled the session data of 200 users for analysis.

'''
sessions
list (<class 'list'>)
[[{'event_type': 'started-mission',
   'keen': {'created_at': '2015-06-12T23:09:03.966Z',
    'id': '557b668fd2eaaa2e7c5e916b',
    'timestamp': '2015-06-12T23:09:07.971Z'},
   'sequence': 1},
  {'event_type': 'started-screen',
   'keen': {'created_at': '2015-06-12T23:09:03.979Z',
    'id': '557b668f90e4bd26c10b6ed6',
    'timestamp': '2015-06-
    ...
'''
# We have 200 sessions
print(len(sessions))

# The first session has 38 student events
print(len(sessions[0]))

# Here's a single event from the first user session -- it's a started-screen event
print(sessions[0][3])

# We‘ll make a histogram of event counts per session
plt.hist([len(s) for s in sessions])
plt.show()
'''
200
38
{'event_type': 'started-screen', 'mission': 1, 'type': 'code', 'sequence': 2, 'keen': {'timestamp': '2015-06-12T23:09:28.589Z', 'id': '557b66a4672e6c40cd9249f7', 'created_at': '2015-06-12T23:09:24.688Z'}}
'''

Event Structure

  • event_type – the type of event – there's a list of event types in the last screen.
  • created_at – when the event occurred – in the keen dictionary.
  • id – the unique id of the event – in the keen dictionary.
  • sequence – this field varies by event type – for started-mission events, it's the mission that was started. For all other events, it's the screen that the event occurred on. Each mission consists of multiple screens.
  • mission – if the event occurs on a screen, this is the mission the event occurs in.
  • type – if the event occurs on a screen, the type of screen (code, video, or text).
# Where we'll put the events after we "flatten" them
flat_events = []

# If we're going to combine everything in one dataframe, we need to keep
# track of a session id for each session, so we can link events across sessions.
session_id = 1
# Loop through each session.
for session in sessions:
    # Loop through each event in each session.
    for event in session:
        new_event = {
            "session_id": session_id,
            # We use .get() to get the fields that could be missing.
            # .get() will return a default null value if the key isn't found in the dictionary.
            # If we used regular indexing like event["mission"], we would get an
            # error if the key wasn't found.
            "mission": event.get("mission"),
            "type": event.get("type"),
            "sequence": event.get("sequence")
        }

        new_event["id"] = event["keen"]["id"]
        new_event["created_at"] = event["keen"]["created_at"]
        new_event["event_type"] = event["event_type"]
        flat_events.append(new_event)

    # Increment the session id so each session has a unique id.
    session_id += 1
  • To assemble the events into a DataFrame, we first need a list of flat dictionaries, with every key lifted to the same level. We also add a new session_id key: each session contains multiple events, and session_id is what ties a session's events together.
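The manual session_id counter in the loop above can also be replaced with enumerate, which yields the index alongside each session. A sketch with a hypothetical two-session list:

```python
# Two hypothetical sessions with the same shape as the real data.
sessions = [
    [{"event_type": "started-mission", "sequence": 1,
      "keen": {"id": "a1", "created_at": "2015-06-12T23:09:03.966Z"}}],
    [{"event_type": "run-code", "sequence": 2, "mission": 1, "type": "code",
      "keen": {"id": "b2", "created_at": "2015-06-12T23:10:00.000Z"}}],
]

flat_events = []
# enumerate(sessions, 1) numbers the sessions starting from 1.
for session_id, session in enumerate(sessions, 1):
    for event in session:
        flat_events.append({
            "session_id": session_id,
            "mission": event.get("mission"),
            "type": event.get("type"),
            "sequence": event.get("sequence"),
            "id": event["keen"]["id"],
            "created_at": event["keen"]["created_at"],
            "event_type": event["event_type"],
        })

print([e["session_id"] for e in flat_events])  # [1, 2]
```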

Convert To Dataframe

event_frame = pd.DataFrame(flat_events)

Exploring The Session Data

Now we can analyze the event data. First: what is the last event users trigger before a session ends? The end of a session marks the point where a user stops learning on the platform, so there should be useful patterns here, and it ties back to the give-up behavior we saw earlier. What do users do right before they leave? We sort the events by created_at in ascending order, group them by session_id, and take the last event of each session as its ending event.

# Sort event_frame in ascending order of created_at.
event_frame = event_frame.sort_values("created_at", ascending=True)

# Group events by session
groups = event_frame.groupby("session_id")

ending_events = []
for name, group in groups:
    # The .tail() method will get the last few events from a dataframe.
    # The number you pass in controls how many events it will take from the end.
    # Passing in 1 ensures that it only takes the last one.
    last_event = group["event_type"].tail(1)
    ending_events.append(last_event)

# The concat method will combine a list of series into a dataframe.
ending_events = pd.concat(ending_events)
ending_event_counts = ending_events.value_counts()
ending_event_counts.plot(kind="bar")
plt.show()
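The loop over groups with .tail(1) can be condensed: GroupBy.last() picks the final row of each group directly. A sketch on a minimal stand-in frame, assuming the rows are already sorted by created_at:

```python
import pandas as pd

# A minimal stand-in for event_frame, already sorted by created_at.
event_frame = pd.DataFrame([
    {"session_id": 1, "event_type": "run-code",       "created_at": "2015-06-12T23:09:00Z"},
    {"session_id": 1, "event_type": "started-screen", "created_at": "2015-06-12T23:10:00Z"},
    {"session_id": 2, "event_type": "next-screen",    "created_at": "2015-06-12T23:11:00Z"},
])

# The last event of each session, in one call instead of a loop.
ending_events = event_frame.groupby("session_id")["event_type"].last()
print(ending_events.value_counts().to_dict())
```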

Most Common Events

event_counts = event_frame["event_type"].value_counts()
event_counts.plot(kind="bar")
plt.show()

Discussion

The overall event counts and the ending events differ in one major way: the overwhelming majority of people trigger a started-screen event right before leaving the platform, far more often than its overall share would predict. Two plausible explanations:

- People open a screen, take one look, decide it is too hard, and leave the platform.

- Or they open a screen, it takes too long to work through, and they walk away from the computer.

We would need to talk to users to pin down why they actually leave, and then take steps to increase the time they spend learning.

- We could also look at which missions and screens users give up on; that would flag content that may be too hard or too easy, so we can adjust it.

Mission Numbers

  • We want to see which missions generate the most events; unsurprisingly, the earliest missions draw the biggest audience.
event_counts = event_frame["mission"].value_counts()
event_counts.plot(kind="bar")
plt.show()

Explore!

  • The mission column mixes string and numeric ids, so the same mission can show up as two separate bars in the chart above:
count = event_frame["mission"].unique()
'''
ndarray (<class 'numpy.ndarray'>)
array([None, 5, '5', '3', 3, 2, 7, '2', 6, '6', 1, '1', '9', 9, 4, '4',
       '7', '8', 8, '33', 51, '51'], dtype=object)
'''
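Before counting, the mixed string/integer mission ids could be normalized so each mission is counted once. A sketch on a small stand-in series with the mixed types seen above:

```python
import pandas as pd

# A stand-in for event_frame["mission"], with mixed string/int ids.
mission = pd.Series([None, 5, "5", "3", 3, 2])

# Drop missing values, then coerce everything to int so 5 and "5" collapse
# into one mission id.
normalized = mission.dropna().astype(int)
print(normalized.value_counts().to_dict())
```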

Some interesting questions we could explore next in this data:

  • Can a user's current sequence predict the next action they will take?
  • Do certain events occur more often in certain missions?
  • Can we estimate how difficult a mission is?
  • What other data could we collect?