CIS 545 - Big Data Analytics

CIS 545 - Big Data Analytics - Fall 2019
Have you ever wondered about (1) what it takes to be a data scientist or "data person", and (2) how so
work?
This homework is focused on (1) working with hierarchical data stored in dataframes, (2) traversing re
data, (3) understanding a bit about performance.
We will focus on questions about data scientists from "our" crawl of the LinkedIn dataset, which was a
extended notebook.
!pip install pymongo[tls,srv]
!pip install swifter
!pip install lxml
import pandas as pd
import numpy as np
import json
import sqlite3
from lxml import etree
import urllib.request
import zipfile
import time
import swifter
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError, OperationFailure
We need to pull the zip file with LinkedIn data from Amazon S3 (where it is shared) to your local machi
machine. Only when the data is local can we efficiently parse it (and we'll read directly out of a zip file).
The zip file contains three files with the same schema. You can start with the tiny instance to test yo
brave and have a lot of time feel free to use the full file.
Step 0: Acquire and load data
Due October 11, 2019 at 10pm
Homework 2: Querying Linked (LinkedIn) Data
We will grade your homework using small . Hidden test 0.0 will override your file selection, so as lon
in a cell that comes after that one, you will be fine.
linkedin.json (3M records)
linkedin_small.json (100K records)
linkedin_tiny.json (10K records)
The cell below will download a 3GB file to your Google Cloud. It may take a while. You do not need to m
#url = 'https://upenn-bigdataanalytics.s3.amazonaws.com/linkedin.zip'
#filehandle, _ = urllib.request.urlretrieve(url,filename='local.zip')
filehandle = 'local.zip'
# What's the zip file actually called locally?
filehandle
The cell below creates pointers to the three versions of our dataset. To switch between them, simply c
the cell below.
def fetch_file(fname):
    # Return an open handle for the archive member named fname, or None if absent
    zip_file_object = zipfile.ZipFile(filehandle, 'r')
    for name in zip_file_object.namelist():
        if name == fname:
            return zip_file_object.open(name)
    return None

linkedin_tiny = fetch_file('linkedin_tiny.json')
linkedin_small = fetch_file('linkedin_small.json')
linkedin_huge = fetch_file('linkedin.json')
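A minimal sketch of reading records straight out of a zip archive without extracting it. It assumes one JSON object per line, which the source does not confirm; the member name `demo_tiny.json`, the record fields, and the helper `read_records` are made up for illustration.

```python
import io
import json
import zipfile

# Build a tiny in-memory zip so the sketch runs without the real download.
# The member name and record fields are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "demo_tiny.json",
        '{"_id": "a1", "skills": ["SQL"]}\n{"_id": "b2", "skills": []}\n',
    )

def read_records(zip_source, member, limit):
    """Yield at most `limit` parsed JSON records from one member of a zip archive."""
    with zipfile.ZipFile(zip_source, "r") as zf:
        with zf.open(member) as handle:  # streams the member; nothing is written to disk
            for i, line in enumerate(handle):
                if i >= limit:
                    break
                yield json.loads(line)  # assumes one JSON object per line

buffer.seek(0)
records = list(read_records(buffer, "demo_tiny.json", limit=20000))
print(len(records), records[0]["_id"])
```

Capping the loop with `limit` mirrors the idea of reading only a bounded number of people, so the tiny, small, and full files can share the same loading code.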
# CIS 545 Hidden Test 0.0 - please do not modify or delete this cell!
# Set the input file to process
file = linkedin_small
In the cell below, adapt the data loading code from the in-class notebook. You will need the function th
the function that converts relations to dataframes. Read in a maximum of 20000 people. Put the code
relations, removes the interval field, and stores the field
information with a try statement, just in case.
command to move on. At the end of the next cell, you should have nine dataframes with the following
Step 0.1: Store data in dataframes
11/3/2019 Homework_2.ipynb - Colaboratory
https://colab.research.google.com/drive/1K0hp-Y5R7FHa3AwfAj2tCw3ueXOlJhay#scrollTo=syxh_fwyTAVU 3/12
1. people_df
2. names_df
3. education_df
4. groups_df
5. skills_df
6. experience_df
7. honors_df
8. also_view_df
9. events_df
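As a rough illustration of turning hierarchical records into flat relations, the sketch below splits a nested list into a child dataframe keyed by the person's ID. The record fields (`_id`, `locality`, `skills`) and the output column names are assumptions, not the course's actual schema, and this is not the in-class loading code.

```python
import pandas as pd

# Hypothetical records; the real LinkedIn schema may differ.
records = [
    {"_id": "a1", "locality": "Philadelphia", "skills": ["SQL", "Python"]},
    {"_id": "b2", "locality": "Boston", "skills": ["Excel"]},
]

people_rows, skill_rows = [], []
for rec in records:
    # Each nested list element becomes one row in a child relation keyed by person ID.
    for skill in rec.get("skills", []):
        skill_rows.append({"person": rec["_id"], "value": skill})
    # Scalar fields stay in the parent relation.
    people_rows.append({k: v for k, v in rec.items() if not isinstance(v, (list, dict))})

demo_people_df = pd.DataFrame(people_rows)
demo_skills_df = pd.DataFrame(skill_rows)
print(demo_people_df)
print(demo_skills_df)
```

Keeping the person ID in every child row is what later makes joins between the relations possible.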
# TODO: Adapt the data loading code from class.
# YOUR CODE HERE
raise NotImplementedError()
# CIS 545 Sanity Check 0.1 - please do not modify or delete this cell!
display(experience_df)
# CIS 545 Hidden Test 0.1.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 0.1.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 0.1.3 - please do not modify or delete this cell!
Next, save the data to SQLite, again using the same approach as in the sample notebook.
Step 0.2: Convert to SQL
conn = sqlite3.connect('linkedin.db')
# YOUR CODE HERE
raise NotImplementedError()
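One common way to move a dataframe into SQLite is `DataFrame.to_sql`, sketched here on toy data. The table name `skills`, the columns, and the in-memory database are assumptions chosen so the example does not touch `linkedin.db`; the sample notebook's approach may differ.

```python
import sqlite3
import pandas as pd

demo_df = pd.DataFrame(
    {"person": ["a1", "a1", "b2"], "value": ["SQL", "Python", "Excel"]}
)

demo_conn = sqlite3.connect(":memory:")  # in-memory database, so linkedin.db is untouched
# if_exists="replace" lets the cell be re-run without a "table already exists" error
demo_df.to_sql("skills", demo_conn, if_exists="replace", index=False)

round_trip = pd.read_sql_query(
    "SELECT person, value FROM skills ORDER BY person, value", demo_conn
)
print(round_trip)
demo_conn.close()
```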
# CIS 545 Sanity Check 0.2.1 - please do not modify or delete this cell!
people_df.describe()
# CIS 545 Sanity Check 0.2.2 - please do not modify or delete this cell!
skills_df.describe()
# CIS 545 Sanity Check 0.2.3 - please do not modify or delete this cell!
experience_df.describe()
In this homework, we will use LinkedIn to analyze what it means to be a data scientist (as of a few yea
Step 1: What is a data scientist?
Our first question is: for anyone whose job revolves around data (database administrators, data curator
are the most common skills?
Step 1.1: What are common skills for data scientists?
Complete the collect_skills function below. This and the other functions in this homework allow u
queries even if your data do not match ours. The function should:
1. Using experience_df, find all people with a position containing "data" in the title. Remember upper versus lo
2. Using skills_df, find all people with "data science" as a skill. Again, remember to account for case.
3. For all of the unique people found in steps 1 and 2, find the rest of their skills.
4. Return a dataframe of the top 15 skills, by frequency (see pandas.DataFrame.sort_values). The columns shou
scientists (the count of the number of data scientists with this skill).
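The steps above can be sketched on toy data as follows. The column names `person`, `title`, and `value` are assumptions about the schema, the exact-match test for "data science" is one possible reading of step 2, and the limit is 2 rather than 15 only because the toy data is small.

```python
import pandas as pd

# Toy data; column names are assumed, not taken from the real dataframes.
exp = pd.DataFrame({"person": ["p1", "p2", "p3"],
                    "title": ["Data Engineer", "Chef", "DATABASE Admin"]})
skills = pd.DataFrame({"person": ["p1", "p1", "p2", "p2", "p3", "p4"],
                       "value": ["SQL", "Python", "Data Science", "SQL", "SQL", "Cooking"]})

# Step 1: case-insensitive substring match on the title.
by_title = exp.loc[exp["title"].str.contains("data", case=False, na=False), "person"]
# Step 2: case-insensitive match on the skill itself.
by_skill = skills.loc[skills["value"].str.lower() == "data science", "person"]
# Step 3: the unique people from either step.
data_people = pd.concat([by_title, by_skill]).drop_duplicates()

# Step 4: count distinct people per skill and keep the most frequent ones.
top = (skills[skills["person"].isin(data_people)]
       .groupby("value")["person"].nunique()
       .reset_index(name="scientists")
       .rename(columns={"value": "skill"})
       .sort_values(["scientists", "skill"], ascending=[False, True])
       .head(2))
print(top)
```

Sorting on the skill name as a secondary key makes ties deterministic, which helps when comparing against a SQL version of the same query.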
Step 1.1.1: Collect skills (Pandas)
# TODO: Find the top 15 skills for data scientists (Pandas)
def collect_skills(experience_df, people_df, skills_df):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.1.1 - please do not modify or delete this cell!
top_skills_df = collect_skills(experience_df, people_df, skills_df)
display(top_skills_df)
if "skill" not in top_skills_df:
    raise AssertionError("skill column not defined")
if "scientists" not in top_skills_df:
    raise AssertionError("scientists column not defined")
if len(top_skills_df) != 15:
    raise AssertionError("dataframe does not have top 15")  
# CIS 545 Hidden Test 1.1.1.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.1.1.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.1.1.3 - please do not modify or delete this cell!
Compute the same table as in 1.1.1 using SQL. Store it as top_skills_sql but otherwise matching t
to also save the data to SQLite in a table called top_skills, as we will be testing to see if this table
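A hedged sketch of the same idea in SQL, run against an in-memory SQLite database with toy tables. The table names `experience` and `skills`, their columns, and the output table `top_skills_demo` are assumptions for illustration; the limit is 2 only because the data is tiny.

```python
import sqlite3
import pandas as pd

conn_demo = sqlite3.connect(":memory:")
pd.DataFrame({"person": ["p1", "p2", "p3"],
              "title": ["Data Engineer", "Chef", "DATABASE Admin"]}
             ).to_sql("experience", conn_demo, index=False)
pd.DataFrame({"person": ["p1", "p1", "p2", "p2", "p3", "p4"],
              "value": ["SQL", "Python", "Data Science", "SQL", "SQL", "Cooking"]}
             ).to_sql("skills", conn_demo, index=False)

query = """
WITH data_people AS (
    SELECT person FROM experience WHERE LOWER(title) LIKE '%data%'
    UNION                      -- UNION (not UNION ALL) removes duplicate people
    SELECT person FROM skills WHERE LOWER(value) = 'data science'
)
SELECT s.value AS skill, COUNT(DISTINCT s.person) AS scientists
FROM skills s JOIN data_people d ON s.person = d.person
GROUP BY s.value
ORDER BY scientists DESC, skill ASC
LIMIT 2
"""
top_sql = pd.read_sql_query(query, conn_demo)
# Persist the result as its own table so it can be queried later.
top_sql.to_sql("top_skills_demo", conn_demo, if_exists="replace", index=False)
print(top_sql)
```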
Step 1.1.2: Top skills (SQL)
# TODO: Find the top 15 skills for data scientists (SQL)
# YOUR CODE HERE
raise NotImplementedError()
display(top_skills_sql)
# CIS 545 Sanity Check 1.1.2 - please do not modify or delete this cell!
if "skill" not in top_skills_sql:
    raise AssertionError("skill column not defined")
if "scientists" not in top_skills_sql:
    raise AssertionError("scientists column not defined")
if len(top_skills_df) < 1:
    raise AssertionError("dataframe has no results")  
if len(top_skills_sql.merge(top_skills_df)) != len(top_skills_sql):
    raise AssertionError("Pandas and SQL versions are not of the same length")
# CIS 545 Hidden Test 1.1.2 - please do not modify or delete this cell!
Complete the collect_titles function below that aggregates the most recent titles of people with d
use the given dataframes as input and return a two column dataframe: one column called title and
consider people who have at least min_skills of the top skills for a data scientist. You should also o
min_count times.
For extra practice, you can also do this in SQL, although we are not grading that.
Step 1.2: What are common titles for those with data science skills?
# TODO: Find the common titles (Pandas)
def collect_titles(top_skills_df, skills_df, people_df, experience_df, min_skills, min
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.2 - please do not modify or delete this cell!
ds_titles_df = collect_titles(top_skills_df, skills_df, people_df, experience_df, 6, 2
display(ds_titles_df)
if "title" not in ds_titles_df:
    raise AssertionError("title column not defined")
if "count" not in ds_titles_df:
    raise AssertionError("count column not defined")
if len(ds_titles_df) < 1:
    raise AssertionError("dataframe has no results")
# CIS 545 Hidden Test 1.2.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.2.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.2.3 - please do not modify or delete this cell!
Now let's find the list of companies that have employed people with the above titles, ranked by numbe
Step 1.3: Who employs "data people" based on title?
Complete the collect_employers function below that aggregates the employers with positions corr
people with data science skills. This function should use the given dataframes as input and return a tw
org and the other called people. Show the names of companies (in field org) with at least min_cou
(include that count in the people column). Order the dataframe by the count of data people in the com
Step 1.3.1: Data employers
# TODO: Find the data employers
def collect_employers(experience_df, ds_titles_df, min_count):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.3.1 - please do not modify or delete this cell!
employers_df = collect_employers(experience_df, ds_titles_df, 5)
display(employers_df)
if "IBM" not in employers_df['org'].tolist():
    raise AssertionError("Missing IBM")

if employers_df['people'].min() < 4:
    raise AssertionError("Not filtering properly")
# CIS 545 Hidden Test 1.3.1.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.3.1.2 - please do not modify or delete this cell!
Complete the collect_employees function below that aggregates the employees of employers with
recent titles of people with data science skills. In other words, who are the employees of the data emp
their titles? This function should use the given dataframes as input and return the org , family_name
person.
Step 1.3.2: Their employees
# TODO: Find the employees of the data employers
# YOUR CODE HERE
raise NotImplementedError()
# CIS 545 Sanity Check 1.3.2 - please do not modify or delete this cell!
title_people_df = collect_employees(people_df, experience_df, employers_df, names_df, 
display(title_people_df)
if len(title_people_df.columns) != 4:
    raise AssertionError('Wrong number of columns. Check schema again')
# CIS 545 Hidden Test 1.3.2.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.3.2.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.3.2.3 - please do not modify or delete this cell!
Step 1.4: Find peers
In many common social graph settings, we can make recommendations to people based on their simi
define similarity in terms of the number of identical skills.
Suppose A and B have similar skills: A -> X1 and B -> X1, A -> X2 and B -> X2, etc. up to A -> Xk and B ->
Then, given that A and B have similar skills, we might recommend A's employer to B, and vice versa.
Let's consider only the first 100 people in people_df.
Find, out of this set, the pairs of people with the most shared/common skills, and return the closest 20
this to make a recommendation for a potential employer and position to each person.
Step 1.4.0: Making the problem tractable in Pandas
Complete the collect_peers function below that finds the top num pairs of peers. In other words, co
person, counting the total set of skills in common. This function should use the given dataframes and
dataframe: person_1, person_2, and common_skills. The first two columns should be person IDs a
of skills that this pair of people shares.
Hint: Doing this requires a Cartesian product, i.e., every ID paired with every other ID. Think about how t
then add a field to this dataframe that will let us combine every record with every record.
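The hint about adding a field can be illustrated with a constant join key, as in this sketch on toy data. The column names `person` and `value` are assumptions, and keeping only pairs with `person_1 < person_2` is one possible way to drop self-pairs and mirrored duplicates; the assignment may expect a different convention.

```python
import pandas as pd

ids = pd.DataFrame({"person": ["p1", "p2", "p3"]})
skills = pd.DataFrame({"person": ["p1", "p1", "p2", "p2", "p3"],
                       "value": ["SQL", "Python", "SQL", "Python", "SQL"]})

# A constant key makes every row match every row: a Cartesian product.
ids = ids.assign(key=1)
pairs = ids.merge(ids, on="key", suffixes=("_1", "_2")).drop(columns="key")
pairs = pairs[pairs["person_1"] < pairs["person_2"]]  # drop self-pairs and mirrored pairs

# Attach person_1's skills, then keep only the skills person_2 also has.
s1 = skills.rename(columns={"person": "person_1"})
s2 = skills.rename(columns={"person": "person_2"})
both = pairs.merge(s1, on="person_1").merge(s2, on=["person_2", "value"])

common = (both.groupby(["person_1", "person_2"]).size()
              .reset_index(name="common_skills")
              .sort_values("common_skills", ascending=False))
print(common)
```

Newer pandas versions also accept `how="cross"` in `DataFrame.merge`, which produces the same product without the helper column. The product grows quadratically with the number of IDs, which is why restricting the input to a small subset keeps the problem tractable.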
Step 1.4.1: Compute the top pairs of peers
# TODO: Finish the collect_peers function
people_df_subset = people_df.head(100)
def collect_peers(people_df_subset, skills_df, num):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.4.1 - please do not modify or delete this cell!
recs_df = collect_peers(people_df_subset, skills_df, 20)
display(recs_df)
if "person_1" not in recs_df:
    raise AssertionError("person_1 column not defined")
if "person_2" not in recs_df:
    raise AssertionError("person_2 column not defined")
if "common_skills" not in recs_df:
    raise AssertionError("common_skills column not defined")
if(len(recs_df) != 20):
    raise AssertionError('Wrong number of rows in recs_df')
# CIS 545 Hidden Test 1.4.1.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.4.1.2 - please do not modify or delete this cell!
Complete the last_job function below that takes experience_df as input and returns the person ,
person's last (most recent) employment experience (three column dataframe).
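One possible way to pick each person's most recent row is to sort by a date column and keep the last row per person, sketched below. The columns `person`, `title`, `org`, and especially `start` are assumptions; the real experience_df may record recency differently (for example by row order), in which case the sort key would change.

```python
import pandas as pd

# Toy data with an assumed `start` column in sortable YYYY-MM form.
exp = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "title": ["Analyst", "Data Scientist", "Chef"],
    "org": ["A", "B", "C"],
    "start": ["2010-01", "2015-06", "2012-03"],
})

latest = (exp.sort_values("start")                     # oldest first
             .drop_duplicates("person", keep="last")   # keep each person's newest row
             [["person", "title", "org"]]
             .reset_index(drop=True))
print(latest)
```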
Step 1.4.2: Get the last jobs
# TODO: Complete the last_job function
def last_job(experience_df):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.4.2 - please do not modify or delete this cell!
last_job_df = last_job(experience_df)
display(last_job_df)
if(len(last_job_df.columns) != 3):
    raise AssertionError('Wrong number of columns in last_job_df')
# CIS 545 Hidden Test 1.4.2.1 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.4.2.2 - please do not modify or delete this cell!
# CIS 545 Hidden Test 1.4.2.3 - please do not modify or delete this cell!
Complete the recommend_jobs function below that takes recs_df, names_df, and last_job_df as
person_2's most recent title and org.
Step 1.4.3: Recommend jobs
# TODO: Complete the recommend_jobs function
def recommend_jobs(recs_df, names_df, last_job_df):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 1.4.3 - please do not modify or delete this cell!
recommended_df = recommend_jobs(recs_df, names_df, last_job_df)
display(recommended_df)
if "family_name" not in recommended_df:
    raise AssertionError("family_name column not defined")
if "given_name" not in recommended_df:
    raise AssertionError("given_name column not defined")
if "person_2" not in recommended_df:
    raise AssertionError("person_2 column not defined")
if "org" not in recommended_df:
    raise AssertionError("org column not defined")
if "title" not in recommended_df:
    raise AssertionError("title column not defined")
# CIS 545 Hidden Test 1.4.3 - please do not modify or delete this cell!
This last section relates to our discussions in lecture about computational efficiency with big data.
Step 2: Compare Evaluation Orders
Let's look at some computation and optimization tasks. We'll start with the code from our lecture note
dataframes.
Step 2.0: Load custom functions
# Join using nested loops
def merge(S,T,l_on,r_on):
    ret = pd.DataFrame()
    count = 0
    S_ = S.reset_index().drop(columns=['index'])
    T_ = T.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S)):
        for t_index in range(0, len(T)):
            count = count + 1
            if S_.loc[s_index, l_on] == T_.loc[t_index, r_on]:
                ret = ret.append(S_.loc[s_index].append(T_.loc[t_index].drop(labels=r_
    print('Merge compared %d tuples'%count)
    return ret
  
# Join using a *map*, which is a kind of in-memory index
# from keys to (single) values
def merge_map(S,T,l_on,r_on):
    ret = pd.DataFrame()
    T_map = {}
    count = 0
    # Take each value in the r_on field, and
    # make a map entry for it
    T_ = T.reset_index().drop(columns=['index'])
    for t_index in range(0, len(T)):
        # Make sure we aren't overwriting an entry!
        assert (T_.loc[t_index,r_on] not in T_map)
        T_map[T_.loc[t_index,r_on]] = T_.loc[t_index]
        count = count + 1
    # Now find matches
    S_ = S.reset_index().drop(columns=['index'])
    for s_index in range(0, len(S)):
        count = count + 1
        if S_.loc[s_index, l_on] in T_map:
            ret = ret.append(S_.loc[s_index].append(T_map[S_.loc[s_index, l_on]].d
    print('Merge compared %d tuples'%count)
    return ret
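To make the cost difference between the two strategies concrete, here is a simplified sketch on plain lists of dicts rather than dataframes. It is not the lecture code: the function names `nested_loop_join` and `hash_join` and the toy rows are made up, and unlike the functions above it keeps the right-hand key in the output row.

```python
def nested_loop_join(S, T, l_on, r_on):
    """Compare every row of S with every row of T: len(S) * len(T) comparisons."""
    out, count = [], 0
    for s in S:
        for t in T:
            count += 1
            if s[l_on] == t[r_on]:
                out.append({**t, **s})
    return out, count

def hash_join(S, T, l_on, r_on):
    """Index T once, then probe the index per row of S: len(T) + len(S) steps."""
    index, count = {}, 0
    for t in T:
        assert t[r_on] not in index  # same single-value-per-key assumption as merge_map
        index[t[r_on]] = t
        count += 1
    out = []
    for s in S:
        count += 1
        if s[l_on] in index:
            out.append({**index[s[l_on]], **s})
    return out, count

S = [{"person_2": "p2"}, {"person_2": "p3"}]
T = [{"person": "p1", "org": "A"}, {"person": "p2", "org": "B"}, {"person": "p3", "org": "C"}]

rows_nl, n_nl = nested_loop_join(S, T, "person_2", "person")
rows_h, n_h = hash_join(S, T, "person_2", "person")
print("nested loop steps:", n_nl, "map steps:", n_h)
```

Both strategies return the same rows here; only the amount of work differs. The map-based version requires unique keys on the indexed side, so which input plays the role of T matters when choosing an evaluation order.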
Reimplement recommend_jobs using the above merge or merge_map functions instead of Pandas' m
You should start with the dataframes recs_df , names_df , and last_job_df from above. Store your
Step 2.1: Find an optimal order of evaluation.
# TODO: Reimplement recommend jobs using our custom merge and merge_map functions
def recommend_jobs_new(recs_df, names_df, last_job_df):
    # YOUR CODE HERE
    raise NotImplementedError()
# CIS 545 Sanity Check 2.1 - please do not modify or delete this cell!
%%time
recs_new_df = recommend_jobs_new(recs_df, names_df, last_job_df)
if(len(recs_new_df.columns) != 5):
    raise AssertionError('Wrong number of columns in recs_new_df')
1. When you are done, select “Edit” at the top of the window, under the filename, not the one that may appear ab
do this just before turning in your homework because it reduces the size of your file.
Step 3: Submitting Your Homework
2. In the same menu under the filename, select “File” and then “Download .ipynb”. It is very importa
of this downloaded notebook. Make sure that something like “(1)” did not get added to the filena
the .py version. Our autograder can only handle .ipynb files with the correct file name.
3. Compress the ipynb file into a Zip file hw2.zip.
4. Go to the submission site, and click on the Google icon. Log in using your [email protected] (if at al
student) GMail account.
5. Click on the Courses icon at the top, then select CIS 545 and Save. Select cis545-2019c-hw2 an
6. You should see a message on the submission site notifying you about whether your submission
necessary, but may have to withdraw your previous submission in OpenSubmit in order to do so.
If you have not already, please go to Settings and set your Student ID to your PennID (all numbers).
Original source: https://www.cnblogs.com/welpython2/p/11822225.html

Time: 2024-10-28 04:55:05