Data Science at the Command Line
Data Science at the Command Line is a new book written by Jeroen Janssens. This website contains information about the upcoming workshop in London, the webcast from August 20th, instructions on how to install the Data Science Toolbox, and an overview of all the command-line tools discussed in the book.
This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.
Upcoming Workshop in London
On February 20 and 21, 2015 in London, I’ll give a two-day workshop about Data Science at the Command Line. This workshop is organized by The School of Data Science. The registration page contains information about the pricing, an overview of the topics we’ll cover, and the prerequisites needed for this workshop. You’ll receive a free copy of the book.
Webcast Recording Now Available
On August 20, 2014 at 17:00 UTC, I did a two-hour webcast about the book. With over 1,000 attendees, the webcast was a great success. In case you missed it you can view the recording. You may need to sign up or log in.
If you want to follow along with the code and commands presented during the webcast, you can install the Data Science Toolbox. This is an easy-to-install virtual machine that contains all the command-line tools discussed in the book. It runs on Microsoft Windows, Mac OS X, and Linux. See below for installation instructions.
Installing the Data Science Toolbox
In both the book and the webcast we use many different command-line tools. The distribution of GNU/Linux that we assume, Ubuntu, comes with a whole bunch of command-line tools pre-installed. Moreover, Ubuntu offers many packages that contain other, relevant command-line tools. Installing these packages yourself is not too difficult. However, we also use command-line tools that are not available as packages and require a more manual, and more involved, installation. In order to acquire the necessary command-line tools without having to go through the involved installation process of each, we encourage you to install the Data Science Toolbox.
The Data Science Toolbox is a virtual environment that allows you to get started doing data science in minutes. The default version comes with commonly used software for data science, including the Python scientific stack and R together with its most popular packages. Additional software and data bundles are easily installed. These bundles can be specific to a certain book, course, or organization. You can read more about the Data Science Toolbox in general at http://datasciencetoolbox.org.
While there exists a bundle for this book, which contains all the data, scripts, and command-line tools used in this book, we have also created a pre-bundled Data Science Toolbox specifically for this book. The following five steps describe how to install it.
Step 1: Download and Install VirtualBox
Browse to the VirtualBox download page (https://www.virtualbox.org/wiki/Downloads) and download the appropriate binary for your operating system. Open the binary and follow the installation instructions.
Step 2: Download and Install Vagrant
Similarly to Step 1, browse to the Vagrant (HashiCorp, 2014) download page (http://www.vagrantup.com/downloads.html) and download the appropriate binary. Open the binary and follow the installation instructions. If you already have Vagrant installed, please make sure that it’s version 1.5 or higher.
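If you are unsure which version is installed, the check can be scripted. The following is only a sketch under a few assumptions: that vagrant --version prints a line of the form "Vagrant 1.6.3", that your sort supports the -V (version sort) option as GNU sort does, and that you run it in a Bash-like shell. The function name version_at_least is our own and not part of Vagrant.

```shell
# Hypothetical helper (not part of Vagrant): succeeds when version $1 >= version $2.
# Assumes a sort that supports -V (version sort), such as GNU sort.
version_at_least() {
    # After a version sort, the first line is the lower of the two versions.
    lowest_version=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest_version" = "$2" ]
}

# Only query Vagrant when it is actually installed.
if command -v vagrant > /dev/null; then
    # Assumes output of the form "Vagrant 1.6.3"; the second word is the version.
    installed=$(vagrant --version | awk '{print $2}')
    if version_at_least "$installed" "1.5"; then
        echo "Vagrant $installed is recent enough"
    else
        echo "Vagrant $installed is too old; please upgrade"
    fi
fi
```

On systems whose sort lacks -V, comparing the version numbers by eye after running vagrant --version works just as well.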
Step 3: Download and Start the Data Science Toolbox
Open a terminal (known as the command prompt in Microsoft Windows). Create a directory, for example MyDataScienceToolbox, and navigate to it by typing:
$ mkdir MyDataScienceToolbox
$ cd MyDataScienceToolbox
In order to initialize the Data Science Toolbox, run the following command:
$ vagrant init data-science-toolbox/data-science-at-the-command-line
This creates a file named Vagrantfile. This is a configuration file that tells Vagrant how to launch the virtual machine. This file contains a lot of lines that are commented out. A minimal version is:
Vagrant.configure(2) do |config|
  config.vm.box = "data-science-toolbox/data-science-at-the-command-line"
end
By running the following command, the Data Science Toolbox will be downloaded and booted.
$ vagrant up
If everything went well, then you now have a Data Science Toolbox running on your local machine.
If you ever see the message default: Warning: Connection timeout. Retrying... printed repeatedly, then it may be that the virtual machine is waiting for input. This may happen when the virtual machine has not been properly shut down. In order to find out what’s wrong, add the following lines to Vagrantfile before the last end statement:
config.vm.provider "virtualbox" do |vb|
  vb.gui = true
end
This will cause VirtualBox to show a screen. Once the virtual machine has booted and you have identified the problem, you can remove these lines from Vagrantfile. The username and password are both vagrant.
Here’s a slightly more elaborate Vagrantfile:
Vagrant.require_version ">= 1.5.0"
Vagrant.configure(2) do |config|
  config.vm.box = "data-science-toolbox/data-science-at-the-command-line"
  config.vm.network "forwarded_port", guest: 8000, host: 8000
  config.vm.provider "virtualbox" do |vb|
    vb.gui = true
    vb.memory = 2048
    vb.cpus = 2
  end
end
This file, which you can download here, ensures that Vagrant:
- Is at least at version 1.5.0.
- Forwards port 8000, which is useful if you want to view a figure you created.
- Launches a graphical user interface.
- Uses 2 GB of memory.
- Uses 2 CPUs.
You can view more configuration options at http://docs.vagrantup.com/v2.
Step 4: Log in (on Linux and Mac OS X)
If you are running Linux, Mac OS X, or some other UNIX-like operating system, you can log in to the Data Science Toolbox by running the following command in a terminal:
$ vagrant ssh
After a few seconds you will be greeted with the following message:
Welcome to the Data Science Toolbox for Data Science at the Command Line
Based on Ubuntu 14.04 LTS (GNU/Linux 3.13.0-24-generic x86_64)
 * Data Science at the Command Line: http://datascienceatthecommandline.com
 * Data Science Toolbox: http://datasciencetoolbox.org
 * Ubuntu documentation: http://help.ubuntu.com
Last login: Tue Jul 22 19:33:16 2014 from 10.0.2.2
Step 4: Log in (on Microsoft Windows)
If you are running Microsoft Windows, you need to either run Vagrant with a graphical user interface (see above on how to set that up) or use a third-party application in order to log in to the Data Science Toolbox. We recommend PuTTY for this. Browse to the PuTTY download page (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html) and download putty.exe. Run PuTTY, and enter the following values:
- Host Name (or IP address): 127.0.0.1
- Port: 2222
- Connection type: SSH
If you want, you can save these values as a session by clicking the Save button, so that you do not need to enter these values again. Click the Open button and enter vagrant for both the username and the password.
Step 5: Get the Data and Scripts
Currently, the Data Science Toolbox does not contain all the data and scripts. An update will be uploaded once the book is finished. For now, run the following to obtain the data and scripts:
$ cd ~/book
$ git pull
List of Command-line Tools
This is an overview of all the command-line tools discussed in the book. (Please note that, due to time constraints, the webcast will only discuss a subset.) This includes binary executables, interpreted scripts, and Bash builtins and keywords. For each command-line tool, we state, when available and appropriate, the following information:
- The actual command to type at the command line.
- A description.
- The name of the package it belongs to.
- The version used in the book.
- The year that version was released.
- The primary author(s).
- A website to find more information.
- How to install it.
- How to obtain help.
- An example usage.
All command-line tools listed here are included in the Data Science Toolbox for Data Science at the Command Line. See the previous sections for instructions on how to set it up. The install commands assume that you’re running Ubuntu 14.04. Please note that citing open source software is not trivial, and that some information may be missing or incorrect.
alias
Define or display aliases. Alias is a Bash builtin.
$ help alias
$ alias ll='ls -alF'
awk
Pattern scanning and text processing language. Mawk (version 1.3.3) by Mike Brennan (1994). http://invisible-island.net/mawk.
$ sudo apt-get install mawk
$ man awk
$ seq 5 | awk '{sum+=$1} END {print sum}'
15
aws
Manage AWS Services such as EC2 and S3 from the command line. AWS Command Line Interface (version 1.3.24) by Amazon Web Services (2014). http://aws.amazon.com/cli.
$ sudo pip install awscli
$ aws help
$ aws ec2 describe-regions | head -n 5
{
    "Regions": [
        {
            "Endpoint": "ec2.eu-west-1.amazonaws.com",
            "RegionName": "eu-west-1"
bash
GNU Bourne-Again SHell. Bash (version 4.3) by Brian Fox and Chet Ramey (2010). http://www.gnu.org/software/bash.
$ sudo apt-get install bash
$ man bash
bc
Evaluate equation from standard input. Bc (version 1.06.95) by Philip A. Nelson (2006). http://www.gnu.org/software/bc.
$ sudo apt-get install bc
$ man bc
$ echo 'e(1)' | bc -l
2.71828182845904523536
bigmler
Access BigML’s prediction API. BigMLer (version 1.12.2) by BigML (2014). http://bigmler.readthedocs.org.
$ sudo pip install bigmler
$ bigmler --help
body
Apply expression to all but the first line. Useful if you want to apply classic command-line tools to CSV files with a header. Body by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ echo -e "value\n7\n2\n5\n3" | body sort -n
value
2
3
5
7
cat
Concatenate files and standard input, and print on standard output. Cat (version 8.21) by Torbjorn Granlund and Richard M. Stallman (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man cat
$ cat results-01 results-02 results-03 > results-all
cd
Change the shell working directory. Cd is a Bash builtin.
$ help cd
$ cd ~; pwd; cd ..; pwd
/home/vagrant
/home
chmod
Change file mode bits. We use it to make our command-line tools executable. Chmod (version 8.21) by David MacKenzie and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man chmod
$ chmod u+x experiment.sh
cols
Apply a command to a subset of the columns and merge the result back with the remaining columns. Cols by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ < iris.csv cols -C species body tapkee --method pca | header -r x,y,species
cowsay
Generate an ASCII picture of a cow with a message. Particularly useful when building up a particular pipeline is starting to frustrate you a bit too much. Cowsay (version 3.03+dfsg1) by Tony Monroe (1999).
$ sudo apt-get install cowsay
$ man cowsay
$ echo 'The command line is awesome!' | cowsay
 ______________________________
< The command line is awesome! >
 ------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||
cp
Copy files and directories. Cp (version 8.21) by Torbjorn Granlund, David MacKenzie, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man cp
csvcut
Extract columns from CSV data. Like the cut command-line tool, but for tabular data. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvcut --help
csvgrep
Filter tabular data to only those rows where certain columns contain a given value or match a regular expression. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvgrep --help
csvjoin
Merge two or more CSV tables together using a method analogous to SQL JOIN operation. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvjoin --help
csvlook
Renders a CSV file to the command line in a readable, fixed-width format. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvlook --help
$ echo -e "a,b\n1,2\n3,4" | csvlook
|----+----|
|  a | b  |
|----+----|
|  1 | 2  |
|  3 | 4  |
|----+----|
csvsort
Sort CSV files. Like the sort command-line tool, but for tabular data. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvsort --help
csvsql
Execute SQL queries directly on CSV data or insert CSV into a database. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvsql --help
csvstack
Stack up the rows from multiple CSV files, optionally adding a grouping value to each row. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvstack --help
csvstat
Print descriptive statistics for all columns in a CSV file. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ csvstat --help
curl
Download data from a URL. cURL (version 7.35.0) by Daniel Stenberg (2012). http://curl.haxx.se.
$ sudo apt-get install curl
$ man curl
cut
Remove sections from each line of files. Cut (version 8.21) by David M. Ihnat, David MacKenzie, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man cut
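As an additional illustration that is not taken from the book, the following sketch keeps the first and third fields of some made-up, comma-separated input. The data and the variable name selected are our own assumptions.

```shell
# Hypothetical comma-separated input, generated on the fly with printf.
# -d sets the field delimiter and -f selects the fields to keep (here 1 and 3).
selected=$(printf 'name,age,city\nalice,30,london\nbob,25,paris\n' | cut -d , -f 1,3)
echo "$selected"
```

Keep in mind that cut splits on every delimiter it sees, so it does not understand quoted CSV fields that contain commas; csvcut is the safer choice for such data.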
display
Display an image or image sequence on any X server. Can read image data from standard input. Display (version 8:6.7.7.10) by ImageMagick Studio LLC (2009). http://www.imagemagick.org.
$ sudo apt-get install imagemagick
$ man display
drake
Manage a data workflow. Drake (version 0.1.6) by Factual (2014). https://github.com/Factual/drake.
$ # Please see Chapter 6 for installation instructions.
$ drake --help
dseq
Generate sequence of dates relative to today. Dseq by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ dseq -2 0 # day before yesterday till today
2014-07-15
2014-07-16
2014-07-17
echo
Display a line of text. Echo (version 8.21) by Brian Fox and Chet Ramey (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man echo
env
Run a program in a modified environment. It’s often used to specify which interpreter should run our script. Env (version 8.21) by Richard Mlynarik and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man env
$ #!/usr/bin/env python
export
Set export attribute for shell variables. Useful for making shell variables available to other command-line tools. Export is a Bash builtin.
$ help export
$ export WEKAPATH=$HOME/bin
feedgnuplot
Generate a script for gnuplot while passing data to standard input. Feedgnuplot (version 1.32) by Dima Kogan (2014). http://search.cpan.org/perldoc?feedgnuplot.
$ sudo apt-get install feedgnuplot
$ man feedgnuplot
find
Search for files in a directory hierarchy. Find (version 4.4.2) by James Youngman (2008). http://www.gnu.org/software/findutils.
$ sudo apt-get install findutils
$ man find
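As an additional illustration that is not taken from the book, the following sketch builds a small throwaway directory tree and lists the CSV files in it. The directory layout and file names are invented for the example, and mktemp is assumed to be available.

```shell
# Build a temporary, hypothetical directory tree to search in.
workdir=$(mktemp -d)
mkdir -p "$workdir/data/raw"
touch "$workdir/data/iris.csv" "$workdir/data/raw/log.txt" "$workdir/data/raw/wine.csv"

# -name matches on the file name; the pattern is quoted so the shell
# does not expand it before find sees it. sort gives a stable order.
found=$(cd "$workdir" && find . -name '*.csv' | sort)
echo "$found"

# Clean up the temporary tree.
rm -r "$workdir"
```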
for
Execute commands for each member in a list. In Chapter 8, we discuss the advantages of using parallel instead of for. For is a Bash keyword.
$ help for
$ for i in {A..C} "It's easy as" {1..3}; do echo $i; done
A
B
C
It's easy as
1
2
3
git
Manage repositories for git, which is a distributed version control system. Git (version 1:1.9.1) by Linus Torvalds and Junio C. Hamano (2014). http://git-scm.com.
$ sudo apt-get install git
$ man git
grep
Print lines matching a pattern. Grep (version 2.16) by Jim Meyering (2012). http://www.gnu.org/software/grep.
$ sudo apt-get install grep
$ man grep
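As an additional illustration that is not taken from the book, the following sketch filters the numbers 1 to 30 down to those containing the digit 3, and then counts them with the -c option. The variable names are our own.

```shell
# Keep only the lines (numbers) that contain the digit 3.
matches=$(seq 30 | grep 3)
echo "$matches"

# -c prints the number of matching lines instead of the lines themselves.
count=$(seq 30 | grep -c 3)
echo "$count"
```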
head
Output the first part of files. Head (version 8.21) by David MacKenzie and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man head
$ seq 5 | head -n 3
1
2
3
header
Add, replace, and delete header lines. Header by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ header -h
in2csv
Convert common, but less awesome, tabular data formats to CSV. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ in2csv --help
jq
Process JSON. Jq (version jq-1.4) by Stephen Dolan (2014). http://stedolan.github.com/jq.
$ # See website for installation instructions
$ # See website for documentation
json2csv
Convert JSON to CSV. Json2Csv (version 1.1) by Jehiah Czebotar (2014). https://github.com/jehiah/json2csv.
$ go get github.com/jehiah/json2csv
$ json2csv --help
less
Paginate large files. Less (version 458) by Mark Nudelman (2013). http://www.greenwoodsoftware.com/less.
$ sudo apt-get install less
$ man less
$ csvlook iris.csv | less
ls
List directory contents. Ls (version 8.21) by Richard M. Stallman and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man ls
man
Read reference manuals of command-line tools. Man (version 2.6.7.1) by John W. Eaton and Colin Watson (2014).
$ sudo apt-get install man
$ man man
$ man grep
mkdir
Make directories. Mkdir (version 8.21) by David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man mkdir
mv
Move or rename files and directories. Mv (version 8.21) by Mike Parker, David MacKenzie, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man mv
parallel
Build and execute shell command lines from standard input in parallel. GNU Parallel (version 20140622) by Ole Tange (2014). http://www.gnu.org/software/parallel.
$ # See website for installation instructions
$ man parallel
$ seq 3 | parallel echo Processing file {}.csv
Processing file 1.csv
Processing file 2.csv
Processing file 3.csv
paste
Merge lines of files. Paste (version 8.21) by David M. Ihnat and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man paste
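As an additional illustration that is not taken from the book, the following sketch shows two common uses of paste on standard input: joining all lines into one line with a chosen delimiter, and folding a single column into three tab-separated columns. The variable names are our own.

```shell
# -s joins all input lines into a single line; -d sets the delimiter.
# The trailing - tells paste to read from standard input.
expression=$(seq 5 | paste -s -d '+' -)
echo "$expression"

# Naming standard input three times makes paste take three lines per
# output row, which turns one column into three tab-separated columns.
table=$(seq 6 | paste - - -)
echo "$table"
```

The first result is a valid arithmetic expression, so it could be piped into a calculator such as bc to obtain the sum.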
pbc
Run bc with parallel. First column of input CSV is mapped to {1}, second to {2}, and so forth. Pbc by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ seq 5 | pbc '{1}^2'
1
4
9
16
25
pip
Install and manage Python packages. Pip (version 1.5.4) by PyPA (2014). https://pip.pypa.io.
$ sudo apt-get install python-pip
$ man pip
pwd
Print name of current working directory. Pwd (version 8.21) is a Bash builtin by Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ man pwd
$ pwd
/home/vagrant
python
Execute Python, which is an interpreted, interactive, and object-oriented programming language. Python (version 2.7.5) by Python Software Foundation (2014). http://www.python.org.
$ sudo apt-get install python
$ man python
R
Analyze data and create visualizations with the R programming language. To install the latest version of R on Ubuntu, follow the instructions on http://cran.r-project.org/bin/linux/ubuntu/README. R (version 3.1.1) by R Foundation for Statistical Computing (2014). http://www.r-project.org.
$ sudo apt-get install r-base-dev
$ man R
Rio
Load CSV from standard input into R as a data.frame, execute given commands, and get the output as CSV or PNG. Rio by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ Rio -h
$ seq 10 | Rio -nf sum
55
Rio-scatter
Create a scatter plot from CSV using Rio. Rio-Scatter by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ < iris.csv Rio-scatter sepal_length sepal_width species > iris.png
rm
Remove files or directories. Rm (version 8.21) by Paul Rubin, David MacKenzie, Richard M. Stallman, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man rm
run_experiment
Run machine learning experiments with scikit-learn. SciKit-Learn Laboratory (version 0.26.0) by Educational Testing Service (2014). https://skll.readthedocs.org.
$ sudo pip install skll
$ run_experiment --help
sample
Print lines from standard input with a given probability, for a given duration, and with a given delay between lines. Sample by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ sample --help
scp
Copy remote files securely. Scp (version 1:6.6p1) by Timo Rinne and Tatu Ylonen (2014). http://www.openssh.com.
$ sudo apt-get install openssh-client
$ man scp
scrape
Extract HTML elements using an XPath query or CSS3 selector. Scrape by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ curl -sL 'http://datasciencetoolbox.org' | scrape -e 'head > title'
<title>Data Science Toolbox</title>
sed
Filter and transform text. Sed (version 4.2.2) by Jay Fenlason, Tom Lord, Ken Pizzini, and Paolo Bonzini (2012). http://www.gnu.org/software/sed.
$ sudo apt-get install sed
$ man sed
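As an additional illustration that is not taken from the book, the following sketch shows two small sed expressions: a substitution and the deletion of the first line, which is handy for dropping a header. The inputs and variable names are our own.

```shell
# s/old/new/ replaces the first occurrence of "old" on each line.
replaced=$(echo 'hello world' | sed 's/world/command line/')
echo "$replaced"

# 1d deletes line 1 and prints everything else.
without_first=$(seq 5 | sed '1d')
echo "$without_first"
```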
seq
Print a sequence of numbers. Seq (version 8.21) by Ulrich Drepper (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man seq
$ seq 5
1
2
3
4
5
shuf
Generate random permutations. Shuf (version 8.21) by Paul Eggert (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man shuf
sort
Sort lines of text files. Sort (version 8.21) by Mike Haertel and Paul Eggert (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man sort
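As an additional illustration that is not taken from the book, the following sketch contrasts the default textual ordering with the numeric ordering requested by -n, a difference that is easy to overlook when sorting numbers. The input values and variable names are our own.

```shell
# By default, sort compares lines as text, so "100" comes before "9".
as_text=$(printf '10\n9\n100\n' | sort)
echo "$as_text"

# -n compares lines by their numeric value instead.
as_numbers=$(printf '10\n9\n100\n' | sort -n)
echo "$as_numbers"
```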
split
Split a file into pieces. Split (version 8.21) by Torbjorn Granlund and Richard M. Stallman (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man split
sql2csv
Executes arbitrary commands against an SQL database and outputs the results as a CSV. Csvkit (version 0.8.0) by Christopher Groskopf (2014). http://csvkit.readthedocs.org.
$ sudo pip install csvkit
$ sql2csv --help
ssh
Log in to remote machines. OpenSSH client (version 1.8.9) by Tatu Ylonen, Aaron Campbell, Bob Beck, Markus Friedl, Niels Provos, Theo de Raadt, and Dug Song (2014). http://www.openssh.com.
$ sudo apt-get install ssh
$ man ssh
sudo
Execute a command as another user. Sudo (version 1.8.9p5) by Todd C. Miller (2013). http://www.sudo.ws/sudo.
$ sudo apt-get install sudo
$ man sudo
tail
Output the last part of files. Tail (version 8.21) by Paul Rubin, David MacKenzie, Ian Lance Taylor, and Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man tail
$ seq 5 | tail -n 3
3
4
5
tapkee
Reduce dimensionality of a data set using various algorithms. Tapkee by Sergey Lisitsyn and Fernando Iglesias (2014). http://tapkee.lisitsyn.me.
$ # See website for installation instructions
$ tapkee --help
$ < iris.csv cols -C species body tapkee --method pca | header -r x,y,species
tar
Create, list, and extract tar archives. Tar (version 1.27.1) by Jeff Bailey, Paul Eggert, and Sergey Poznyakoff (2014). http://www.gnu.org/software/tar.
$ sudo apt-get install tar
$ man tar
tee
Read from standard input and write to standard output and files. Tee (version 8.21) by Mike Parker, Richard M. Stallman, and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man tee
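As an additional illustration that is not taken from the book, the following sketch uses tee to save an intermediate result to a file while the pipeline continues. The temporary file and the variable names are our own, and mktemp is assumed to be available.

```shell
# Create a temporary file to hold the intermediate data.
tmpfile=$(mktemp)

# tee writes a copy of its input to the file and passes it on to wc.
line_count=$(seq 5 | tee "$tmpfile" | wc -l)
saved=$(cat "$tmpfile")
rm "$tmpfile"

echo "$line_count"
echo "$saved"
```

This pattern is useful for inspecting what flows through the middle of a longer pipeline without breaking it apart.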
tr
Translate or delete characters. Tr (version 8.21) by Jim Meyering (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man tr
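As an additional illustration that is not taken from the book, the following sketch converts text to upper case and squeezes repeated spaces into one. The inputs and variable names are our own.

```shell
# Map every lower-case letter to its upper-case counterpart.
upper=$(echo 'hello world' | tr '[:lower:]' '[:upper:]')
echo "$upper"

# -s squeezes each run of the given character into a single occurrence.
squeezed=$(echo 'a   b    c' | tr -s ' ')
echo "$squeezed"
```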
tree
List contents of directories in a tree-like format. Tree (version 1.6.0) by Steve Baker (2014). https://launchpad.net/ubuntu/+source/tree.
$ sudo apt-get install tree
$ man tree
type
Display the type of a command-line tool. Type is a Bash builtin.
$ help type
$ type cd
cd is a shell builtin
uniq
Report or omit repeated lines. Uniq (version 8.21) by Richard M. Stallman and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man uniq
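As an additional illustration that is not taken from the book, the following sketch counts how often each value occurs in some made-up input. Because uniq only collapses adjacent duplicates, the input is sorted first. The final awk step is our own addition and merely rewrites each padded count line as value,count.

```shell
# sort groups identical lines together, uniq -c prefixes each distinct
# line with its count, and sort -rn puts the most frequent value first.
counts=$(printf 'b\na\nb\nc\nb\na\n' | sort | uniq -c | sort -rn |
    awk '{print $2 "," $1}')
echo "$counts"
```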
unpack
Extract common file formats. Unpack by Patrick Brisbin (2013). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
$ unpack file.tgz
unrar
Extract files from rar archives. Unrar (version 1:0.0.1+cvs20071127) by Ben Asselstine, Christian Scheurer, and Johannes Winkelmann (2014). http://home.gna.org/unrar.
$ sudo apt-get install unrar-free
$ man unrar
unzip
List, test and extract compressed files in a ZIP archive. Unzip (version 6.0) by Samuel H. Smith (2009). http://www.info-zip.org/pub/infozip.
$ sudo apt-get install unzip
$ man unzip
wc
Print newline, word, and byte counts for each file. Wc (version 8.21) by Paul Rubin and David MacKenzie (2012). http://www.gnu.org/software/coreutils.
$ sudo apt-get install coreutils
$ man wc
$ echo 'hello world' | wc -c
12
weka
Weka is a collection of machine learning algorithms for data mining tasks by Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. This command-line tool allows you to run Weka from the command line. Weka command-line tool by Jeroen H.M. Janssens (2014). https://github.com/jeroenjanssens/data-science-at-the-command-line.
$ git clone https://github.com/jeroenjanssens/data-science-at-the-command-line.git
which
Locate a command-line tool. Does not work for Bash builtins. Which by Unknown (2009).
$ man which
$ which man
/usr/bin/man
xml2json
Convert XML to JSON. Xml2Json (version 0.0.2) by Francois Parmentier (2014). https://github.com/parmentf/xml2json.
$ npm install xml2json-command
$ xml2json < input.xml > output.json