Distributed data analysis with plain UNIX commands and Docker Swarm

Docker

Editor: To set up the Docker Swarm cluster used in this article, the author uses Docker Machine. Keep that in mind, because the pre-stable version of Docker has orchestration built in, so Docker Machine is about to go the way of the dodo.

The purpose of this post is to show how powerful and flexible Docker Swarm can be when combined with standard UNIX tools to analyze data in a distributed fashion. To do this, let’s write a simple MapReduce implementation in bash/sh that uses Docker Swarm to schedule Map jobs on nodes across the cluster.

MapReduce is usually implemented when there’s a large dataset to process. For the sake of simplicity and for reproducibility by the reader, we’re using a very small dataset composed of a few megabytes of text files.

This post is not about showing you how to write a MapReduce program. It’s also not about suggesting that MapReduce is best done in this way. Instead, this post is about making you aware that plain old UNIX tools such as sort, awk, netcat, pv, uniq, xargs, join, time, and cat, chained together with pipes, can be useful for distributed data processing when running on top of a Docker Swarm cluster.
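To give a feel for the idea, here is a minimal word-count sketch built only from tr, sort, uniq, and awk. It is not the implementation from the full article. The names `MAP_RUNNER`, `MAP_SCRIPT`, `map_file`, `reduce_counts`, and `wordcount` are made up for this illustration, and the map step runs in a local shell unless you override the runner.

```shell
#!/bin/sh
# Word count as a tiny MapReduce made of plain UNIX tools.
# MAP_RUNNER is a hypothetical knob (not from the article): the command
# that executes the map script. It defaults to a local shell.
MAP_RUNNER=${MAP_RUNNER:-sh -c}

# Map step: turn stdin into one lowercase word per line.
MAP_SCRIPT='tr -cs "[:alpha:]" "\n" | tr "[:upper:]" "[:lower:]"'

# Run one map job for one input file, feeding the file on stdin.
map_file() {
    # $MAP_RUNNER is left unquoted on purpose so it splits into command + args.
    $MAP_RUNNER "$MAP_SCRIPT" < "$1"
}

# Reduce step: group identical words and emit "word count" lines.
# NF == 2 drops any empty "word" produced by leading punctuation.
reduce_counts() {
    LC_ALL=C sort | uniq -c | awk 'NF == 2 { print $2, $1 }'
}

# Driver: map every file passed as an argument, then reduce the merged output.
wordcount() {
    for f in "$@"; do
        map_file "$f"
    done | reduce_counts
}
```

After sourcing the script, calling `wordcount` with a list of text files prints one `word count` pair per line. To move the map step onto a cluster, one option, assuming your Docker client already points at the Swarm manager and the image ships a compatible `tr`, is to set `MAP_RUNNER='docker run --rm -i alpine sh -c'`, so each file is streamed over stdin to a container that Swarm places on a node. The loop here runs map jobs one after another; running them in parallel (for example with xargs) is left out to keep the sketch short.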

Read the complete article here.
