Most useful terminal commands for data processing on Linux / macOS
Very often you want to have a look at a CSV file, but the columns are not aligned. To solve this, use the column command:
column -t -s','
The before/after example below shows how much easier it is to read well-aligned data (using public data from the OpenCellID database):
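As a quick self-contained illustration (using made-up sample rows, not the actual OpenCellID data):

```shell
# Create a small unaligned CSV (sample data, for illustration only)
printf 'radio,mcc,cell\nGSM,262,1234\nUMTS,208,98765\n' > sample.csv

# Align the columns for readability
column -t -s',' sample.csv
```

The -t flag tells column to build a table, and -s',' sets the input separator to a comma.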
Bash for loops are super useful whenever you have data spread over several files. A quick example to get all .dat files from a range of folders stored on Amazon S3:
for ((day=12; day<=28; day++)); do
  s3cmd get s3://xxx/bucket/incoming/201606$day/*.dat
done
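The same range can also be written with brace expansion. Here the s3cmd call is replaced by echo, so you can dry-run the loop and check the generated paths first (s3://xxx/... is the placeholder bucket from the example above):

```shell
# Dry run: print the S3 paths the loop would fetch, one per day
for day in {12..28}; do
  echo "s3://xxx/bucket/incoming/201606$day/*.dat"
done
```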
To know the number of lines in the file:
wc -l myfile
To preview the first 17 lines of the file:
head -17 myfile
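For example, on a small generated file (seq and the file name are used here purely for illustration):

```shell
# Generate a 100-line file, one number per line
seq 100 > myfile

wc -l myfile             # prints: 100 myfile
head -17 myfile | wc -l  # prints: 17
```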
To extract columns 2 to 4 and column 7 from a file where the separator character is the pipe "|":
cut -d"|" -f2-4,7 myfile
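A one-line demonstration on made-up data:

```shell
# Fields 2-4 and 7 of a sample pipe-separated record
echo 'a|b|c|d|e|f|g' | cut -d'|' -f2-4,7
# prints: b|c|d|g
```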
To count the number of occurrences of every value present in column 3:
cut -d"|" -f3 myfile | sort | uniq -c
- In the previous command, do not forget the "sort" command! uniq counts only adjacent duplicates, so if you do not sort first, the counts will be wrong.
- "sort" can use multiple processors in parallel, which is handy when the file is very large. Just pass the appropriate option, e.g., "sort --parallel=8".
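Putting the whole counting pipeline together on a tiny sample file (the file name and rows are made up for illustration):

```shell
# Three pipe-separated rows; column 3 holds the value we want to count
printf 'x|y|GSM\nx|y|UMTS\nx|y|GSM\n' > sample.dat

# Extract column 3, sort so duplicates are adjacent, then count them
cut -d'|' -f3 sample.dat | sort | uniq -c
# prints (counts left-padded by uniq):
#   2 GSM
#   1 UMTS
```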
Most useful Hadoop commands
In our Hadoop-driven world, chances are Linux commands won't be enough and you will also need to interact with HDFS and other Hadoop components. Fortunately, this is pretty easy, and the commands are very similar to standard terminal commands:
hdfs dfs -ls /myfolder
hdfs dfs -cat /myfolder/myfile.csv
Have fun, and if you like this post, please share it on social networks or link to it from your website. It will help it appear higher in Google results and thus be found by other people more easily. Thanks!