Jonathan Hsu
Building a Command Line JSON Splitter 2019-01-29

Building a Command Line JSON Splitter

I’m not a developer. I have some programming background but I’d say I’m better at following tutorials than I am at programming. I started this project because I was tired of manually copy/pasting large JSON files to split them into smaller pieces. The project is available at https://github.com/jhsu98/json-splitter, but if you’d like to learn how I built it then continue on.

Introduction

Before we get started, I’m going to assume you’ve installed Python 3.x and know that you run the script by typing python3 json-splitter.py in your terminal.

Getting Started

First things first, create a file named json-splitter.py. We'll need three modules for this project, all of which ship with Python's standard library, so there is nothing extra to install...

  • os: responsible for reading the file size and the file's base name (the file itself is opened with the built-in open() function)
  • json: responsible for the decoding and encoding of json data
  • math: necessary for calculating the number of files when splitting the JSON file

Our first step will be importing these three modules and printing a welcome message.

import os
import json
import math

print('Welcome to the JSON Splitter')
print('First, enter the name of the file you want to split')

In order to split a JSON file we need to ask the user for one. We're going to use a try/except block to prompt the user for the file, open the file, and check for a JSON Array. If any of these three things fail then the script cannot work and we'll exit.

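try:
    # request file name
    file_name = input('filename: ')
    file_size = os.path.getsize(file_name)

    # open and parse the file; the with block closes it afterwards
    with open(file_name) as f:
        data = json.load(f)

    if isinstance(data, list):
        data_len = len(data)
        print('Valid JSON file found')
    else:
        print('JSON is not an Array of Objects')
        exit()

# catch missing or unreadable files (OSError) and invalid JSON (ValueError);
# a bare except would also catch the SystemExit raised by exit() above
except (OSError, ValueError):
    print('Error loading JSON file ... exiting')
    exit()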

Some highlights from the code above:

  • The input() function is used to prompt the user for text input
  • We use os.path.getsize() to get the file size which is needed later when splitting.
  • The isinstance() function is used to make sure the JSON file is an Array, presumably of Objects
  • If any of the code in the try section causes an error, the except block will execute
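
The load-and-validate step above can also be sketched as a small reusable function. This is only an illustration, not part of the original script: the function name load_json_array and the sample file are made up for the demo, and raising ValueError for a non-array file is my own assumption about how you might want to report the problem.

```python
import json
import os
import tempfile


def load_json_array(path):
    """Return (data, size_in_bytes) if the file holds a JSON array."""
    size = os.path.getsize(path)      # raises OSError if the file is missing
    with open(path) as f:
        data = json.load(f)           # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, list):
        raise ValueError('top-level JSON value is not an array')
    return data, size


# demo with a throwaway file (hypothetical sample data)
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.json')
    with open(sample, 'w') as f:
        json.dump([{'id': 1}, {'id': 2}], f)
    data, size = load_json_array(sample)
    print(len(data), 'items,', size, 'bytes')
```

Since json.JSONDecodeError is a subclass of ValueError, a caller only has to handle OSError and ValueError to cover a missing file, invalid JSON, and a file that is not an array.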

Now that the JSON file has been loaded into the variable data, it's time to find out how to split the file. We're going to split our file based on a maximum file size for each chunk. If the file is already smaller than that maximum size then there is nothing to split, so we'll notify the user and gracefully exit the script.

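# get numeric input
try:
    mb_per_file = abs(float(input('Enter maximum file size (MB): ')))
    # a size of zero would cause a division by zero later on
    if mb_per_file == 0:
        raise ValueError('maximum file size must be greater than zero')
except ValueError:
    print('Error entering maximum file size ... exiting')
    exit()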

# check that file is larger than max size
if file_size < mb_per_file * 1000000:
    print('File smaller than split size, exiting')
    exit()

# determine number of files necessary
num_files = math.ceil(file_size/(mb_per_file*1000000))
print('File will be split into',num_files,'equal parts')

The most important aspect of the above code is that we convert the String input to a Float (and take the absolute value for good measure). To finish up, we'll calculate, store and print the number of files that will be created based on the maximum file size.
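
To make the arithmetic concrete, here is a small worked example of the file-count calculation. The helper name count_files and the constant BYTES_PER_MB are my own for illustration; the formula is the same one used in the script, where 1 MB is treated as 1,000,000 bytes.

```python
import math

BYTES_PER_MB = 1000000  # the script treats 1 MB as 1,000,000 bytes


def count_files(file_size, mb_per_file):
    """Number of chunks needed so the file size divides into the maximum size."""
    return math.ceil(file_size / (mb_per_file * BYTES_PER_MB))


print(count_files(2500000, 1))    # 2.5 MB file, 1 MB limit   -> 3
print(count_files(2000000, 1))    # exact multiple            -> 2
print(count_files(2500000, 0.5))  # 2.5 MB file, 0.5 MB limit -> 5
```

One thing to keep in mind: the script later splits by number of items rather than by bytes, so if some items are much larger than others, an individual output file could still end up above the limit.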

Okay, next is to set up the data structure for holding the split pieces and the cutoffs for each piece. We'll use a 2D Array, also known as an Array of Arrays. Think of a 2D Array as a spreadsheet with a Y-axis (number of nested arrays) and an X-axis (size of each nested array). We'll create the correct number of nested arrays by looping over the calculated number of chunks. To find the cutoff points, let's divide the length of the JSON Array by the number of files, multiply that by each chunk's position, and round down. Last, add the length to the end of the array of indices so the final chunk has an end point.

# initialize 2D array
split_data = [[] for i in range(0,num_files)]

# determine indices of cutoffs in array
starts = [math.floor(i * data_len/num_files) for i in range(0,num_files)]
starts.append(data_len)
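
Here is a quick worked example of the cutoff calculation, wrapped in a hypothetical helper named cutoffs so it can be run on its own. With 10 items and 3 files, the chunks come out as 3, 3, and 4 items.

```python
import math


def cutoffs(data_len, num_files):
    """Start index of every chunk, plus data_len as the final end point."""
    starts = [math.floor(i * data_len / num_files) for i in range(num_files)]
    starts.append(data_len)
    return starts


print(cutoffs(10, 3))  # [0, 3, 6, 10]
print(cutoffs(7, 2))   # [0, 3, 7]
```

Each chunk i then covers the items from starts[i] up to, but not including, starts[i+1].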

We're all set to slice up our array. For every chunk to create, we'll loop through the JSON array starting at the current cutoff index and stopping just before the next cutoff index. As we complete each chunk we'll make a file and write the chunk to it, then print a message saying that part is complete. Finally, once the loop is done, we'll print a message letting the user know the entire script has completed.

# loop through 2D array
for i in range(0,num_files):
    # loop through each range in array
    for n in range(starts[i],starts[i+1]):
        split_data[i].append(data[n])

    # create file when section is complete
    # splitext drops only the final extension, so dots in the base name survive
    name = os.path.splitext(os.path.basename(file_name))[0] + '_' + str(i+1) + '.json'
    with open(name, 'w') as outfile:
        json.dump(split_data[i], outfile)

    print('Part',str(i+1),'... completed')

print('Success! Script Completed')
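
As an aside, the inner loop can also be written with Python's list slicing. The sketch below is an alternative, not the script's own approach: the function name write_chunks, the out_dir parameter, and the sample data are assumptions added so the example can run in a temporary folder without touching your files.

```python
import json
import os
import tempfile


def write_chunks(data, starts, base_name, out_dir):
    """Write each slice data[starts[i]:starts[i+1]] to its own numbered file."""
    paths = []
    for i in range(len(starts) - 1):
        chunk = data[starts[i]:starts[i + 1]]   # slicing replaces the inner loop
        path = os.path.join(out_dir, base_name + '_' + str(i + 1) + '.json')
        with open(path, 'w') as outfile:
            json.dump(chunk, outfile)
        paths.append(path)
    return paths


# demo: 10 items split at the cutoffs [0, 3, 6, 10]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_chunks(list(range(10)), [0, 3, 6, 10], 'sample', tmp)
    names = [os.path.basename(p) for p in paths]
    with open(paths[2]) as f:
        last_chunk = json.load(f)

print(names)       # ['sample_1.json', 'sample_2.json', 'sample_3.json']
print(last_chunk)  # [6, 7, 8, 9]
```

Slicing produces the same chunks as appending item by item, and it means the 2D array is no longer needed to hold the pieces.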

Hope you enjoyed the tutorial! Remember to check out the full script on GitHub: https://github.com/jhsu98/json-splitter. If you have any questions please let me know. Feel free to follow me on Twitter, Instagram, or Medium for updates.