Building a Command Line JSON Splitter
I’m not a developer. I have some programming background but I’d say I’m better at following tutorials than I am at programming. I started this project because I was tired of manually copy/pasting large JSON files to split them into smaller pieces. The project is available at https://github.com/jhsu98/json-splitter, but if you’d like to learn how I built it then continue on.
Before we get started, I’m going to assume you’ve installed Python 3.x and know how to run a script from the command line.
First things first, create a file named json-splitter.py. We'll need three modules for this project, all of which ship with Python's standard library, so there is nothing extra to install:
- os: responsible for getting the file size and the base name of the file
- json: responsible for the decoding and encoding of JSON data
- math: necessary for calculating the number of files when splitting the JSON file
Our first step will be importing these three modules and printing a welcome message.
    import os
    import json
    import math

    print('Welcome to the JSON Splitter')
    print('First, enter the name of the file you want to split')
In order to split a JSON file we need to ask the user for one. We're going to use a try/except block to prompt the user for the file, open the file, and check for a JSON Array. If any of these three steps fails, the script cannot work and we'll exit.
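    try:
        # request file name
        file_name = input('filename: ')
        file_size = os.path.getsize(file_name)
        # the with block closes the file once it has been read
        with open(file_name) as f:
            data = json.load(f)
        # the splitter only works on a top-level JSON Array
        if isinstance(data, list):
            data_len = len(data)
            print('Valid JSON file found')
        else:
            print("JSON is not an Array of Objects")
            exit()
    except (OSError, ValueError):
        # OSError: missing or unreadable file; ValueError: invalid JSON
        print('Error loading JSON file ... exiting')
        exit()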
Some highlights from the code above:
- The input() function is used to prompt the user for text input
- We use os.path.getsize() to get the file size, which is needed later when splitting
- The isinstance() function is used to make sure the JSON file is an Array, presumably of Objects
- If any of the code in the try section causes an error, the except block will execute
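The same load-and-check logic can also be wrapped in a function, which makes it easier to try out without typing at a prompt. This is only a minimal sketch and not the code from the repository; the name load_json_array is something I made up for illustration.

```python
import json
import os


def load_json_array(path):
    """Return (data, size_in_bytes) for a JSON file whose top level is an Array.

    Hypothetical helper for illustration; it is not part of the original script.
    """
    with open(path) as f:      # the file is closed automatically
        data = json.load(f)    # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, list):
        raise ValueError('JSON is not an Array of Objects')
    return data, os.path.getsize(path)
```

A caller could wrap load_json_array() in try/except (OSError, ValueError) and print the same "exiting" message the script uses.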
Now that the JSON file has been loaded into the variable data, it's time to find out how to split the file. We're going to split our file based on a maximum file size for each chunk. If the file is already smaller than that maximum, there is nothing to split, so we'll tell the user and gracefully exit the script.
    # get numeric input
    try:
        mb_per_file = abs(float(input('Enter maximum file size (MB): ')))
    except ValueError:
        print('Error entering maximum file size ... exiting')
        exit()

    # a size of zero would cause a division by zero below
    if mb_per_file == 0:
        print('Maximum file size must be greater than zero ... exiting')
        exit()

    # check that file is larger than max size (1 MB = 1,000,000 bytes)
    if file_size < mb_per_file * 1000000:
        print('File smaller than split size, exiting')
        exit()

    # determine number of files necessary
    num_files = math.ceil(file_size / (mb_per_file * 1000000))
    print('File will be split into', num_files, 'equal parts')
The most important aspect of the above code is that we convert the String input to a Float (and take the absolute value for good measure). To finish up, we'll calculate, store and print the number of files that will be created based on the maximum file size.
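To make the arithmetic concrete, here is a small worked example of the file-count calculation. The function name count_files and the constant BYTES_PER_MB are my own labels for this sketch; the script itself does the math inline.

```python
import math

BYTES_PER_MB = 1000000  # the script treats 1 MB as 1,000,000 bytes


def count_files(file_size, mb_per_file):
    """Number of chunks needed for a file of file_size bytes (illustrative helper)."""
    return math.ceil(file_size / (mb_per_file * BYTES_PER_MB))


# A 25 MB file with a 10 MB limit needs three files: 25 / 10 = 2.5, rounded up.
print(count_files(25000000, 10))   # 3
# Fractional limits work too, because the input is converted with float().
print(count_files(25000000, 0.5))  # 50
```

Rounding up with math.ceil() matters here: rounding down would give two files for the first case, and each would have to hold more than the 10 MB limit.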
Okay, next is to set up the data structure for holding the split pieces and the cutoffs for each piece. We'll use a 2D Array, also known as an Array of Arrays. Think of a 2D Array as a spreadsheet with a Y-axis (number of nested arrays) and an X-axis (size of each nested array). We'll create the correct number of nested arrays by using a for loop with the calculated number of chunks. To find the cutoff points, let's divide the length of the JSON Array by the number of files and multiply by each chunk's index. Last, add the length of the Array to the end of the list of indices so the final chunk has a stopping point.
    # initialize 2D array
    split_data = [[] for i in range(0, num_files)]

    # determine indices of cutoffs in array
    starts = [math.floor(i * data_len / num_files) for i in range(0, num_files)]
    starts.append(data_len)
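If the cutoff math feels abstract, it can help to run it on small numbers. The sketch below repeats the same formula inside a function (cutoff_indices is a name I picked for illustration) and prints the result for ten records split three ways.

```python
import math


def cutoff_indices(data_len, num_files):
    """Start index of each chunk, followed by data_len as the final stop."""
    starts = [math.floor(i * data_len / num_files) for i in range(num_files)]
    starts.append(data_len)
    return starts


# Ten records split three ways: the chunks cover [0:3], [3:6] and [6:10].
print(cutoff_indices(10, 3))  # [0, 3, 6, 10]
```

Notice that the chunks are equal (or nearly equal) by number of records, not by bytes. If some records are much larger than others, a chunk could still end up bigger than the maximum size that was entered.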
We're all set to slice up our array. For every chunk to create, we'll loop through the JSON array starting at the current cutoff index and stopping at the next cutoff index. As we complete each chunk we'll create a file and write the chunk to it, then print a message saying that part is complete. Once the loop is done, a final message lets the user know the entire script has completed.
    # loop through 2D array
    for i in range(0, num_files):
        # loop through each range in array
        for n in range(starts[i], starts[i+1]):
            split_data[i].append(data[n])

        # create file when section is complete
        name = os.path.basename(file_name).split('.')[0] + '_' + str(i+1) + '.json'
        with open(name, 'w') as outfile:
            json.dump(split_data[i], outfile)

        print('Part', str(i+1), '... completed')

    print('Success! Script Completed')
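As an alternative to copying records one at a time, Python's list slicing can grab each chunk in a single step. Here's a sketch of that idea, not the version in the repository. The function write_chunks and its out_dir parameter are hypothetical, and I've used os.path.splitext() for the base name, which keeps everything before the last dot (so my.data.json becomes my.data) rather than everything before the first dot.

```python
import json
import os


def write_chunks(data, starts, file_name, out_dir='.'):
    """Write data[starts[i]:starts[i+1]] to <base>_<i+1>.json and return the paths."""
    base = os.path.splitext(os.path.basename(file_name))[0]
    paths = []
    for i in range(len(starts) - 1):
        chunk = data[starts[i]:starts[i + 1]]  # a slice replaces the inner loop
        path = os.path.join(out_dir, base + '_' + str(i + 1) + '.json')
        with open(path, 'w') as outfile:
            json.dump(chunk, outfile)
        paths.append(path)
    return paths
```

With this approach the 2D array isn't needed, since each chunk is written as soon as it is sliced.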
Hope you enjoyed the tutorial! Remember to check out the full script on GitHub: https://github.com/jhsu98/json-splitter. If you have any questions please let me know. Feel free to follow me on Twitter, Instagram, or Medium for updates.