Bottom Line: I wrote a script to go through my CrashPlan log and find out which directories were being backed up most frequently.

I have a local CrashPlan backup that goes to my Raspberry Pi. It could be a little faster, but it generally works pretty well.

A week or so ago, I finally completed a full sync after not having done so in a couple of weeks. The next day, I noticed that I already had a few GB of changes queued up to sync, after relatively light use and no new large files I could think of. I was curious what was going on, so I went searching through my CrashPlan logs.

Unfortunately, just looking at the raw logs didn't tell me much; there are simply too many individual files to wrap my head around. So instead, I wrote a quick script that reads the most recent log of backed-up files and writes out a text file listing each directory and the number of times it was referenced in the backup log, sorted by count. It turned out there were several directories full of frequently modified files that I didn't really need to be backing up at all. I added those directories to CrashPlan's Settings -> Backup -> Filename exclusions and have been pleased with the results.

#!/usr/bin/env python3
'''crashplan_dirs.py
Takes the CrashPlan log and sorts it by the most commonly referenced directories.
As of 20140907 only configured for Mac OS X.
'''
import collections
import logging
import re
import sys

OUTPUT_FILE = 'crashplan_dirs.txt'

logging.basicConfig(
    level=logging.WARNING,
    format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    # filename='crashplan_dirs.log',
    # filemode='a'
)
logger_name = str(__file__) + " :: " + str(__name__)
logger = logging.getLogger(logger_name)

# Location of the most recent log of backed-up files; only the Mac OS X
# path is configured so far.
if sys.platform == 'darwin':
    logfile = '/Library/Logs/CrashPlan/backup_files.log.0'
else:
    logger.error("No CrashPlan log path configured for platform %r.", sys.platform)
    sys.exit(1)

try:
    with open(logfile, 'r') as f:
        lines = f.readlines()
except FileNotFoundError:
    logger.exception("Unable to find the CrashPlan log file. Make sure you "
                     "have the right file set for your system.")
    sys.exit(1)

# Each backed-up file shows up as a log line ending in its absolute path;
# capture the path portion of every matching line.
regex = re.compile(r'^I \d{2}/\d{2}/\d{2} \d{2}:\d{2}[AP]M \d+ \w+ \d (/.*?)$')
paths = [match.group(1) for match in (regex.match(line) for line in lines)
         if match]

# Strip the filename, keeping everything up to and including the last slash.
dirs = [re.match(r'^.*/', path).group(0) for path in paths]

# Count how many times each directory appears and sort with the most
# frequently referenced directories first.
dir_counts = collections.Counter(dirs)
output = sorted(dir_counts.items(), key=lambda x: x[1], reverse=True)

with open(OUTPUT_FILE, 'w') as f:
    f.write('\n'.join('{}: {}'.format(count, directory)
                      for (directory, count) in output))
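To use it, I just run the script from a terminal (python3 crashplan_dirs.py) and open the resulting crashplan_dirs.txt. Each line is a count followed by the directory it refers to, so the noisiest directories float to the top. The entries below are purely illustrative examples of the format, not taken from my actual log:

1843: /Users/me/Library/Application Support/Google/Chrome/Default/
912: /Users/me/Library/Caches/com.example.someapp/
407: /Users/me/Dropbox/.dropbox.cache/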