Figuring out which files are touched while installing software

We are working on getting our infrastructure up to speed for Maemo 5 and Maemo 5 Qt builds. A critical part of building for Maemo 5 is the scratchbox, a toolkit that Nokia uses to make development for their Linux-based phones easier. We have enough Linux build slaves in production that it is impractical to deploy scratchbox by hand on each machine. Scratchbox also downloads packages from the internet, which means that we could get different packages each time we install it. We already have a fairly old version of scratchbox, set up with the Chinook SDK, that we have used for our Maemo builds thus far. Originally I was under the impression that we were going to need two totally separate scratchbox installations, but Doug T. showed me how to upgrade our existing scratchbox 4 installation to scratchbox 5. Thanks, Doug!

My concern with this upgrade, however, was that files outside the /builds/scratchbox directory were going to be touched. I wanted to be thorough, so I did an experiment. I ran find -mount -type f -exec openssl md5 '{}' \; | tee -a /file-list ; find -mount -type f -exec openssl md5 '{}' \; | tee -a /file-list before and after the scratchbox upgrade. The -mount flag, together with a separate run for each of our two mount points, ensured that we didn’t hash things like the /dev, /proc and /sys filesystems. My original intent was to run diff file-list1 file-list2, but that showed me every single file that changed, and I only wanted to know about changes outside of my scratchbox root directory, /builds/scratchbox. My diff was polluted by 77,000 files that resided in the scratchbox root. I figured that the best option at the time was to hack up a quick Python script:

#!/usr/bin/python
#This file is a quick script to process the output of
# find / -mount -type f -exec openssl md5 '{}' \; | tee -a
import sys, os.path, re

if not len(sys.argv) == 3:
    print "purple monkey dishwasher"
    exit(1)
filename_a = sys.argv[1]
filename_b = sys.argv[2]
if not os.path.exists(filename_a) or not os.path.exists(filename_b):
    print "insert change into meter and press green button"
    exit(1)
data={}
pattern = re.compile(r"^MD5\((?P<file>.*)\)= (?P<hash>.*)$")
#Get the data from A
f = open(filename_a, 'r')
for i in f.readlines():
    m = pattern.search(i)
    data[m.group('file')] = m.group('hash')
f.close()
sbfile = re.compile("^/builds/scratchbox") #pattern describing files to ignore
#Figure out diff to B
f = open(filename_b, 'r')
for i in f.readlines():
    m = pattern.search(i)
    if not data.has_key(m.group('file')):
        if not sbfile.search(m.group('file')):
            print 'new file - ', m.group('file')
    else:
        if not sbfile.search(m.group('file')):
            if not data[m.group('file')] == m.group('hash'):
                print 'updated file - ', m.group('file')

This is code I wrote to scratch my own itch, and I am posting it in case it is useful to someone else. If you want to ignore a different directory, change sbfile = re.compile("^/builds/scratchbox") to a pattern describing the path you want to ignore. If you want to find everything that changed across the whole partition, remove sbfile and all of the sbfile checks, so that the final bit of code looks like this:

#Figure out diff to B
f = open(filename_b, 'r')
for i in f.readlines():
    m = pattern.search(i)
    if not data.has_key(m.group('file')):
        print 'new file - ', m.group('file')
    else:
        if not data[m.group('file')] == m.group('hash'):
            print 'updated file - ', m.group('file')
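
For anyone on Python 3, the same comparison can be sketched as a pair of functions. This is a rough sketch and not the script I actually ran: the names load_manifest and compare are made up for illustration, it uses str.startswith instead of a regex for the ignore check, it skips lines that do not look like openssl md5 output, and it also reports files that disappeared, which the script above does not.

```python
import re

# Matches lines of the form "MD5(/path/to/file)= <hex digest>".
LINE = re.compile(r"^MD5\((?P<file>.*)\)= (?P<hash>[0-9a-f]+)$")


def load_manifest(lines):
    """Map each path to its hash, skipping lines that do not parse."""
    hashes = {}
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            hashes[match.group("file")] = match.group("hash")
    return hashes


def compare(before, after, ignore=()):
    """Return (new, updated, removed) path lists, skipping ignored prefixes."""
    ignore = tuple(ignore)

    def kept(path):
        # str.startswith accepts a tuple; an empty tuple never matches.
        return not path.startswith(ignore)

    new = sorted(p for p in after if p not in before and kept(p))
    updated = sorted(p for p in after
                     if p in before and before[p] != after[p] and kept(p))
    removed = sorted(p for p in before if p not in after and kept(p))
    return new, updated, removed
```

Called as compare(before, after, ignore=("/builds/scratchbox/",)), this would list only the changes outside the scratchbox root. The trailing slash keeps a sibling directory whose name merely starts with "scratchbox" from being ignored by accident.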

In the end, I found that the scratchbox upgrade that I did only changed my bash_history and added some tarballs to /tmp. I am very glad that this is the case as it really simplifies our deployment of the new scratchbox!
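
The two file lists do not have to come from find and openssl. Below is a minimal Python 3 sketch that writes the same MD5(path)= hash lines. The names write_manifest and md5_of_file are hypothetical, and comparing st_dev values is my assumed stand-in for find's -mount flag, so treat this as an illustration and not as what was run here.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, out_path):
    """Hash regular files under root that live on root's filesystem.

    Returns the number of files written to out_path.
    """
    root_dev = os.lstat(root).st_dev
    count = 0
    with open(out_path, "w") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            # Roughly like find -mount: do not descend into directories
            # that sit on a different device than the starting point.
            dirnames[:] = [
                d for d in dirnames
                if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
            ]
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Roughly like -type f: skip symlinks, sockets, devices.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                try:
                    line = "MD5(%s)= %s\n" % (path, md5_of_file(path))
                except OSError:
                    # Unreadable or vanished file: skipped silently here.
                    continue
                out.write(line)
                count += 1
    return count
```

The output file should live outside the tree being scanned (or on another filesystem), otherwise it ends up hashing itself while it is still being written.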

4 Responses to Figuring out which files are touched while installing software

  1. Why not just

    diff -U 0 file-list1 file-list2 | grep -vF /builds/scratchbox/

    (or similar) to display just those differences that don’t contain the string /builds/scratchbox/ ?

  2. Wouldn’t it have been easier to just move /builds/scratchbox into a different filesystem? (In which case your find command would automatically skip it – and therefore not hash anything there.) Presumably you were going to clobber the existing installs anyway (and therefore needed a tarball of a new root).

    But then, it probably was all fast enough for you anyway ;)

    (I wonder whether m.group('file').startswith("/builds/scratchbox/") would be faster or slower? On the one hand, it’s not re; on the other, it doesn’t get a precompilation advantage…)

  3. Nip it in the bud; use -prune. Something like:

    find -mount -type f -exec openssl md5 '{}' \; -wholename '/builds/scratchbox' -prune

  4. Interesting insight on that code snippet. I would also have to agree: why not just move the scratchbox into a new filesystem? It just seems more efficient to do it Mook’s way.
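
The suggestions above about skipping the scratchbox tree at scan time can also be sketched in Python 3. With find itself, the -wholename ... -prune test generally has to come before the -exec action (joined with -o) to take effect. The walk_files helper below is a hypothetical name, not part of any tool mentioned here. It relies on the documented os.walk behaviour that editing dirnames in place stops the walk from descending into the removed entries.

```python
import os


def walk_files(root, prune=()):
    """Yield file paths under root, never descending into pruned dirs."""
    pruned = {os.path.normpath(p) for p in prune}
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place edit: os.walk will not enter directories removed here,
        # so nothing below a pruned directory is ever visited or hashed.
        dirnames[:] = [
            d for d in dirnames
            if os.path.normpath(os.path.join(dirpath, d)) not in pruned
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Because pruned directories are never entered, the cost of hashing tens of thousands of files under the ignored tree is avoided entirely instead of being filtered out afterwards.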
