Monday, March 14, 2011

Filesystem Traversing with Python

Lately I have accumulated a lot of files, most of which are papers in PDF format. Although they are sorted into folders, it is still hard to track down a single file, or even to check whether I have already collected a given paper. So I needed to do a simple filesystem traversal and record four basic things about each document: directory, file name, file size, and last-modified date. With a summary file containing that information, I can do a basic search, which is at least faster than going through the folders and digging for files by hand.

I also wanted the results to satisfy two requirements: (1) only output information for a file if it is in pdf, ppt, or doc format; (2) report the file size in a meaningful format. I chose Python for this little task. Here is the code I used:


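import os
import time, stat
from datetime import datetime

def sizeof_fmt(num):
    # Divide by 1024 until the value fits the current unit.
    for x in ['bytes','KB','MB','GB','TB']:
        if num < 1024.0:
            return "%3.1f%s" % (num, x)
        num /= 1024.0
    # Anything of 1024 TB or more is reported in PB.
    return "%3.1f%s" % (num, 'PB')

# The with block closes output.txt even if an error occurs.
with open('output.txt', 'w') as f:
    # Raw string so the backslash in the Windows path is taken literally.
    for root, dirs, files in os.walk(r'E:\my_papers'):
        for file in files:
            # Keep only pdf, ppt, and doc files.
            if file.split('.')[-1] in ('pdf','ppt','doc'):
                st=os.stat(os.path.join(root,file))
                sz=st[stat.ST_SIZE]
                # ctime gives e.g. 'Mon Mar 14 10:15:00 2011'; keep only the date.
                tm=time.ctime(st[stat.ST_MTIME])
                tm_tmp=datetime.strptime(tm, '%a %b %d %H:%M:%S %Y')
                tm=tm_tmp.strftime('%Y-%m-%d')
                sz2=sizeof_fmt(sz)
                strg=root+'\t'+file+'\t'+tm +'\t' + sz2 +'\n'
                f.write(strg)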


The modules that are used here are "os", "time", "stat" and "datetime".

First there is a user-defined function that returns the file size in human-readable format. I found this interesting function here.
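To make the behaviour concrete, here is a small sketch of how a function like this works: it keeps dividing by 1024 until the value fits the current unit. The name human_size and the 'PB' fallback for very large values are my own additions for illustration, not part of the function I found.

```python
def human_size(num_bytes):
    """Return a size in bytes as a short human-readable string."""
    size = float(num_bytes)
    for unit in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if size < 1024.0:
            return "%3.1f%s" % (size, unit)
        size /= 1024.0
    # Fallback for 1024 TB or more (an assumption added for this sketch).
    return "%3.1f%s" % (size, 'PB')

print(human_size(512))           # 512.0bytes
print(human_size(1536))          # 1.5KB
print(human_size(5 * 1024**2))   # 5.0MB
```

Without the final return, a value of 1024 TB or more would fall out of the loop and the function would return None.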

Next, os.walk walks through the given directory tree and, for each directory it visits, yields the directory path together with the lists of subdirectories and files inside it. For each file, the format is obtained by splitting the file name on '.' and taking the last piece. If the format meets my requirement, the file's size in bytes and its last-modified time are collected. Unfortunately the time comes back as a long string, part of which is not relevant. So I created a "datetime" object "tm_tmp" using "datetime.strptime()", and then created a string "tm" that keeps only the year, month, and day of the file. Next the size function is called and the human-readable file size is returned. Finally the directory, file name, date, and size are written to the output file, separated by tabs.
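For comparison, here is a sketch of the same traversal written a little differently. It is not the code I used above: it assumes Python 3, takes the extension with os.path.splitext instead of splitting on '.', and builds the date directly with datetime.fromtimestamp, so the detour through a time string is not needed. The function name list_documents and its extensions parameter are made up for this example.

```python
import os
from datetime import datetime

def list_documents(top, extensions=('.pdf', '.ppt', '.doc')):
    """Yield (directory, file name, modified date, size in bytes) for matching files."""
    for root, dirs, files in os.walk(top):
        for name in files:
            # splitext('report.PDF') gives ('report', '.PDF'); lower() makes
            # the comparison case-insensitive.
            ext = os.path.splitext(name)[1].lower()
            if ext not in extensions:
                continue
            st = os.stat(os.path.join(root, name))
            modified = datetime.fromtimestamp(st.st_mtime).strftime('%Y-%m-%d')
            yield root, name, modified, st.st_size

if __name__ == '__main__':
    # Same folder as in the post; os.walk yields nothing if it does not exist.
    for row in list_documents(r'E:\my_papers'):
        print('\t'.join(str(field) for field in row))
```

One side effect of splitext is that a file simply named "pdf", with no extension, is no longer matched by accident.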
