Thursday, March 31, 2011

A few things I learnt in 2010

2010 was an "interesting" year. I write down the lessons that I learned and felt the most.

1. "Be confident and assertive."
Fine, call yourself "data analyst", "data scientist" or even "data ninja". Whatever you call yourself really does not matter. It matters that as a professional personnel, you work in an environment that requires you (and hopefully you would like also) to share your findings and convince others to see the potential value of your analysis the way you see them. At the end of the day, you are a consultant. And as a consultant, being confident and assertive about your expertise and findings help "tons" in the process.

2.Ask a lot of questions. Especially ask the "dumb" ones at the beginning. Getting clear on concepts and terms as early as possible helps reducing confusion and shortens project development time.

3.Follow up. Sometime people forget what they said or asked for a few days ago. Reasonably constant and consistent contact is the key to keep projects running and in track! Plus, it makes your clients feel needed, which theoretically is a good feeling. :P

Monday, March 14, 2011

Filesystem Traversing with Python

Lately I have accumulated a lot of files, most of which are papers in pdf format. Although they are put into folders properly, it's still hard to track down a single file, or even to check whether I have already collected that paper or not. So I needed to do a simple filesystem traversing and figure out 4 basic things about my documents: directory, file name, size of file, and last modified date. With a summary file that containing those information, I can do some kind of basic search, which is at least faster than my going through folders and digging files.

Also I wanted my results to satisfy two requirements (1) only output file information, if that file is in pdf, ppt, or doc format; (2) the size of file should be in meaningful format. I chose Python to finish this little task. Here is the code I used


import os
import time, stat
from datetime import datetime

def sizeof_fmt(num):
for x in ['bytes','KB','MB','GB','TB']:
if num < 1024.0:
return "%3.1f%s" % (num, x)
num /= 1024.0

f = open('output.txt', 'w')

for root, dirs, files in os.walk('E:\my_papers'):
for file in files:
if file.split('.')[-1] in ('pdf','ppt','doc'):
st=os.stat(os.path.join(root,file))
sz=st[stat.ST_SIZE]
tm=time.ctime(st[stat.ST_MTIME])
tm_tmp=datetime.strptime(tm, '%a %b %d %H:%M:%S %Y')
tm=tm_tmp.strftime('%Y-%m-%d')
sz2=sizeof_fmt(sz)
strg=root+'\t'+file+'\t'+tm +'\t' + sz2 +'\n'
f.write(strg)


The modules that are used here are "os", "time", "stat" and "datetime".

First there is a user-defined function that returns the file size in human readable format. I found this interesting function here .

Next os.walk function will walk through the given directory and stop until it finds files. Then for each file, the format is obtained by splitting the file names by '.'. Once I have the file format meets my requirement, its size in bytes format and its last modified-time is collected. Unfortunately the time is a very long string, some of which is not relevant at all. So I created a "datetime" object "tm_tmp" using "datetime.striptime()", and then created a string "tm" that only keeps year, month and date information of the file. Next the size function is called and the human readable file size is returned. Finally the directory, filename, time and size information are written to file.