Archive for July, 2011

Detect Duplicate Image using Python

July 30, 2011 1 comment

To detect exact duplicate images, use MD5 hashing (Python).

To find exact duplicates, a dictionary stores the hash value of each image. If an image's hash value is already in the dictionary, the image currently being processed is marked as the duplicate, and the image that was read in first is the ‘Representative Image’.

The function findDupImgMd5 returns the list of duplicated images and the hash results (the image ids that fall into each hash bin).

The main program writes the list of duplicated images and the hash results to text files. It also copies each duplicated image, together with its Representative Image, into the DupImg folder.


import os, glob
import hashlib
import shutil
from PIL import Image  # Pillow

def Imge_md5hash(im_file):
    # Hash the decoded pixel data, so only pixel-identical images match.
    im = Image.open(im_file)
    return hashlib.md5(im.tobytes()).hexdigest()

def findDupImgMd5(data_path):
    deletionList, hashdict = [], {}
    for infile in glob.glob(data_path + os.sep + "*.jpg"):
        fileNP, ext = os.path.splitext(infile)
        ids = fileNP.split(os.sep)[-1]  # file name without extension
        hash_result = Imge_md5hash(infile)
        if hash_result in hashdict:
            # Hash seen before: this image duplicates the representative one.
            deletionList.append(ids)
        hashdict.setdefault(hash_result, []).append(ids)
    return deletionList, hashdict

if __name__=="__main__":

    folderPath = r"your-image-data-folder"
    deletionList, hashdict = findDupImgMd5(folderPath)

    print("Start to save the data")

    with open("dupImageList.txt", 'w') as ofile:
        for item in deletionList:
            ofile.write(item + "\n")

    with open("hashDupImageDict.txt", 'w') as ofile:
        for k, v in hashdict.items():
            ofile.write(k + "\t" + "\t".join(v) + "\n")

    # Copy the duplicated images to the Dup-folder, renamed with
    # hash-code and original id
    DesFolder = os.path.join(folderPath, "DupImg")

    if not os.path.exists(DesFolder):
        os.makedirs(DesFolder)

    for k, v in hashdict.items():
        if len(v) > 1:
            for vi in v:
                srcP = os.path.join(folderPath, vi + ".jpg")
                dstP = os.path.join(DesFolder, k + "_" + vi + ".jpg")
                shutil.copy(srcP, dstP)
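
The hash-bin idea can be tried without any image library. The sketch below is only an illustration under a different assumption from the listing above: it hashes the raw file bytes instead of the decoded pixels, so two files with identical pixels but different metadata would not match. The names md5_of_file, group_by_hash, split_duplicates and demo are hypothetical helpers, not part of the original program.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def md5_of_file(path, chunk_size=65536):
    """MD5 of the raw file bytes, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def group_by_hash(paths):
    """Map each hash value to the list of files that fall into that bin."""
    bins = defaultdict(list)
    for path in sorted(paths):
        bins[md5_of_file(path)].append(path)
    return dict(bins)

def split_duplicates(bins):
    """First file of each bin is the representative, the rest are duplicates."""
    representatives = [members[0] for members in bins.values()]
    duplicates = [p for members in bins.values() for p in members[1:]]
    return representatives, duplicates

def demo():
    """Build three small files (two identical) and classify them."""
    with tempfile.TemporaryDirectory() as tmp:
        contents = {"a.jpg": b"same bytes", "b.jpg": b"same bytes", "c.jpg": b"other"}
        paths = []
        for name, data in contents.items():
            path = os.path.join(tmp, name)
            with open(path, "wb") as fh:
                fh.write(data)
            paths.append(path)
        reps, dups = split_duplicates(group_by_hash(paths))
        return ([os.path.basename(p) for p in reps],
                [os.path.basename(p) for p in dups])

print(demo())  # (['a.jpg', 'c.jpg'], ['b.jpg'])
```

Hashing file bytes is cheaper than decoding every image, but it is stricter; which of the two is the better notion of "exact duplicate" depends on the data set.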


Python Parse Housing Information from redfin

July 29, 2011

Two simple functions for reading the primary housing information from a Redfin listing page: the first searches the raw HTML as a string, the second parses it with BeautifulSoup. Both expect markup like this:
<td class="property_detail_label left_column">$/Sq. Ft.:</td>
<td class="property_detail_value right_column"> $216     </td>
<h1 class="adr">
        <span id="address_line_1" class="street-address"> 5160 N 1ST St </span>
        <span id="address_line_2">
        <span class="locality"> San Jose</span>,
        <span class="region">CA</span>
        <span class="postal-code">95002</span>

import urllib.request
from bs4 import BeautifulSoup
import pprint

def read_refinpage(urlStr):
    # Version 1: plain string search over the raw HTML.
    fileHandle = urllib.request.urlopen(urlStr)
    page = fileHandle.read().decode("utf-8", "replace")

    adict = {}
    adict['URL'] = urlStr

    start = 0
    count = 0
    class_pattern = '<td class="property_detail_label left_column">'
    value_pattern = '<td class="property_detail_value right_column">'

    while (start < len(page)) and (count<15):
        inx1_a = page.find(class_pattern, start)
        if inx1_a == -1:
            break  # no more label cells
        inx1_b = page.find('</td>', inx1_a)
        subj = page[inx1_a+len(class_pattern):inx1_b]
        subj = subj.replace(":", "")
        subj = subj.strip()

        inx2_a = page.find(value_pattern, inx1_b)
        inx2_b= page.find('</td>', inx2_a)
        value_temp = page[inx2_a+len(value_pattern):inx2_b]

        value = value_temp.replace('\t',"").replace('\n',"").strip()

        start = inx2_b
        count += 1
        # Assumption: the branch on "County" is incomplete in the source;
        # here the County row is skipped and every other pair is stored.
        if "County" not in subj:
            adict[subj] = value

    return adict

def beat_read_refin(urlStr):
    # Version 2: the same fields, parsed with BeautifulSoup.
    fileHandle = urllib.request.urlopen(urlStr)
    page = fileHandle.read()
    soup = BeautifulSoup(page, "html.parser")

    adict = {}
    adict['URL'] = urlStr

    adict['Address'] = str(soup.find_all('span', {"class": "street-address"})[0].contents[0].strip())
    adict['Locality'] = str(soup.find_all('span', {"class": "locality"})[0].contents[0].strip())
    adict['Region'] = str(soup.find_all('span', {"class": "region"})[0].contents[0].strip())
    adict['Zip'] = str(soup.find_all('span', {"class": "postal-code"})[0].contents[0].strip())

    table = soup.find('table', id="property_basic_details")
    rows = table.find_all('tr')
    for tr in rows:
        subj = tr.find_all('td', {"class":"property_detail_label left_column"})[0].find(string=True)
        subj = subj.replace(":", "").strip()
        value = tr.find_all('td', {"class":"property_detail_value right_column"})[0].find(string=True)
        value = value.replace('\t',"").replace('\n',"").strip()
        # Same assumption as in read_refinpage: skip the County row.
        if "County" not in subj:
            adict[str(subj)] = str(value)

    return adict

if __name__=="__main__":
    urllist = ['']  # the listing URLs go here

    result_list = [None] * len(urllist)
    for i, url in enumerate(urllist):
        result_list[i] = beat_read_refin(url)

    # Assumption: the body of this loop is missing in the source;
    # printing the string-search result is one plausible use.
    for url in urllist:
        pprint.pprint(read_refinpage(url))

    with open('housing_result.txt', "w") as ofile:
        for res in result_list:
            for k, v in res.items():
                row = k + "\t" + v + "\n"
                ofile.write(row)

Categories: Programming, Python

Speed in Python

July 27, 2011

Some notes on optimizing running time in Python.

Avoid building a string by repeated concatenation:

 out = "<html>" + head + prologue + query + tail + "</html>"

Instead, use

 out = "<html>%s%s%s%s</html>" % (head, prologue, query, tail)

Even better, for readability (this has nothing to do with efficiency other than yours as a programmer), use dictionary substitution:

 out = "<html>%(head)s%(prologue)s%(query)s%(tail)s</html>" % locals()
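
Whether one form is actually faster depends on the interpreter version, so it is worth measuring rather than assuming. A minimal sketch with the standard timeit module follows; the sample strings and the function names are made up for illustration, and an explicit dictionary stands in for locals() because the variables here are module-level.

```python
import timeit

head, prologue, query, tail = "<head></head>", "<p>intro</p>", "<p>query</p>", "<p>bye</p>"

def concat():
    return "<html>" + head + prologue + query + tail + "</html>"

def percent_tuple():
    return "<html>%s%s%s%s</html>" % (head, prologue, query, tail)

def percent_dict():
    names = {"head": head, "prologue": prologue, "query": query, "tail": tail}
    return "<html>%(head)s%(prologue)s%(query)s%(tail)s</html>" % names

def time_all(number=100000):
    """Return the total seconds each variant takes for `number` calls."""
    return {fn.__name__: timeit.timeit(fn, number=number)
            for fn in (concat, percent_tuple, percent_dict)}

for name, seconds in time_all().items():
    print("%-14s %.4f s" % (name, seconds))
```

All three variants build the same string; only the timings differ, and those vary from machine to machine.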



Some fun Python modules:

– pycallgraph
a Python module that creates call graphs for Python programs.
– profile
a standard-library profiler that outputs a table of running-time information for each function called.
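
As a small illustration of the profile table, the sketch below profiles one function and captures the report as a string. slow_sum and profile_call are hypothetical names used only for this example; cProfile offers the same interface with less overhead and could be swapped in.

```python
import io
import profile
import pstats

def slow_sum(n):
    """A deliberately plain loop, so there is something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_call(func, *args):
    """Run func(*args) under the profiler; return (result, report text)."""
    profiler = profile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats()
    return result, buf.getvalue()

result, report = profile_call(slow_sum, 20000)
print(report)
```

The report lists call counts and cumulative times per function, which is usually enough to see where a script spends its time.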

NetworkX: tools for analyzing network data
PyMC: Bayesian/MCMC modeling
SimPy: “Simulation in Python”, a tool set for designing experiments
SymPy: library for symbolic mathematics
html5lib: web data parser
Pycluster: clustering algorithms (efficient implementations of hierarchical and k-means clustering)
cjson: fast JSON (JavaScript Object Notation) encoder/decoder for Python
Pyevolve: a complete pure-Python genetic algorithm framework
MySQL for Python
RPy2: simple Python interface for R, able to execute any R function from within a Python script
PuLP: optimization
Categories: Programming, Python

The Python Standard Library By Example

July 26, 2011



Full information can be found at PyHowTO.

Categories: Python

Change Root Password Ubuntu

July 23, 2011 1 comment

Change the root password

$ sudo su

#  passwd

Then the system will ask you to input your new password.

Categories: Ubuntu

Install Java-plugin for Ubuntu 11.04

July 23, 2011 2 comments

Install the Java plugin for Firefox on Ubuntu 11.04

For a long time the Java plugin has not been running well in my browser.

Here’s how to solve it:

– Open “System” –> “Administration” –> “Software Sources”

– On tab “Other software”, Add

deb maverick partner

– In your command line:

sudo apt-get install sun-java6-plugin

sudo update-alternatives --install /usr/lib/mozilla/plugins/ /usr/lib/jvm/java-6-sun/jre/lib/i386/ 1


Categories: Software, Ubuntu

Weekly Fun Software Collection

July 22, 2011

Several Open Source Packages, which are fun~~