Archive for December, 2012


December 27, 2012 1 comment

Some things only happen when you have a large amount of data. I came across this LinkedIn discussion and found one of the linked posts very interesting, as it mentions the advantages that a large data set brings to ‘scene completion’ in computer vision. One of its conclusions is: “for nearest neighbor type minimization problems with a non-negative distance function (meaning that the cost function has a lower bound of zero), that distance function will, on average, decrease monotonically with data or sample size.”
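The quoted claim can be checked with a small simulation. This is only a sketch under assumptions that are mine, not the linked post's: one-dimensional points drawn uniformly from [0, 1], absolute difference as the distance, and hypothetical helper names.

```python
import random


def nearest_neighbor_distance(query, samples):
    """Smallest absolute distance from the query to any sample (never negative)."""
    return min(abs(query - s) for s in samples)


def mean_nn_distance(sample_size, trials=500, seed=0):
    """Average nearest-neighbor distance over many random data sets of one size."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        samples = [rng.random() for _ in range(sample_size)]
        query = rng.random()
        total += nearest_neighbor_distance(query, samples)
    return total / trials


sizes = [10, 100, 1000]
averages = [mean_nn_distance(n) for n in sizes]
for n, avg in zip(sizes, averages):
    print(f"n={n:5d}  mean nearest-neighbor distance={avg:.5f}")
```

For a fixed query, adding more samples can never increase the minimum, which is why the average shrinks as the data set grows.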

It’s also interesting that as we modify and improve approaches and models and make them more complicated, they may not always perform as well as simple models that rely on the “law of large numbers.”



Correlation & Causation

December 26, 2012 1 comment

Correlation does not necessarily imply causation. There could be multiple reasons for a ‘correlation’:

– Causality: X causes Y

– Reverse Causality: Y causes X

– Simultaneity: X causes Y, and Y causes X.

– Endogeneity: W causes Y, and X is correlated with W.

– Spuriousness: No causation, just a fluke.


Determining causation (from Wikipedia)

David Hume argued that causality is based on experience, and experience similarly based on the assumption that the future models the past, which in turn can only be based on experience – leading to circular logic. In conclusion, he asserted that causality is not based on actual reasoning: only correlation can actually be perceived.[14]

In order for a correlation to be established as causal, the cause and the effect must be connected through an impact mechanism in accordance with known laws of nature.

Intuitively, causation seems to require not just a correlation, but a counterfactual dependence. Suppose that a student performed poorly on a test and guesses that the cause was his not studying. To prove this, one thinks of the counterfactual – the same student writing the same test under the same circumstances but having studied the night before. If one could rewind history, and change only one small thing (making the student study for the exam), then causation could be observed (by comparing version 1 to version 2). Because one cannot rewind history and replay events after making small controlled changes, causation can only be inferred, never exactly known. This is referred to as the Fundamental Problem of Causal Inference – it is impossible to directly observe causal effects.[15]

A major goal of scientific experiments and statistical methods is to approximate as best as possible the counterfactual state of the world.[16] For example, one could run an experiment on identical twins who were known to consistently get the same grades on their tests. One twin is sent to study for six hours while the other is sent to the amusement park. If their test scores suddenly diverged by a large degree, this would be strong evidence that studying (or going to the amusement park) had a causal effect on test scores. In this case, correlation between studying and test scores would almost certainly imply causation.

Well-designed experimental studies replace equality of individuals as in the previous example by equality of groups. This is achieved by randomization of the subjects to two or more groups. Although not a perfect system, the likeliness of being equal in all aspects rises with the number of subjects placed randomly in the treatment/placebo groups. From the significance of the difference of the effect of the treatment vs. the placebo, one can conclude the likeliness of the treatment having a causal effect on the disease. This likeliness can be quantified in statistical terms by the P-value.

When experimental studies are impossible and only pre-existing data are available, as is usually the case for example in economics, regression analysis can be used. Factors other than the potential causative variable of interest are controlled for by including them as regressors in addition to the regressor representing the variable of interest. False inferences of causation due to reverse causation (or wrong estimates of the magnitude of causation due to the presence of bidirectional causation) can be avoided by using explanators (regressors) that are necessarily exogenous, such as physical explanators like rainfall amount (as a determinant of, say, futures prices), lagged variables whose values were determined before the dependent variable’s value was determined, instrumental variables for the explanators (chosen based on their known exogeneity), etc. See Causality#Statistics and Economics. Spurious correlation due to mutual influence from a third, common, causative variable, is harder to avoid: the model must be specified such that there is a theoretical reason to believe that no such underlying causative variable has been omitted from the model; in particular, underlying time trends of both the dependent variable and the independent (potentially causative) variable must be controlled for by including time as another independent variable.
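To make the idea of "controlling for" a common cause concrete, here is a minimal simulation. It is a sketch, not part of the Wikipedia text. The data are synthetic and all names are hypothetical: W drives both X and Y, and X has no causal effect on Y at all. Instead of a full multiple regression, it partials W out of both X and Y and regresses the residuals, which gives the same coefficient on X as including W as an extra regressor (the Frisch-Waugh-Lovell result).

```python
import random


def slope_and_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    slope, intercept = slope_and_intercept(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


rng = random.Random(42)
n = 5000
w = [rng.gauss(0, 1) for _ in range(n)]        # hypothetical common cause
x = [wi + rng.gauss(0, 1) for wi in w]         # X is correlated with W
y = [2.0 * wi + rng.gauss(0, 1) for wi in w]   # Y depends on W only, not on X

# Naive regression of Y on X picks up a clearly non-zero slope.
naive_slope, _ = slope_and_intercept(x, y)

# Controlling for W: regress the W-residual of Y on the W-residual of X.
controlled_slope, _ = slope_and_intercept(residuals(w, x), residuals(w, y))

print(f"naive slope of Y on X:      {naive_slope:.3f}")
print(f"slope after controlling W:  {controlled_slope:.3f}")
```

The naive slope is far from zero even though X never enters the equation for Y. Once W is controlled for, the estimated effect of X is close to zero. This only works because W is observed; an omitted common cause cannot be partialled out this way.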



Categories: Data Mining, e-commerce

machine learning by MR

December 26, 2012 Leave a comment

Holidays are always a good time to slow down and reflect. So this time I’m reading somewhat at random, not to solve any problem at hand, but just to enjoy the fruits of other researchers’ work. Here are some readings on MapReduce and machine learning.

Many machine learning algorithms fit Kearnsʼ Statistical Query Model:
Linear regression, k-means, Naive Bayes, SVM, EM, PCA, backprop. These can all be written (exactly) in summation form, which leads to a linear speedup in the number of processors.
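As an illustration of the summation form, here is a minimal single-machine sketch for one-feature linear regression. It uses Python's built-in map and functools.reduce in place of a real MapReduce cluster, and the function names and toy data are hypothetical. Each "mapper" computes partial sums over its shard, the "reducer" adds them up, and the fit is solved from the totals.

```python
from functools import reduce


def map_partial_sums(shard):
    """Map step: the sums one worker needs from its shard of (x, y) pairs."""
    n = len(shard)
    sum_x = sum(x for x, _ in shard)
    sum_y = sum(y for _, y in shard)
    sum_xx = sum(x * x for x, _ in shard)
    sum_xy = sum(x * y for x, y in shard)
    return (n, sum_x, sum_y, sum_xx, sum_xy)


def reduce_sums(a, b):
    """Reduce step: partial sums from two workers simply add up."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(shards):
    """Least-squares line y = intercept + slope * x from the combined sums."""
    n, sum_x, sum_y, sum_xx, sum_xy = reduce(
        reduce_sums, map(map_partial_sums, shards)
    )
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Toy data lying exactly on y = 3x + 1, split across three "workers".
points = [(x, 3 * x + 1) for x in range(12)]
shards = [points[i::3] for i in range(3)]

slope, intercept = fit_line(shards)
print(slope, intercept)
```

Because the sums are additive, the result does not depend on how the data are split across workers. That independence is what lets the map step run in parallel.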
Some papers on MapReduce:
– Map-Reduce for Machine Learning on Multicore
– MapReduce: Distributed Computing for Machine Learning, 2006
– Large Language Models in Machine Translation
– Fast, easy, and cheap: construction of statistical machine translation models with MapReduce.
– Parallel implementations of word alignment tool
– Inducing Gazetteers for Named Entity Recognition by Large-scale Clustering of Dependency Relations
– Pairwise document similarity in Large Collections with MapReduce
– Aligning needles in a Haystack: Paraphrase acquisition across the web
– Google news personalization: scalable online collaborative filtering (assign users to clusters, and assign weights to stories based on the ratings of the users in that cluster)




December 18, 2012 Leave a comment

A very nice blog post that draws attention to ‘Similarity Measurement’:

The "Putnam Program"

Similarity appears to be a notoriously inflationary concept.

Already in 1979 a presumably even incomplete catalog of similarity measures in information retrieval listed almost 70 ways to determine similarity [1]. In contemporary philosophy, however, it is almost absent as a concept, probably because it is considered merely as a minor technical aspect of empiric activities. Often it is also related to naive realism, which claimed a similarity between a physical reality and concepts. Similarity is also a central topic in cognitive psychology, yet not often discussed, probably for the same reasons as in philosophy.

In both disciplines, understanding is usually equated with drawing conclusions. Since the business of drawing conclusions and describing the kinds and surrounds of that is considered to be the subject of logic (as a discipline), it is comprehensible that logic has been rated by many practitioners and theoreticians alike as the master discipline. While there is a…


Categories: Data Mining

Online auction related articles

December 10, 2012 Leave a comment

Relationship between starting price and auction outcome

Ariely and Simonsohn (2003), Haubl and Popkowski Leszczyc (2003) – find positive effect
Kamins, Dreze and Folkes (2004), Ku, Galinsky and Murnighan (2005), Simonsohn and Ariely (2008) – find negative effect
Lucking-Reiley, Prasad, and Reeves (2007) – find no effect

Nonrational herding (Simonsohn and Ariely 2008) – bidders favor auctions with more bids, even though these extra bids arise from a low starting price and not from higher unobserved quality
Irrational limited attention (Malmendier and Lee 2011) – bidders ignore conspicuous fixed-price options
Einav, Kuchler, Levin, and Sundaresan (2012) – the starting-price test assumes competing auctions of identical items that differ only in starting price. In practice, this starting-price variation is hard to find.

Riley and Samuelson (1981), Virag (2010), Adams (2010) – potential problem if sellers set the starting price as a function of demand

Another approach in the empirical field literature: compare auction ending prices to contemporaneous fixed prices
Malmendier and Lee (2011) – compare eBay auction prices to contemporaneous eBay BIN prices
Jones (2011) – compares eBay auction prices for gift cards to the face value of the gift cards.

A Book on text processing in Python

December 10, 2012 Leave a comment
Categories: NLP, Python