Pig Error for string to long

October 19, 2016

I got the following error message:

Error: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long

The original Pig script was:


score = FOREACH score_account_kpi_avro
GENERATE FLATTEN(STRSPLIT(uid, ':')) as (account_id:chararray,
date_sk:chararray, index:long), (double)predictedLabel, (double)predictedProb; -- (xxxxxxxx,2016-09-30,221905,221905.0,221905.6822910905)

Up to this stage, if you dump some examples, everything looks fine. But if you proceed to join with other data or compute something, you will get the error "java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long", and it can be hard to tell why.

What is happening here is that when you do the split, you cannot cast one of the split entries to long directly in the schema (index:long). The right way is to declare it as chararray and cast it in the downstream processing, for example:

score = FOREACH score GENERATE account_id AS account_id, (double)index as index,
(double)predictedLabel as predictedLabel, (double)predictedProb as predictedProb;
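
The same idea can be illustrated outside Pig. The following is only a minimal Python sketch, not the Pig implementation; the helper name `parse_uid` and the sample uid are hypothetical. It shows that every piece produced by a split is still a string, so the numeric field has to be converted explicitly in a later step.

```python
def parse_uid(uid):
    """Split an 'account:date:index' string and cast the index explicitly.

    The field layout mirrors the uid in the post; the function name and
    the sample values used with it are assumptions for illustration.
    """
    account_id, date_sk, index = uid.split(":")
    # After the split, all three pieces are strings. Declaring that
    # "index is a number" is not enough; it must be converted.
    return account_id, date_sk, int(index)


if __name__ == "__main__":
    print(parse_uid("acct1:2016-09-30:221905"))
```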

Installing mxnet

December 23, 2015

I wanted to install the newly released deep learning package "mxnet" on my Mac. Here are the build instructions: http://mxnet.readthedocs.org/en/latest/build.html#building-on-osx

The build mostly went fine, but I did hit a few problems, including some linking errors.

One was with 'libtbb.dylib': the linker kept complaining that it couldn't find the library, but when I checked, it was in the right folder, `/usr/local/lib` (which is actually a soft link to "/usr/local/Cellar/tbb/4.4-20150728/lib/"). The problem was actually a wrong flag in opencv.pc. So I opened "/usr/local/lib/pkgconfig/opencv.pc" (which provides the meta-information for pkg-config) and changed -llibtbb.dylib to -ltbb.
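
If you would rather script that edit than do it by hand, here is a minimal Python sketch. The function names are hypothetical and the sample `Libs:` line in the usage note is made up for illustration; your opencv.pc may look different, so inspect the file (and keep a backup) before overwriting it.

```python
def fix_tbb_flag(pc_text):
    """Return pkg-config text with the bad TBB linker flag replaced."""
    # "-llibtbb.dylib" makes the linker look for "liblibtbb.dylib...";
    # "-ltbb" is the flag that resolves to libtbb.dylib.
    return pc_text.replace("-llibtbb.dylib", "-ltbb")


def patch_pc_file(path):
    """Rewrite the .pc file at `path` in place; return True if it changed."""
    with open(path) as f:
        original = f.read()
    fixed = fix_tbb_flag(original)
    if fixed != original:
        with open(path, "w") as f:
            f.write(fixed)
    return fixed != original
```

For example, `fix_tbb_flag("Libs: -L/usr/local/lib -llibtbb.dylib")` yields the same line with `-ltbb` in place of the bad flag, and `patch_pc_file("/usr/local/lib/pkgconfig/opencv.pc")` would apply it to the file named in the post.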

I also got a few other linking errors for libJPEG.dylib, libtiff.dylib and libpng.dylib. What I found is that they point to libraries such as "/usr/local/Cellar/jpeg/8d/lib/libjpeg.dylib" or "/usr/local/Cellar/libtiff/4.0.6/lib/libtiff.dylib", but these do not seem to be the ones expected.

To fix this:

# Create the locate database if it does not exist. This may take a while, so be patient.
sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist

# Use locate to find the actual library, for example:
locate libJPEG.dylib

# Suppose the path returned by the command above is abspath_to_lib. If the library already exists in /usr/local/lib, remove it first.
ln -s abspath_to_lib /usr/local/lib/libJPEG.dylib
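
The remove-then-link step can also be written as a small script. This is a sketch under assumptions: `link_library` is a hypothetical helper, and the directory and file names you pass in are your own (for the case above, the link directory would be /usr/local/lib and the source would be whatever `locate` returned).

```python
import os


def link_library(actual_path, link_dir, link_name):
    """Create (or replace) a symlink link_dir/link_name -> actual_path."""
    link_path = os.path.join(link_dir, link_name)
    # lexists is True even for a dangling symlink, which exists() would miss.
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(actual_path, link_path)
    return link_path
```

Because an existing entry is removed first, running it twice with the same arguments is harmless.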

Now you can run one MNIST example with `python example/image-classification/train_mnist.py`. It should display the following results:

[Screenshot: Screen Shot 2015-12-23 at 11.20.01 AM.png]


Using spark-shell

December 23, 2015

As a new learner of Spark/Scala, I have found spark-shell very useful for debugging. Sometimes it feels just like the IPython shell. Here are a few tricks for using it:

0. Running `./spark-shell -h` prints a lot of help information.

1. Load an external file in spark-shell:
spark-shell -i file.scala
or, from inside the shell:
scala> :load your_path_to.scala

2. Remember that when you start the shell, the SparkContext (sc) and the SQLContext (sqlContext) have already been created. If you are not in the spark shell, remember to create them in your program.

3. You can import multiple things like this: scala> import org.apache.spark.{SparkContext, SparkConf}

4. You can use `spark-shell --jars your.jar` to load a single-jar Spark module at startup, and then you will be able to `import something_from_your_jar` from the library you just added.

5. If you install Spark locally, you can open its web UI (port 4040) for validation purposes: http://localhost:4040/environment/

6. To reuse what you have entered into the spark-shell, you can extract your input from the shell history, which is kept in a file called ".spark_history" in the user's home directory. For example: `tail -n 5 ~/.spark_history > mySession1.scala`. Next time, you can use (1) to reload your saved Scala session. Within a shell session, you can check the history with `scala> :history`.

7. A library called scalaplot can help you to do some visual investigation.

8. Use `SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell` to print the launch command of the Spark scripts.

9. In spark-shell, `:paste -raw` lets you enter any valid Scala code, even including package declarations.
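
Tip 6 can be sketched as a small helper for when you want it inside a script rather than as a `tail` one-liner. This is a minimal Python sketch; the function name `save_last_lines` and its default of 5 lines are assumptions for illustration, and the paths are whatever you pass in.

```python
from collections import deque


def save_last_lines(history_path, out_path, n=5):
    """Copy the last n lines of history_path into out_path.

    Behaves like `tail -n N history > out`; returns the number of lines written.
    """
    with open(history_path) as src:
        # A bounded deque keeps only the final n lines while streaming the file.
        last = deque(src, maxlen=n)
    with open(out_path, "w") as dst:
        dst.writelines(last)
    return len(last)
```

The saved file can then be reloaded with `:load` as in tip 1.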

P.S. To install Spark on your Mac, you can simply use Homebrew:
$ brew update
$ brew install scala
$ brew install sbt
$ echo 'SBT_OPTS="-XX:+CMSClassUnloadingEnabled -XX:PermSize=256M -XX:MaxPermSize=512M -Xmx2G"' >> ~/.sbtconfig
$ brew install apache-spark

After the installation, you can update your PATH variable to include the path to spark/bin.

You can also set up pyspark locally; here are some instructions: https://documentation.altiscale.com/using-spark-with-ipython

One short but nice Scala book

Categories: Scala, Spark

A Few Python-Based Deep Learning Libs

June 23, 2015

Lasagne: a lightweight Theano extension; Theano can still be used explicitly.

Keras: a minimalist, highly modular neural network library in the spirit of Torch, written in Python, that uses Theano under the hood for fast tensor manipulation on GPU and CPU. It was developed with a focus on enabling fast experimentation.

Pylearn2: a wrapper for Theano, configured with YAML, experiment-oriented.

Caffe: a CNN-oriented deep learning framework written in C++, with a Python wrapper and easy model definitions using prototxt.

Theano: general GPU math.

nolearn: a probably even simpler one

You can find more here.

Lasagne and nolearn are still under rapid development, so they change a lot. Be careful with the installed versions; they need to match each other. If you run into problems such as "cost must be a scalar", you can refer to the link here and solve it by uninstalling and reinstalling them:

pip uninstall Lasagne
pip uninstall nolearn
pip install -r https://raw.githubusercontent.com/dnouri/kfkd-tutorial/master/requirements.txt

Forward to the past

June 19, 2015

I was listening to Hinton's interview (on CBC Radio: http://nvda.ly/OioP3). He mentioned multiple times a possible breakthrough in natural language understanding using deep learning. It is certainly true that human reasoning is a difficult thing to model, as it is too complex to be abstracted easily. As I watch my little boy grow, I am amazed every time he shows a new ability: the ability to do something, or the ability to understand or perceive something. Training my own models (on images instead), I have started to gain more understanding of them. Structure determines function. In most cases, the training is more like a process of "trial and error". It's a big black box with complex structures and connections. One of the biggest advantages of such a learning network is its ability to learn representations automatically, that is, to abstract things. With abstraction in our logical system, we are able to organize things, dissect things, compose things, and possibly create new things. Given what networks can already see and imagine (http://goo.gl/A1sL8N), it is likely that a few years from now a network trained on human language could help us translate languages that went extinct thousands of years ago, simply by reading their scripts over and over. This would be wonderful, because so many ancient civilizations would start to shine again. Maybe I should call this "Forward to the Past".

Remote access ipython notebooks

February 18, 2015

Original post: https://coderwall.com/p/ohk6cg/remote-access-to-ipython-notebooks-via-ssh

remote$ ipython notebook --no-browser --port=8889

local$ ssh -N -f -L localhost:8888:localhost:8889 remote_user@remote_host

To close the SSH tunnel on the local machine, look for the process and kill it manually:

local_user@local_host$ ps aux | grep localhost:8889
local_user 18418  0.0  0.0  41488   684 ?        Ss   17:27   0:00 ssh -N -f -L localhost:8888:localhost:8889 remote_user@remote_host
local_user 18424  0.0  0.0  11572   932 pts/6    S+   17:27   0:00 grep localhost:8889

local_user@local_host$ kill -15 18418

Alternatively, you can start the tunnel without the -f option. The process will then remain in the foreground and can be killed with ctrl-c.

On the remote machine, kill the IPython server with ctrl-c ctrl-c.
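
If you open this tunnel often, the command line can be assembled by a small helper. The following is a minimal Python sketch; `tunnel_command` is a hypothetical name, and the default ports simply mirror the ones used above. It only builds the argument list and does not run anything itself.

```python
def tunnel_command(remote, local_port=8888, remote_port=8889, background=True):
    """Build the ssh argument list for the port-forwarding tunnel.

    remote is a 'user@host' string; background=True adds -f, as in the
    command above, while background=False keeps ssh in the foreground.
    """
    cmd = ["ssh", "-N"]
    if background:
        cmd.append("-f")
    forward = "localhost:{}:localhost:{}".format(local_port, remote_port)
    cmd += ["-L", forward, remote]
    return cmd


if __name__ == "__main__":
    print(" ".join(tunnel_command("remote_user@remote_host")))
```

The returned list can be passed to `subprocess.run`; with `background=False` the tunnel stays attached to the calling process and ends when that process is interrupted.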

Note: If you are running GPU & Theano on your remote machine, you can launch the notebook by:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 ipython notebook --no-browser --port=8889

Another simple way is to do the following (adding --ip=*):

# On the remote server

$ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 ipython notebook --no-browser --ip=* --port=7777

Then you can reach the notebook at http://the-ip-address-of-your-remote-server:7777/

A few things when using Eclipse

January 13, 2015

Workspace is locked.

If Eclipse says:

"Could not launch the product because the associated workspace is currently in use by another Eclipse application." or "Workspace in use or cannot be created, choose a different one."

Make sure no other Eclipse instance is actually using the workspace, then just delete the .lock file in the .metadata directory of your Eclipse workspace directory.
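
The same cleanup as a script, for when the workspace lives somewhere awkward to reach by hand. This is a minimal Python sketch; `remove_workspace_lock` is a hypothetical helper name, and the workspace path is whatever you pass in. Only run it when Eclipse is closed.

```python
import os


def remove_workspace_lock(workspace_dir):
    """Delete <workspace_dir>/.metadata/.lock if present.

    Returns True if a lock file was removed, False if there was none.
    """
    lock_path = os.path.join(workspace_dir, ".metadata", ".lock")
    if os.path.isfile(lock_path):
        os.remove(lock_path)
        return True
    return False
```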

Install Eclipse IDE and Java/C++ development tools on Ubuntu 12.04 LTS Precise Pangolin using the command line

Original link: http://www.inforbiro.com/blog-eng/ubuntu-12-04-eclipse-installation/
1) Open a terminal and enter the command:
sudo apt-get install eclipse-platform
2) After Eclipse is installed, you can install development plugins based on your needs, e.g.:
To install the Java Development Tools (JDT) package for Eclipse:
sudo apt-get install eclipse-jdt
To install the C/C++ Development Tools (CDT) packages for Eclipse:
sudo apt-get install eclipse-cdt

Replace tabs with spaces in Eclipse CDT:

Original from here.
For CDT: go to Window -> Preferences -> C/C++ -> Code Style -> Formatter -> New (create a new profile, because the built-in profile cannot be changed) -> MyProfile (choose a name for the profile) -> Indentation, Tab Policy -> Spaces only

Categories: Tools