
Howto: AngularJS, Node.js App deployed to Heroku

Dan's AngularJS Learning Club (DALC)

Stock Market Predictions via MADlib Logistic Regression

Split Date Adjustments of Stock Market Data For Machine Learning

Stock Market Backtest with PostgreSQL 9.2 and LibSVM

Video Conferencing via Gmail

Join bikleTech!

Install Cygwin X-Windows and OpenSSH on Windows8

VirtualBox, Vagrant, Linux on Windows8

Linux 101

gem install pg: debugged

Ubuntu 12.04.3 Libraries needed by Ruby 2.0 and Rails 4.x

Selenium Can Test H2O Web UI

Understand Confusion Matrix via SQL

Call Java HelloWorld from JRuby

Deploy 0xdata H2O 10 Node Cluster on Hadoop on EC2

0xdata H2O on Linux Laptop

Stock Market/MADlib Linear Regression

Node.js For Rails on CentOS 6.4

Heroku Toolbelt

NOT IN to Left Outer Join

Install MADlib on Postgres 9.2 on CentOS 6.4

Install Postgres 9.2.4 on Ubuntu 12.04.2

Access Windows 8 BIOS on HP ENVY TS 15 Notebook PC

Howto: AngularJS, Node.js App deployed to Heroku







I started this effort by installing Ubuntu 12.04 Linux on my laptop. I describe the installation at the URL listed below:

VirtualBox, Vagrant, Linux on Windows8

Then, I installed a command-line tool named the Heroku Toolbelt. The company, heroku.com, now offers a Debian package I can use to install the Heroku Toolbelt. I was running Ubuntu, but I knew that many Debian packages install fine on Ubuntu, so I tried it out. The install went smoothly:

https://toolbelt.heroku.com/debian

The above package installs Heroku client, Foreman, and Git.
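
For reference, the whole install can be driven from the shell. The one-line installer below is what the toolbelt page documented for Ubuntu at the time; treat the exact script URL as an assumption and confirm it against the page above:

# Assumed one-line installer documented at toolbelt.heroku.com:
wget -qO- https://toolbelt.heroku.com/install-ubuntu.sh | sh

# Confirm the three pieces the package provides:
heroku version
foreman --version
git --version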

Next, I used Git to get a copy of a popular AngularJS application used for training future AngularJS developers:

cd ~
git clone https://github.com/angular/angular-phonecat.git dancat2

Then, I added a simple file to be used by Foreman on Heroku:

echo 'web: node scripts/web-server.js' > ~/dancat2/Procfile
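
As an aside, Foreman (installed with the toolbelt) can run the same Procfile locally, which is a quick way to confirm the web process starts at all (this assumes node is installed locally):

cd ~/dancat2
foreman start
# Foreman reads the Procfile and launches: node scripts/web-server.js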

Next, I enhanced a file used by both Heroku and Node.js named ~/dancat2/package.json:


{
  "engines": {
    "node": "0.10.22"
  },
  "dependencies": {
    "logfmt": "~0.17.0",
    "express": "~3.4.6"
  },
  "name": "angularjs-seed",
  "description": "A starter project for AngularJS",
  "repository": "https://github.com/angular/angular-seed",
  "devDependencies": {
    "phantomjs": "~1.9",
    "karma": "~0.10.4",
    "karma-junit-reporter": "~0.1",
    "karma-jasmine": "~0.1.3",
    "karma-ng-scenario": "~0.1",
    "karma-chrome-launcher": "~0.1.0",
    "karma-firefox-launcher": "~0.1.0",
    "karma-html2js-preprocessor": "~0.1.0",
    "karma-requirejs": "~0.1.0",
    "karma-coffee-preprocessor": "~0.1.0",
    "karma-phantomjs-launcher": "~0.1.0",
    "karma-script-launcher": "~0.1.0"
  }
}

Then, I committed those two changes to the dancat2 git repo:

cd ~/dancat2
git add .
git commit -am dancat2-moving-towards-heroku

Next, I created an application on Heroku:

cd ~/dancat2
heroku auth:login
heroku create dancat2
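
One side effect of heroku create worth noting: it adds a git remote named heroku to the repo, and the push in the next step relies on that remote. A quick check:

cd ~/dancat2
git remote -v
# The heroku remote should point at something like git@heroku.com:dancat2.git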

Then, I deployed my local dancat2 repo into the dancat2 application on Heroku:


vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ git push heroku master
git push heroku master
Initializing repository, done.
Counting objects: 2109, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (885/885), done.
Writing objects: 100% (2109/2109), 8.83 MiB | 47 KiB/s, done.
Total 2109 (delta 1152), reused 2030 (delta 1103)

-----> Node.js app detected
-----> Requested node range:  0.10.22
-----> Resolved node version: 0.10.22
-----> Downloading and installing node
-----> Installing dependencies
       npm http GET https://registry.npmjs.org/express
       npm http GET https://registry.npmjs.org/logfmt
       npm http 200 https://registry.npmjs.org/logfmt
       npm http GET https://registry.npmjs.org/logfmt/-/logfmt-0.17.0.tgz
       npm http 200 https://registry.npmjs.org/express
       npm http GET https://registry.npmjs.org/express/-/express-3.4.7.tgz
       npm http 200 https://registry.npmjs.org/logfmt/-/logfmt-0.17.0.tgz
       npm http 200 https://registry.npmjs.org/express/-/express-3.4.7.tgz
       npm http GET https://registry.npmjs.org/connect/2.12.0
       npm http GET https://registry.npmjs.org/commander/1.3.2
       npm http GET https://registry.npmjs.org/range-parser/0.0.4
       npm http GET https://registry.npmjs.org/mkdirp/0.3.5
       npm http GET https://registry.npmjs.org/cookie/0.1.0
       npm http GET https://registry.npmjs.org/buffer-crc32/0.2.1
       npm http GET https://registry.npmjs.org/fresh/0.2.0
       npm http GET https://registry.npmjs.org/methods/0.1.0
       npm http GET https://registry.npmjs.org/send/0.1.4
       npm http GET https://registry.npmjs.org/cookie-signature/1.0.1
       npm http GET https://registry.npmjs.org/merge-descriptors/0.0.1
       npm http GET https://registry.npmjs.org/debug
       npm http GET https://registry.npmjs.org/readable-stream
       npm http GET https://registry.npmjs.org/split
       npm http GET https://registry.npmjs.org/through
       npm http 200 https://registry.npmjs.org/range-parser/0.0.4
       npm http GET https://registry.npmjs.org/range-parser/-/range-parser-0.0.4.tgz
       npm http 200 https://registry.npmjs.org/connect/2.12.0
       npm http 200 https://registry.npmjs.org/commander/1.3.2
       npm http 200 https://registry.npmjs.org/cookie/0.1.0
       npm http 200 https://registry.npmjs.org/mkdirp/0.3.5
       npm http GET https://registry.npmjs.org/connect/-/connect-2.12.0.tgz
       npm http GET https://registry.npmjs.org/commander/-/commander-1.3.2.tgz
       npm http GET https://registry.npmjs.org/cookie/-/cookie-0.1.0.tgz
       npm http GET https://registry.npmjs.org/mkdirp/-/mkdirp-0.3.5.tgz
       npm http 200 https://registry.npmjs.org/methods/0.1.0
       npm http GET https://registry.npmjs.org/methods/-/methods-0.1.0.tgz
       npm http 200 https://registry.npmjs.org/buffer-crc32/0.2.1
       npm http 200 https://registry.npmjs.org/fresh/0.2.0
       npm http 200 https://registry.npmjs.org/send/0.1.4
       npm http GET https://registry.npmjs.org/buffer-crc32/-/buffer-crc32-0.2.1.tgz
       npm http GET https://registry.npmjs.org/fresh/-/fresh-0.2.0.tgz
       npm http 200 https://registry.npmjs.org/cookie-signature/1.0.1
       npm http GET https://registry.npmjs.org/send/-/send-0.1.4.tgz
       npm http GET https://registry.npmjs.org/cookie-signature/-/cookie-signature-1.0.1.tgz
       npm http 200 https://registry.npmjs.org/merge-descriptors/0.0.1
       npm http GET https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-0.0.1.tgz
       npm http 200 https://registry.npmjs.org/debug
       npm http GET https://registry.npmjs.org/debug/-/debug-0.7.4.tgz
       npm http 200 https://registry.npmjs.org/split
       npm http GET https://registry.npmjs.org/split/-/split-0.2.10.tgz
       npm http 200 https://registry.npmjs.org/range-parser/-/range-parser-0.0.4.tgz
       npm http 200 https://registry.npmjs.org/through
       npm http GET https://registry.npmjs.org/through/-/through-2.3.4.tgz
       npm http 200 https://registry.npmjs.org/mkdirp/-/mkdirp-0.3.5.tgz
       npm http 200 https://registry.npmjs.org/connect/-/connect-2.12.0.tgz
       npm http 200 https://registry.npmjs.org/commander/-/commander-1.3.2.tgz
       npm http 200 https://registry.npmjs.org/cookie/-/cookie-0.1.0.tgz
       npm http 200 https://registry.npmjs.org/methods/-/methods-0.1.0.tgz
       npm http 200 https://registry.npmjs.org/buffer-crc32/-/buffer-crc32-0.2.1.tgz
       npm http 200 https://registry.npmjs.org/fresh/-/fresh-0.2.0.tgz
       npm http 200 https://registry.npmjs.org/send/-/send-0.1.4.tgz
       npm http 200 https://registry.npmjs.org/cookie-signature/-/cookie-signature-1.0.1.tgz
       npm http 200 https://registry.npmjs.org/merge-descriptors/-/merge-descriptors-0.0.1.tgz
       npm http 200 https://registry.npmjs.org/debug/-/debug-0.7.4.tgz
       npm http 200 https://registry.npmjs.org/split/-/split-0.2.10.tgz
       npm http 200 https://registry.npmjs.org/through/-/through-2.3.4.tgz
       npm http 200 https://registry.npmjs.org/readable-stream
       npm http GET https://registry.npmjs.org/readable-stream/-/readable-stream-1.0.17.tgz
       npm http 200 https://registry.npmjs.org/readable-stream/-/readable-stream-1.0.17.tgz
       npm http GET https://registry.npmjs.org/mime
       npm http GET https://registry.npmjs.org/keypress
       npm http GET https://registry.npmjs.org/batch/0.5.0
       npm http GET https://registry.npmjs.org/qs/0.6.6
       npm http GET https://registry.npmjs.org/pause/0.0.1
       npm http GET https://registry.npmjs.org/bytes/0.2.1
       npm http GET https://registry.npmjs.org/uid2/0.0.3
       npm http GET https://registry.npmjs.org/raw-body/1.1.2
       npm http GET https://registry.npmjs.org/negotiator/0.3.0
       npm http GET https://registry.npmjs.org/multiparty/2.2.0
       npm http 200 https://registry.npmjs.org/keypress
       npm http GET https://registry.npmjs.org/keypress/-/keypress-0.1.0.tgz
       npm http 200 https://registry.npmjs.org/mime
       npm http GET https://registry.npmjs.org/mime/-/mime-1.2.11.tgz
       npm http 200 https://registry.npmjs.org/uid2/0.0.3
       npm http 200 https://registry.npmjs.org/bytes/0.2.1
       npm http GET https://registry.npmjs.org/uid2/-/uid2-0.0.3.tgz
       npm http GET https://registry.npmjs.org/bytes/-/bytes-0.2.1.tgz
       npm http 200 https://registry.npmjs.org/raw-body/1.1.2
       npm http 200 https://registry.npmjs.org/negotiator/0.3.0
       npm http GET https://registry.npmjs.org/raw-body/-/raw-body-1.1.2.tgz
       npm http GET https://registry.npmjs.org/negotiator/-/negotiator-0.3.0.tgz
       npm http 200 https://registry.npmjs.org/pause/0.0.1
       npm http 200 https://registry.npmjs.org/multiparty/2.2.0
       npm http 200 https://registry.npmjs.org/batch/0.5.0
       npm http GET https://registry.npmjs.org/pause/-/pause-0.0.1.tgz
       npm http GET https://registry.npmjs.org/multiparty/-/multiparty-2.2.0.tgz
       npm http GET https://registry.npmjs.org/batch/-/batch-0.5.0.tgz
       npm http 200 https://registry.npmjs.org/qs/0.6.6
       npm http GET https://registry.npmjs.org/qs/-/qs-0.6.6.tgz
       npm http 200 https://registry.npmjs.org/keypress/-/keypress-0.1.0.tgz
       npm http 200 https://registry.npmjs.org/mime/-/mime-1.2.11.tgz
       npm http 200 https://registry.npmjs.org/negotiator/-/negotiator-0.3.0.tgz
       npm http 200 https://registry.npmjs.org/pause/-/pause-0.0.1.tgz
       npm http 200 https://registry.npmjs.org/bytes/-/bytes-0.2.1.tgz
       npm http 200 https://registry.npmjs.org/uid2/-/uid2-0.0.3.tgz
       npm http 200 https://registry.npmjs.org/raw-body/-/raw-body-1.1.2.tgz
       npm http 200 https://registry.npmjs.org/multiparty/-/multiparty-2.2.0.tgz
       npm http 200 https://registry.npmjs.org/qs/-/qs-0.6.6.tgz
       npm http 200 https://registry.npmjs.org/batch/-/batch-0.5.0.tgz
       npm http GET https://registry.npmjs.org/stream-counter
       npm http GET https://registry.npmjs.org/readable-stream/-/readable-stream-1.1.9.tgz
       npm http 200 https://registry.npmjs.org/readable-stream/-/readable-stream-1.1.9.tgz
       npm http 200 https://registry.npmjs.org/stream-counter
       npm http GET https://registry.npmjs.org/stream-counter/-/stream-counter-0.2.0.tgz
       npm http 200 https://registry.npmjs.org/stream-counter/-/stream-counter-0.2.0.tgz
       npm http GET https://registry.npmjs.org/core-util-is
       npm http GET https://registry.npmjs.org/debuglog/0.0.2
       npm http 200 https://registry.npmjs.org/debuglog/0.0.2
       npm http 200 https://registry.npmjs.org/core-util-is
       npm http GET https://registry.npmjs.org/debuglog/-/debuglog-0.0.2.tgz
       npm http GET https://registry.npmjs.org/core-util-is/-/core-util-is-1.0.0.tgz
       npm http 200 https://registry.npmjs.org/core-util-is/-/core-util-is-1.0.0.tgz
       npm http 200 https://registry.npmjs.org/debuglog/-/debuglog-0.0.2.tgz
       logfmt@0.17.0 node_modules/logfmt
        split@0.2.10
        through@2.3.4
       readable-stream@1.0.17
       
       express@3.4.7 node_modules/express
        methods@0.1.0
        merge-descriptors@0.0.1
        range-parser@0.0.4
        cookie-signature@1.0.1
        fresh@0.2.0
        debug@0.7.4
        buffer-crc32@0.2.1
        cookie@0.1.0
        mkdirp@0.3.5
        commander@1.3.2 (keypress@0.1.0)
        send@0.1.4 (mime@1.2.11)
       connect@2.12.0 (uid2@0.0.3, pause@0.0.1, qs@0.6.6, bytes@0.2.1, raw-body@1.1.2, batch@0.5.0, negotiator@0.3.0, multiparty@2.2.0)
-----> Caching node_modules directory for future builds
-----> Cleaning up node-gyp and npm artifacts
-----> Building runtime environment
"1d4c9990-dcce-4f29-9906-fc5b3ba0a2bc"-----> Discovering process types
       Procfile declares types -> web

-----> Compressing... done, 11.6MB
-----> Launching... done, v3
       http://dancat2.herokuapp.com deployed to Heroku

To git@heroku.com:dancat2.git
 * [new branch]      master -> master
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 

Next, I started a web dyno on Heroku:

heroku ps
heroku ps:scale web=1
heroku ps

Then, I loaded dancat2.herokuapp.com into my browser.

I saw an error page rather than the working app, so I used the local heroku utility to inspect the remote Heroku logs:


vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ heroku logs
heroku logs
2013-12-13T22:35:17.005858+00:00 heroku[api]: Enable Logplex by bikle@bikle.com
2013-12-13T22:35:17.030032+00:00 heroku[api]: Release v2 created by bikle@bikle.com
2013-12-13T22:41:25+00:00 heroku[slug-compiler]: Slug compilation started
2013-12-13T22:41:43.726637+00:00 heroku[api]: Scale to web=1 by bikle@bikle.com
2013-12-13T22:41:43.757077+00:00 heroku[api]: Deploy f84fbab by bikle@bikle.com
2013-12-13T22:41:43.783295+00:00 heroku[api]: Release v3 created by bikle@bikle.com
2013-12-13T22:41:43+00:00 heroku[slug-compiler]: Slug compilation finished
2013-12-13T22:41:46.368880+00:00 heroku[web.1]: Starting process with command `node scripts/web-server.js`
2013-12-13T22:41:47.076134+00:00 app[web.1]: Http Server running at http://localhost:8000/
2013-12-13T22:41:47.728440+00:00 heroku[web.1]: Starting process with command `node scripts/web-server.js`
2013-12-13T22:41:48.988574+00:00 app[web.1]: Http Server running at http://localhost:8000/
2013-12-13T22:42:46.807753+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch
2013-12-13T22:42:46.807922+00:00 heroku[web.1]: Stopping process with SIGKILL
2013-12-13T22:42:47.873382+00:00 heroku[web.1]: Process exited with status 137
2013-12-13T22:42:47.891379+00:00 heroku[web.1]: State changed from crashed to starting
2013-12-13T22:42:47.891379+00:00 heroku[web.1]: State changed from starting to crashed
2013-12-13T22:42:47.933175+00:00 heroku[web.1]: Stopping process with SIGKILL
2013-12-13T22:42:47.932942+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch
2013-12-13T22:42:49.113318+00:00 heroku[web.1]: Process exited with status 137
2013-12-13T22:42:49.930144+00:00 heroku[web.1]: Starting process with command `node scripts/web-server.js`
2013-12-13T22:42:50.852794+00:00 app[web.1]: Http Server running at http://localhost:8000/
2013-12-13T22:43:50.301816+00:00 heroku[web.1]: Stopping process with SIGKILL
2013-12-13T22:43:50.301599+00:00 heroku[web.1]: Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch
2013-12-13T22:43:51.644974+00:00 heroku[web.1]: State changed from starting to crashed
2013-12-13T22:43:51.629994+00:00 heroku[web.1]: Process exited with status 137
2013-12-13T22:50:05.010005+00:00 heroku[router]: at=error code=H10 desc="App crashed" method=GET path=/ host=dancat2.herokuapp.com fwd="75.63.17.216" dyno= connect= service= status=503 bytes=
2013-12-13T22:50:05.600056+00:00 heroku[router]: at=error code=H10 desc="App crashed" method=GET path=/favicon.ico host=dancat2.herokuapp.com fwd="75.63.17.216" dyno= connect= service= status=503 bytes=
2013-12-13T22:50:05.708417+00:00 heroku[router]: at=error code=H10 desc="App crashed" method=GET path=/favicon.ico host=dancat2.herokuapp.com fwd="75.63.17.216" dyno= connect= service= status=503 bytes=
2013-12-13T22:51:56.114223+00:00 heroku[router]: at=error code=H10 desc="App crashed" method=GET path=/ host=dancat2.herokuapp.com fwd="75.63.17.216" dyno= connect= service= status=503 bytes=
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 

I Google-searched for this string:

Error R10 (Boot timeout) -> Web process failed to bind to $PORT within 60 seconds of launch

I found an answer:

http://stackoverflow.com/questions/15693192/...

I enhanced ~/dancat2/scripts/web-server.js:

Before:


function main(argv) {
  new HttpServer({
    'GET': createServlet(StaticServlet),
    'HEAD': createServlet(StaticServlet)
  }).start(Number(argv[2]) || DEFAULT_PORT);
}

After:


function main(argv) {
  new HttpServer({
    'GET': createServlet(StaticServlet),
    'HEAD': createServlet(StaticServlet)
  }).start(process.env.PORT || Number(argv[2]) || DEFAULT_PORT);
}
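
The point of the change is that Heroku assigns the listening port through the PORT environment variable, and the web dyno must bind to it. A quick local check of the same code path, assuming node is installed locally (the port number 5000 is arbitrary):

cd ~/dancat2
PORT=5000 node scripts/web-server.js &
sleep 1
curl -I http://localhost:5000/app/index.html   # expect an HTTP 200 here
kill %1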

I committed and pushed again:


vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ git add .
vagrant@precise64:~/dancat2$ git commit -m process.env.PORT-helpful-on-heroku
[master a1a3b5c] process.env.PORT-helpful-on-heroku
 1 file changed, 1 insertion(+), 1 deletion(-)
vagrant@precise64:~/dancat2$ 

vagrant@precise64:~/dancat2$ git push heroku master
git push heroku master
Fetching repository, done.
Counting objects: 7, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 390 bytes, done.
Total 4 (delta 3), reused 0 (delta 0)

-----> Node.js app detected
-----> Requested node range:  0.10.22
-----> Resolved node version: 0.10.22
-----> Downloading and installing node
-----> Restoring node_modules directory from cache
-----> Pruning cached dependencies not specified in package.json
-----> Installing dependencies
-----> Caching node_modules directory for future builds
-----> Cleaning up node-gyp and npm artifacts
-----> Building runtime environment
"b750ada0-9e3d-41fa-a1e0-80a9e810e9e2"-----> Discovering process types
       Procfile declares types -> web

-----> Compressing... done, 11.6MB
-----> Launching... done, v4
       http://dancat2.herokuapp.com deployed to Heroku

To git@heroku.com:dancat2.git
   f84fbab..a1a3b5c  master -> master
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 
vagrant@precise64:~/dancat2$ 

I tested the URL with my browser and it worked. Yay!

http://dancat2.herokuapp.com/app/index.html





Stock Market Predictions via MADlib Logistic Regression



This tech-tip describes a technique for applying logistic regression software to the task of predicting the three-day direction of the stock market. A general discussion about logistic regression can be found on Wikipedia:

http://en.wikipedia.org/wiki/Logistic_regression

In statistics class I learned about an idea called 'Mean Reversion' (MR).

https://www.google.com/search?q=reversion+to+mean

Two obvious questions I have are: does MR affect the stock market? And, if so, is it strong enough to be lucrative? I used stock price data from Yahoo and SQL to answer these questions. I downloaded the data into a table named ydata. I summarize the data below:


SELECT COUNT(tkr) FROM ydata;

 count  
--------
 242772


SELECT tkr
, MIN(ydate),COUNT(tkr),MAX(ydate) 
, MIN(closing_price_orig),MAX(closing_price_orig) 
FROM ydata 
GROUP BY tkr ORDER BY tkr
;

 tkr |    min     | count |    max     |  min  |  max   
-----+------------+-------+------------+-------+--------
 BA  | 1962-01-02 | 13069 | 2013-11-29 | 11.63 | 180.25
 CAT | 1962-01-02 | 13069 | 2013-11-29 | 22.17 | 120.75
 CVX | 1970-01-02 | 11082 | 2013-11-29 | 20.37 | 127.76
 DD  | 1962-01-02 | 13068 | 2013-11-29 | 16.14 | 293.75
 DIS | 1962-01-02 | 13069 | 2013-11-29 | 13.77 | 244.00
 ED  | 1970-01-02 | 11082 | 2013-11-29 |  6.13 |  64.94
 GE  | 1962-01-02 | 13069 | 2013-11-29 |  6.66 | 166.00
 HON | 1970-01-02 | 11082 | 2013-11-29 | 16.12 |  93.69
 HPQ | 1962-01-02 | 13069 | 2013-11-29 | 11.16 | 156.00
 IBM | 1962-01-02 | 13069 | 2013-11-29 | 41.00 | 649.00
 JNJ | 1970-01-02 | 11082 | 2013-11-29 | 28.25 | 177.25
 KO  | 1962-01-02 | 13069 | 2013-11-29 | 28.87 | 155.75
 MCD | 1970-01-02 | 11082 | 2013-11-29 | 12.38 | 122.50
 MMM | 1970-01-02 | 11082 | 2013-11-29 | 41.83 | 158.00
 MO  | 1970-01-02 | 11082 | 2013-11-29 | 14.45 | 180.00
 MRK | 1970-01-02 | 11082 | 2013-11-29 | 20.99 | 217.50
 MRO | 1970-01-02 | 11076 | 2013-11-29 |  5.75 | 132.51
 NAV | 1970-01-02 | 11076 | 2013-11-29 |  1.75 |  78.99
 PG  | 1970-01-02 | 11082 | 2013-11-29 | 40.25 | 153.50
 SPY | 1993-01-29 |  5249 | 2013-11-29 | 43.41 | 181.12
 XOM | 1970-01-02 | 11082 | 2013-11-29 | 25.00 | 109.37
(21 rows)

I have over 242,000 rows of daily closing prices for 21 well-known tickers.

Some of the prices go back to 1962.

After I removed stock-split discontinuities from ydata, I created five calculations of percentage gain of each ticker for each day.

I name these calculations ng1, ng2, ng3, ng4, and ng5.

For example ng3 is the three day percentage gain divided by 100.

I call ng3 'normalized 3 day gain'.
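
Here is a minimal sketch of how an ng3-style value could be computed with a window function. It is only an illustration against the ydata table; the production calculations live in the scripts linked further below:

# Illustration only: the three day percentage gain divided by 100,
# i.e. the fractional gain over the next 3 trading days.
psql<<EOF
SELECT
tkr
,ydate
,(LEAD(closing_price,3) OVER (PARTITION BY tkr ORDER BY ydate)
 - closing_price) / closing_price AS ng3
FROM ydata
ORDER BY tkr,ydate
;
EOF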

I asked, what is the average value of ng1,2,3,4,5?

I answered via SQL:


SELECT 
AVG(ng3) avg_ng3
FROM tv20st
;

          avg_ng3           
----------------------------
 0.001419360793550421545907

On average, over the past many years, a ticker gained 0.14% every 3 days.

This is equivalent to a gain of $1.40 on $1,000 worth of stock.

Given 10 or 20 years, that can add up.

How about the S&P 500?


SELECT 
AVG(ng3) avg_ng3
FROM tv20st
WHERE tkr = 'SPY'
;

          avg_ng3           
----------------------------
 0.001002232292755480081485

On average, going back to 1993, the S&P 500 gained 0.10% every 3 days.

I keep in mind that most mutual funds and money managers return less than the S&P 500.

Returning my attention to MR, I asked, "When a ticker price moves a large amount over a short duration, does it usually Revert to Mean?"

To answer this, I collected more calculations.

For each ticker on every day I collected measurements of previous-day price movements (a sketch of one such calculation follows this list):

  • 2 Day Moving Average Slope
  • 3 Day Moving Average Slope
  • 4 Day Moving Average Slope
  • 5 Day Moving Average Slope
  • 9 Day Moving Average Slope
  • 20 Day Moving Average Slope
  • 40 Day Moving Average Slope
  • 80 Day Moving Average Slope
  • 200 Day Moving Average Slope
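
Here is a minimal sketch of one such measurement, a 9 Day Moving Average Slope. The intermediate names ma09 and ma09s are assumptions made for illustration; the real feature-building SQL lives in the scripts linked further below:

# Illustration only: a 9 day moving average per ticker, and its
# day-over-day change divided by the average as a normalized slope.
psql<<EOF
SELECT
tkr
,ydate
,(ma09 - LAG(ma09) OVER (PARTITION BY tkr ORDER BY ydate)) / ma09 AS ma09s
FROM
(
  SELECT
  tkr
  ,ydate
  ,AVG(closing_price) OVER
    (PARTITION BY tkr ORDER BY ydate ROWS 8 PRECEDING) AS ma09
  FROM ydata
) subq
ORDER BY tkr,ydate
;
EOF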

Next I wrote an English question which is easily translated to SQL.

English:

Select
The average ng1, ng2, ng3, ng4, ng5
From my data
Where the 9 Day Moving Average Slope
Is extremely negative

SQL:


SELECT 
COUNT(tkr)
,ROUND( AVG(ng1), 4) avg_ng1
,ROUND( AVG(ng2), 4) avg_ng2
,ROUND( AVG(ng3), 4) avg_ng3
,ROUND( AVG(ng4), 4) avg_ng4
,ROUND( AVG(ng5), 4) avg_ng5
FROM tv14
WHERE ma09s < (SELECT -1.5*STDDEV(ma09s) FROM tv14)
;

 count | avg_ng1 | avg_ng2 | avg_ng3 | avg_ng4 | avg_ng5 
-------+---------+---------+---------+---------+---------
 12137 |  0.0022 |  0.0041 |  0.0055 |  0.0066 |  0.0074

The result of the above query shows that when the 9 Day Moving Average Slope was extremely negative, avg_ng3 was 0.0055.

This value of 0.0055 is about 3.9 times greater than the average of 0.0014 displayed above.

I saw this as solid evidence that MR affects the stock market when the 9 Day Moving Average Slope is extremely negative.

I display results for extremely positive 9 Day Moving Average Slopes:

SELECT 
COUNT(tkr)
,ROUND( AVG(ng1), 4) avg_ng1
,ROUND( AVG(ng2), 4) avg_ng2
,ROUND( AVG(ng3), 4) avg_ng3
,ROUND( AVG(ng4), 4) avg_ng4
,ROUND( AVG(ng5), 4) avg_ng5
FROM tv14
WHERE ma09s > (SELECT 1.5*STDDEV(ma09s) FROM tv14)
;
 count | avg_ng1 | avg_ng2 | avg_ng3 | avg_ng4 | avg_ng5 
-------+---------+---------+---------+---------+---------
 13487 | -0.0006 | -0.0008 | -0.0006 | -0.0003 |  0.0002

I also saw this as solid evidence that MR affects the stock market, but the behavior is muted.

Now, I have the answer to my second question posed at the top of this post:

MR is probably only lucrative when the market makes strong moves DOWNWARD over short durations.

And I have an observation for a specific, lucrative situation:

If the market has a 9 Day Moving Average Slope
which is below -1.5 times the
Standard Deviation of the 9 Day Moving Average Slope
(an extreme downward move),
then the average percentage gain over the next 3 days is 0.55%,
which is about 3.9 times greater than the average of 0.14%.

Obviously I have other ways to measure downward movement of the market, 5 Day Moving Average Slope for example:


SELECT 
COUNT(tkr)
,ROUND( AVG(ng1), 4) avg_ng1
,ROUND( AVG(ng2), 4) avg_ng2
,ROUND( AVG(ng3), 4) avg_ng3
,ROUND( AVG(ng4), 4) avg_ng4
,ROUND( AVG(ng5), 4) avg_ng5
FROM tv14
WHERE ma05s < -(SELECT 1.5*STDDEV(ma05s) FROM tv14)
;

 count | avg_ng1 | avg_ng2 | avg_ng3 | avg_ng4 | avg_ng5 
-------+---------+---------+---------+---------+---------
 12290 |  0.0024 |  0.0046 |  0.0063 |  0.0078 |  0.0090

It seems that 5 day moving average slope is even more predictive than 9 day moving average slope.

I asked: is there a systematic way to find an optimal downward-movement measurement, one which signals lucrative 3-day holding periods ahead?

The general answer is yes; I build vectors from ng3, slopes of moving averages, and other price-action measurements, and then feed the vectors to machine learning software to predict the sign of ng3.

So, this tech-tip is about using the Logistic Regression module in MADlib Machine Learning software to predict future-lucrative 3-day holding periods from Yahoo stock market data.
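
To show the shape of the MADlib call involved, here is a minimal sketch of training a logistic regression model. The table and column names (tv14train, y_is_up, feature_vec) are assumptions made for illustration; the real invocations live in the scripts linked below:

# Illustration only; see the linked scripts for the real thing.
psql<<EOF
-- y_is_up would be TRUE when ng3 > 0; feature_vec would be an array
-- of features such as moving average slopes.
SELECT madlib.logregr_train(
 'tv14train'      -- in-sample vectors (assumed name)
,'logregr_model'  -- output table for the fitted model (assumed name)
,'y_is_up'        -- dependent variable (assumed name)
,'feature_vec'    -- independent variables (assumed name)
);

-- Inspect the fitted coefficients:
SELECT coef FROM logregr_model;
EOF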

I approached this task by first installing CentOS 6.4 on a Windows8 laptop inside VirtualBox via Vagrant.

A general discussion about Windows8, VirtualBox, and Vagrant can be found in a tech-tip I wrote recently:

http://bikle.com/techtips/cygwin#vbox

After CentOS was ready, I installed Postgres 9.2 and MADlib 1.3.

To do that I followed instructions in a tech-tip I wrote recently:

http://www.bikle.com/techtips/two#madlib

The above tech-tip describes installation on a cloud server. I encourage you to try that route if you lack the patience to wrestle with a Windows8 laptop.

Then, inside my home-dir I created a dir named 'tv':

mkdir ~/tv/

tv is short for Ticker Vector.

Next, I used a simple script to download ticker-price data from Yahoo:

https://github.com/danbikle/stk_mkt_madlib_logregr/blob/master/wget_ydata.bash

It placed some CSV data in /tmp/ydata/


/tmp/ydata:
drwxr-xr-x   2 dan  dan      4096 Dec  1 00:45 .
-rw-rw-r--   1 dan  dan    632972 Dec  1 00:42 BA.csv
-rw-rw-r--   1 dan  dan    628196 Dec  1 00:42 CAT.csv
-rw-rw-r--   1 dan  dan    538180 Dec  1 00:42 CVX.csv
-rw-rw-r--   1 dan  dan    646825 Dec  1 00:42 DD.csv
-rw-rw-r--   1 dan  dan    640520 Dec  1 00:42 DIS.csv
-rw-rw-r--   1 dan  dan    526305 Dec  1 00:42 ED.csv
-rw-rw-r--   1 dan  dan    643640 Dec  1 00:42 GE.csv
-rw-rw-r--   1 dan  dan    531530 Dec  1 00:42 HON.csv
-rw-rw-r--   1 dan  dan    635981 Dec  1 00:42 HPQ.csv
-rw-rw-r--   1 dan  dan    672226 Dec  1 00:42 IBM.csv
-rw-rw-r--   1 dan  dan    541351 Dec  1 00:42 JNJ.csv
-rw-rw-r--   1 dan  dan    640120 Dec  1 00:42 KO.csv
-rw-rw-r--   1 dan  dan    537778 Dec  1 00:42 MCD.csv
-rw-rw-r--   1 dan  dan    542805 Dec  1 00:42 MMM.csv
-rw-rw-r--   1 dan  dan    539932 Dec  1 00:42 MO.csv
-rw-rw-r--   1 dan  dan    545754 Dec  1 00:42 MRK.csv
-rw-rw-r--   1 dan  dan    521968 Dec  1 00:42 MRO.csv
-rw-rw-r--   1 dan  dan    514894 Dec  1 00:42 NAV.csv
-rw-rw-r--   1 dan  dan    540479 Dec  1 00:43 PG.csv
-rw-rw-r--   1 dan  dan    277346 Dec  1 00:44 SPY.csv
-rw-rw-r--   1 dan  dan    540494 Dec  1 00:43 XOM.csv

Then, I loaded the above CSV data into Postgres using a shell script named load_ydata.bash:

https://github.com/danbikle/stk_mkt_madlib_logregr/blob/master/postgres/load_ydata.bash

Next, I removed data corruptions in the CSV data caused by events called stock splits.

I wrote a short tech-tip about dealing with stock splits recently:

http://www.bikle.com/techtips/libsvm#splitdate

The above tech-tip describes the technique for just 4 tickers.

In this effort I had 21 tickers and used a shell script:

https://github.com/danbikle/stk_mkt_madlib_logregr/blob/master/cr_upd_cp.bash

to generate a SQL script:

https://github.com/danbikle/stk_mkt_madlib_logregr/blob/master/update_closing_price.sql

Next, I ran the above SQL script.

Then, I ran scripts to build vectors for MADlib. After building the vectors, I created an in-sample set of vectors dated before 2010-01-01.

The above two steps are handled by the scripts listed below:

https://github.com/danbikle/stk_mkt_madlib_logregr/blob/master/many_mad12.bash

https://github.com/danbikle/stk_mkt_madlib_logregr/blob/master/cr_logreg_vectors.bash

https://github.com/danbikle/stk_mkt_madlib_logregr/blob/master/cr_logreg_vectors.sql

Next, I ran scripts which learn from the in-sample vectors and then predict future prices of out-of-sample vectors.


./logreg_predict.bash  SPY 2010-01-01
./logreg_predict.bash  BA  2010-01-01
./logreg_predict.bash  CAT 2010-01-01
./logreg_predict.bash  CVX 2010-01-01
./logreg_predict.bash  DD  2010-01-01
./logreg_predict.bash  DIS 2010-01-01
./logreg_predict.bash  ED  2010-01-01
./logreg_predict.bash  GE  2010-01-01
./logreg_predict.bash  HON 2010-01-01
./logreg_predict.bash  HPQ 2010-01-01
./logreg_predict.bash  IBM 2010-01-01
./logreg_predict.bash  JNJ 2010-01-01
./logreg_predict.bash  KO  2010-01-01
./logreg_predict.bash  MCD 2010-01-01
./logreg_predict.bash  MMM 2010-01-01
./logreg_predict.bash  MO  2010-01-01
./logreg_predict.bash  MRK 2010-01-01
./logreg_predict.bash  MRO 2010-01-01
./logreg_predict.bash  NAV 2010-01-01
./logreg_predict.bash  PG  2010-01-01
./logreg_predict.bash  XOM 2010-01-01

The predictions are captured in a table named predictions12.

I report on predictions12 using:

https://github.com/danbikle/stk_mkt_madlib_logregr/blob/master/qry_predictions12.sql

Results of the report are displayed below:


dan@madlibsvm ~/tv $ 
dan@madlibsvm ~/tv $ 
dan@madlibsvm ~/tv $ 
dan@madlibsvm ~/tv $ ./psqlmad.bash -f qry_predictions12.sql
--
-- ~/tv/qry_predictions12.sql
--
-- Look at Bearish predictions:
SELECT 
COUNT(tkr)                     ccount
,ROUND(AVG(ng3),4)             avg_ng3
,ROUND(SUM(ng3),4)             sum_ng3
,ROUND(AVG(ng3)/STDDEV(ng3),4) sharpe_r 
FROM predictions12 
WHERE prob<0.45
;
 ccount | avg_ng3 | sum_ng3 | sharpe_r 
--------+---------+---------+----------
    543 |  0.0014 |  0.7636 |   0.0376
(1 row)

-- Look at Bullish predictions:
SELECT 
COUNT(tkr)                     ccount
,ROUND(AVG(ng3),4)             avg_ng3
,ROUND(SUM(ng3),4)             sum_ng3
,ROUND(AVG(ng3)/STDDEV(ng3),4) sharpe_r 
FROM predictions12 
WHERE prob>0.55
;
 ccount | avg_ng3 | sum_ng3 | sharpe_r 
--------+---------+---------+----------
    561 |  0.0074 |  4.1264 |   0.1783
(1 row)

-- Look at Bearish predictions for each ticker:
SELECT 
tkr
,COUNT(tkr)                    ccount
,ROUND(AVG(ng3),4)             avg_ng3
,ROUND(SUM(ng3),4)             sum_ng3
,ROUND(AVG(ng3)/STDDEV(ng3),4) sharpe_r 
FROM predictions12 
WHERE prob<0.45 
GROUP BY tkr 
HAVING COUNT(tkr) > 9 
ORDER BY SUM(ng3)
;
 tkr | ccount | avg_ng3 | sum_ng3 | sharpe_r 
-----+--------+---------+---------+----------
 HON |     23 | -0.0083 | -0.1902 |  -0.2784
 BA  |     42 | -0.0027 | -0.1118 |  -0.1233
 MRK |     12 | -0.0039 | -0.0472 |  -0.2004
 GE  |     31 | -0.0014 | -0.0426 |  -0.0680
 DIS |     26 | -0.0014 | -0.0375 |  -0.0616
 IBM |     10 | -0.0022 | -0.0216 |  -0.1350
 XOM |     12 | -0.0014 | -0.0164 |  -0.0800
 NAV |    127 |  0.0004 |  0.0485 |   0.0065
 MRO |     60 |  0.0019 |  0.1111 |   0.0494
 CVX |     16 |  0.0070 |  0.1122 |   0.5853
 HPQ |     87 |  0.0014 |  0.1243 |   0.0460
 DD  |     34 |  0.0070 |  0.2383 |   0.2706
 CAT |     51 |  0.0132 |  0.6723 |   0.4724
(13 rows)

-- Look at Bullish predictions for each ticker:
SELECT 
tkr
,COUNT(tkr)                    ccount
,ROUND(AVG(ng3),4)             avg_ng3
,ROUND(SUM(ng3),4)             sum_ng3
,ROUND(AVG(ng3)/STDDEV(ng3),4) sharpe_r 
FROM predictions12 
WHERE prob>0.55
GROUP BY tkr 
HAVING COUNT(tkr) > 9 
ORDER BY SUM(ng3)
;
 tkr | ccount | avg_ng3 | sum_ng3 | sharpe_r 
-----+--------+---------+---------+----------
 MRK |     22 | -0.0049 | -0.1082 |  -0.1302
 CVX |     36 |  0.0001 |  0.0053 |   0.0036
 XOM |     26 |  0.0006 |  0.0158 |   0.0149
 HPQ |     43 |  0.0020 |  0.0846 |   0.0695
 MRO |     48 |  0.0031 |  0.1465 |   0.0644
 MMM |     22 |  0.0090 |  0.1976 |   0.3062
 SPY |     18 |  0.0137 |  0.2465 |   0.4053
 GE  |     42 |  0.0061 |  0.2583 |   0.1642
 DIS |     29 |  0.0102 |  0.2949 |   0.2414
 CAT |     57 |  0.0056 |  0.3208 |   0.1268
 BA  |     30 |  0.0109 |  0.3281 |   0.2243
 IBM |     24 |  0.0172 |  0.3949 |   0.5721
 NAV |     67 |  0.0075 |  0.4995 |   0.1288
 DD  |     39 |  0.0152 |  0.5931 |   0.4670
 HON |     28 |  0.0226 |  0.6327 |   0.5190
(15 rows)

-- Classic Accuracy Calculation.
-- True Positives + True Negatives here:
SELECT 
COUNT(tkr) ccount 
FROM predictions12 
WHERE SIGN(prob-0.5) = SIGN(ng3)
;
 ccount 
--------
   6963
(1 row)

-- All observations
SELECT COUNT(tkr) FROM predictions12;
 count 
-------
 13902
(1 row)

-- Accuracy is (Count(True Positives) + Count(True Negatives)) / Count(All observations)
-- Usually it is near 50%.
-- Confusion Matrix Calculations.
-- True Positives:
SELECT COUNT(tkr) ccount FROM predictions12 WHERE prob>0.5 AND ng3>0;
 ccount 
--------
   3234
(1 row)

-- True Negatives:
SELECT COUNT(tkr) ccount FROM predictions12 WHERE prob<0.5 AND ng3<0;
 ccount 
--------
   3729
(1 row)

-- False Positives:
SELECT COUNT(tkr) ccount FROM predictions12 WHERE prob>0.5 AND ng3<0;
 ccount 
--------
   2521
(1 row)

-- False Negatives:
SELECT COUNT(tkr) ccount FROM predictions12 WHERE prob<0.5 AND ng3>0;
 ccount 
--------
   4329
(1 row)

-- end
dan@madlibsvm ~/tv $ 
dan@madlibsvm ~/tv $ 
dan@madlibsvm ~/tv $ 
dan@madlibsvm ~/tv $ 
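
Working the accuracy number out from the counts in the report above (true positives plus true negatives, divided by all observations):

# (3234 + 3729) / 13902  is roughly 0.50, i.e. near coin-flip accuracy
echo 'scale=4; (3234 + 3729) / 13902' | bc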



Results Discussion and Interpretation

I start the discussion by directing your attention to the query displayed below:


SELECT 
AVG(ng3) avg_ng3
FROM tv20st
;

          avg_ng3           
----------------------------
 0.001419360793550421545907

The query tells me that the average normalized 3-day gain for the 21 tickers in this data set is 0.0014 which is 0.14%.

The Logistic Regression software described here issues two types of predictions, Bullish and Bearish.

The query displayed below shows the results of the Bearish predictions:


dan@madlibsvm ~/tv $ 
dan@madlibsvm ~/tv $ 
dan@madlibsvm ~/tv $ ./psqlmad.bash -f qry_predictions12.sql
--
-- ~/tv/qry_predictions12.sql
--
SELECT 
COUNT(tkr)                     ccount
,ROUND(AVG(ng3),4)             avg_ng3
,ROUND(SUM(ng3),4)             sum_ng3
,ROUND(AVG(ng3)/STDDEV(ng3),4) sharpe_r 
FROM predictions12 
WHERE prob<0.45
;
 ccount | avg_ng3 | sum_ng3 | sharpe_r 
--------+---------+---------+----------
    543 |  0.0014 |  0.7636 |   0.0376
(1 row)

I can see that the Bearish predictions resulted in an average gain of 0.14%.

This was equivalent to the average gain for all the observations. This signifies that the Logistic Regression model I built lacks the ability to predict downward 3-day movements on this data set.

I suspect that the reason for this is that the model learned from data between 1962 and 2010. This in-sample data set was fundamentally different from the out-of-sample data set, which contains observations for the years 2010, 2011, 2012, and most of 2013. Stocks during this more recent period displayed unusual upward movement, possibly due to Fed policy. On 2010-01-04, SPY opened at 112. On 2013-11-29, SPY closed at 181. This is a gain of 62%, which is unusually large.

The query displayed below shows the results of the Bullish predictions:


SELECT 
COUNT(tkr)                     ccount
,ROUND(AVG(ng3),4)             avg_ng3
,ROUND(SUM(ng3),4)             sum_ng3
,ROUND(AVG(ng3)/STDDEV(ng3),4) sharpe_r 
FROM predictions12 
WHERE prob>0.55
;
 ccount | avg_ng3 | sum_ng3 | sharpe_r 
--------+---------+---------+----------
    561 |  0.0074 |  4.1264 |   0.1783

I can see that the Bullish predictions resulted in an average gain of 0.74% which is over 5 times larger than the average of 0.14%. This is an excellent result and convinces me that a Logistic Regression model built from slopes of moving averages yields significant predictive power.





Split Date Adjustments of Stock Market Data For Machine Learning



This tech-tip demonstrates a technique to adjust closing price data for a stock after that stock 'splits'. Making this adjustment allows me to keep my moving average calculations accurate. These calculations are critical to teaching a machine learning algorithm about future price directions. For example, on 1999-06-21 Cisco (symbol: CSCO) split its stock. According to the Yahoo URL for CSCO I can see this:

http://finance.yahoo.com/q/hp?s=CSCO+Historical+Prices

On 1999-06-21 CSCO closed at 123.12 and then the next trading day, 1999-06-22, CSCO opened at 61.00. So, to keep the moving average calculation accurate, I adjust my CSCO price data by multiplying closing prices after the split date by 2. The SQL I use to adjust the data can be described with the statement below:

UPDATE ydata
SET closing_price = closing_price * 2
WHERE tkr = 'CSCO'
AND ydate > '1999-06-21';
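
A quick spot check, assuming CSCO rows are loaded into ydata, shows whether the discontinuity around the split date is gone after the UPDATE:

psql<<EOF
SELECT tkr, ydate, closing_price
FROM ydata
WHERE tkr = 'CSCO'
AND ydate BETWEEN '1999-06-18' AND '1999-06-24'
ORDER BY ydate
;
EOF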

I provide some context for the above statement by describing how I populate the ydata table. This is a multistep process starting with a wget-script.

The wget-script I display below is a simple demo. When I get serious about machine learning stock data, I use something a bit more elaborate which loops through a list of several hundred ticker symbols and wgets price data for each symbol.

#!/bin/bash
# ~/sa/wget_ydata.bash
# I use this script to wget some yahoo stock prices.
mkdir -p /tmp/ydata/
cd       /tmp/ydata/
wget --output-document=DD.csv  http://ichart.finance.yahoo.com/table.csv?s=DD
wget --output-document=DIS.csv http://ichart.finance.yahoo.com/table.csv?s=DIS
wget --output-document=IBM.csv http://ichart.finance.yahoo.com/table.csv?s=IBM
wget --output-document=KO.csv  http://ichart.finance.yahoo.com/table.csv?s=KO
exit

A screen dump of me running the above script is displayed below:

dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ ./wget_ydata.bash 
--2013-11-28 20:33:40--  http://ichart.finance.yahoo.com/table.csv?s=DD
Resolving ichart.finance.yahoo.com (ichart.finance.yahoo.com)... 67.195.146.181
Connecting to ichart.finance.yahoo.com (ichart.finance.yahoo.com)|67.195.146.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: `DD.csv'

    [           <=>                         ] 646,776      311K/s   in 2.0s    

2013-11-28 20:33:42 (311 KB/s) - `DD.csv' saved [646776]

--2013-11-28 20:33:42--  http://ichart.finance.yahoo.com/table.csv?s=DIS
Resolving ichart.finance.yahoo.com (ichart.finance.yahoo.com)... 67.195.146.181
Connecting to ichart.finance.yahoo.com (ichart.finance.yahoo.com)|67.195.146.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: `DIS.csv'

    [            <=>                        ] 640,471      251K/s   in 2.5s    

2013-11-28 20:33:45 (251 KB/s) - `DIS.csv' saved [640471]

--2013-11-28 20:33:45--  http://ichart.finance.yahoo.com/table.csv?s=IBM
Resolving ichart.finance.yahoo.com (ichart.finance.yahoo.com)... 67.195.146.181
Connecting to ichart.finance.yahoo.com (ichart.finance.yahoo.com)|67.195.146.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: `IBM.csv'

    [              <=>                      ] 672,172      227K/s   in 2.9s    

2013-11-28 20:33:49 (227 KB/s) - `IBM.csv' saved [672172]

--2013-11-28 20:33:49--  http://ichart.finance.yahoo.com/table.csv?s=KO
Resolving ichart.finance.yahoo.com (ichart.finance.yahoo.com)... 67.195.146.181
Connecting to ichart.finance.yahoo.com (ichart.finance.yahoo.com)|67.195.146.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: `KO.csv'

    [          <=>                          ] 640,071      311K/s   in 2.0s    

2013-11-28 20:33:51 (311 KB/s) - `KO.csv' saved [640071]

dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 

After I run the above script, I work towards loading the data into a table using either Postgres or Oracle. I prefer to work with Postgres unless I need to access the SVM algorithm inside of Oracle Data Mining.

This tech-tip describes working with Postgres. If you want to work with Oracle, refer to the scripts in the git-repo:

https://github.com/danbikle/split_adjuster/tree/master/oracle

Before I load the CSV data into Postgres, I create a table using syntax displayed below:

--
-- ~/sa/postgres/cr_ydata.sql
--
--
-- I use this script to create table ydata which holds price data from
-- yahoo.  Note that this Postgres syntax is different than the syntax
-- I use for Oracle.
--
--
CREATE TABLE ydata
(
tkr VARCHAR(9)
,ydate   DATE
,opn     DECIMAL
,mx      DECIMAL
,mn      DECIMAL
,closing_price DECIMAL
,vol     DECIMAL
,adjclse DECIMAL
,closing_price_orig DECIMAL
)
;

Next, I depend on a script to load the CSV data into Postgres:


#!/bin/bash

# ~/sa/postgres/load_ydata.bash

# I use this script to load data from CSV files into table, ydata.

# This script is only a demo.
# If I want to load hundreds of tickers, instead of just four,
# I use SQL to build a script and then I run that.

# I already ran the wget script below so I comment it out.
# You might find it useful to run it here:
# ~/sa/wget_ydata.bash
# If you did run the above wget script,
# you should have four CSV files in /tmp/ydata/

# I add tkr values to the CSV data and create one large CSV file.

rm -f /tmp/ydata/ydata.csv
grep -v Date /tmp/ydata/DD.csv  | sed '1,$s/^/DD,/'  >> /tmp/ydata/ydata.csv
grep -v Date /tmp/ydata/DIS.csv | sed '1,$s/^/DIS,/' >> /tmp/ydata/ydata.csv
grep -v Date /tmp/ydata/IBM.csv | sed '1,$s/^/IBM,/' >> /tmp/ydata/ydata.csv
grep -v Date /tmp/ydata/KO.csv  | sed '1,$s/^/KO,/'  >> /tmp/ydata/ydata.csv

echo 'Here is head and tail of the CSV file I want to load:'
head -3 /tmp/ydata/ydata.csv
tail -3 /tmp/ydata/ydata.csv

# Time for me to call psql which calls the COPY command to copy
# rows out of /tmp/ydata/ydata.csv
# into the table, ydata.

# Ensure that postgres server can read the data:
chmod 755 /tmp/ydata/
chmod 644 /tmp/ydata/ydata.csv

# I assume postgres authentication is setup.
# I follow clues I wrote here: 
# ~/sa/postgres/readme_authentication.txt

echo 'I might see an error here:'
echo 'ERROR:  relation "ydata" already exists'
echo 'It is okay. I need to ensure that ydata exists'
echo 'before I try to fill it.'
psql<<EOF
-- Ensure that ydata exists.
-- I might get an error here:
\i cr_ydata.sql 

-- Assume current data in ydata is not needed.
-- Since this is only a demo, I can toss it in the trash:

TRUNCATE TABLE ydata;

-- Now fill ydata

COPY ydata (
tkr
,ydate     
,opn      
,mx
,mn
,closing_price
,vol
,adjclse
) FROM '/tmp/ydata/ydata.csv' WITH csv
;

EOF

# At this point,
# my table, ydata, should be full of rows from /tmp/ydata/ydata.csv'

echo 'Here is the load report:'

psql<<EOF
SELECT MIN(ydate),COUNT(tkr),MAX(ydate) FROM ydata;

SELECT tkr, MIN(ydate),COUNT(tkr),MAX(ydate) FROM ydata GROUP BY tkr ORDER BY tkr ;
EOF


# Since I am about to UPDATE the closing_price column,
# I will backup the data in it.
echo 'The command below might issue an error:'
echo 'ERROR:  table "ydata_copy" does not exist'
echo 'I need to drop it before I create and refill it.'
psql<<EOF
DROP   TABLE ydata_copy;
-- Above command might give error

CREATE TABLE ydata_copy AS
SELECT
tkr
,ydate   
,opn     
,mx      
,mn      
,closing_price
,vol     
,adjclse 
,closing_price AS closing_price_orig
FROM ydata
ORDER BY tkr,ydate
;

DROP   TABLE ydata;
CREATE TABLE ydata AS SELECT * FROM ydata_copy;

-- Run the load report again

SELECT MIN(ydate),COUNT(tkr),MAX(ydate) FROM ydata;

SELECT tkr, MIN(ydate),COUNT(tkr),MAX(ydate) FROM ydata GROUP BY tkr ORDER BY tkr ;
EOF

# So, now I have the data loaded.  Usually my next step is to cd ../
# and call 
#  ../cr_upd_cp.bash 
# to create 
#  ../update_closing_price.sql.
# Then, I call update_closing_price.sql to UPDATE closing_price in ydata.

# Another thing I might do is look for abrupt, abnormal 
# changes in closing_price using ../qry_abrupt.sql
# After I UPDATE, those abrupt changes (due to stock splits) 
# should be gone.

# The scripts in ../ should work for both Oracle and Postgres.

exit

The main coding observation I offer about the above script is that it is a shell script with SQL embedded inside of it. I am currently tempted to refactor the script so that I move the SQL syntax out into separate files. For a blog post though, I placed everything in one script so the reader could step through each line and study the flow.

A screen dump of me running the above script is displayed below:


dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ cd postgres/
dan@ub97hp_scikit ~/sa/postgres $ 
dan@ub97hp_scikit ~/sa/postgres $ 
dan@ub97hp_scikit ~/sa/postgres $ ll
total 20
drwxrwxr-x 2 dan dan 4096 Nov 28 19:58 ./
drwxr-xr-x 5 dan dan 4096 Nov 28 19:58 ../
-rw-rw-r-- 1 dan dan  392 Nov 28 19:58 cr_ydata.sql
-rwxrwxr-x 1 dan dan 3358 Nov 28 19:58 load_ydata.bash*
-rw-rw-r-- 1 dan dan  758 Nov 28 19:58 readme_authentication.txt
dan@ub97hp_scikit ~/sa/postgres $ 
dan@ub97hp_scikit ~/sa/postgres $ 
dan@ub97hp_scikit ~/sa/postgres $ ./load_ydata.bash 
Here is head and tail of the CSV file I want to load:
DD,2013-11-27,61.50,61.81,61.30,61.54,1598200,61.54
DD,2013-11-26,61.61,61.74,61.16,61.32,2147900,61.32
DD,2013-11-25,62.00,62.19,61.38,61.58,3772100,61.58
KO,1962-01-04,99.00,100.25,99.00,99.50,211200,0.26
KO,1962-01-03,99.50,99.50,97.25,98.75,393600,0.26
KO,1962-01-02,101.00,103.75,101.00,101.00,201600,0.26
I might see an error here:
ERROR:  relation "ydata" already exists
It is okay. I need to ensure that ydata exists
before I try to fill it.
CREATE TABLE
TRUNCATE TABLE
COPY 52271
Here is the load report:
    min     | count |    max     
------------+-------+------------
 1962-01-02 | 52271 | 2013-11-27
(1 row)

 tkr |    min     | count |    max     
-----+------------+-------+------------
 DD  | 1962-01-02 | 13067 | 2013-11-27
 DIS | 1962-01-02 | 13068 | 2013-11-27
 IBM | 1962-01-02 | 13068 | 2013-11-27
 KO  | 1962-01-02 | 13068 | 2013-11-27
(4 rows)

The command below might issue an error:
ERROR:  table "ydata_copy" does not exist
I need to drop it before I create and refill it.
ERROR:  table "ydata_copy" does not exist
SELECT 52271
DROP TABLE
SELECT 52271
    min     | count |    max     
------------+-------+------------
 1962-01-02 | 52271 | 2013-11-27
(1 row)

 tkr |    min     | count |    max     
-----+------------+-------+------------
 DD  | 1962-01-02 | 13067 | 2013-11-27
 DIS | 1962-01-02 | 13068 | 2013-11-27
 IBM | 1962-01-02 | 13068 | 2013-11-27
 KO  | 1962-01-02 | 13068 | 2013-11-27
(4 rows)

dan@ub97hp_scikit ~/sa/postgres $ 
dan@ub97hp_scikit ~/sa/postgres $ 
dan@ub97hp_scikit ~/sa/postgres $ 

Next, I work towards locating split dates for the four tickers that I had just loaded. This is a manual process. The web is a good source of split dates:

https://www.google.com/search?q=stock+split+dates

Also, to ease the chore, I wrote a SQL script which flags rows that had large changes in price over the span of one day:

--
-- ~/sa/qry_abrupt.sql
--

-- I use this script to look for abrupt changes in closing prices.
-- Sometimes they are due to stock splits.

-- Usually I get split dates off sites like Yahoo but this script
-- can be useful.

SELECT
tkr
,ydate
,closing_price
,ROUND(closing_price/cp_next_day,1) poss_split_ratio
,cp_next_day
,cp_next_day - closing_price AS price_delta
,100 * (cp_next_day - closing_price)/closing_price AS price_delta_pct
FROM
(
  SELECT
  tkr
  ,ydate
  ,closing_price
  ,LEAD(closing_price) OVER (PARTITION BY tkr ORDER BY ydate) AS cp_next_day
  FROM ydata
  WHERE closing_price > 0
) subq
-- Any price delta > 20 pct is suspicious:
WHERE ABS(100 * (cp_next_day - closing_price)/closing_price) > 20
ORDER BY tkr,ydate
;

A screen dump of me running the above script is displayed below:

dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ psql -P pager=no -f qry_abrupt.sql 
 tkr |   ydate    | closing_price | poss_split_ratio | cp_next_day | price_delta |   price_delta_pct    
-----+------------+---------------+------------------+-------------+-------------+----------------------
 DD  | 1979-06-28 |        128.25 |              3.1 |       41.25 |      -87.00 | -67.8362573099415205
 DD  | 1990-01-19 |        124.00 |              3.1 |       39.88 |      -84.12 | -67.8387096774193548
 DD  | 1997-06-12 |        115.62 |              2.0 |       58.88 |      -56.74 | -49.0745545753329874
 DIS | 1967-11-15 |        105.00 |              1.9 |       54.00 |      -51.00 | -48.5714285714285714
 DIS | 1971-02-26 |        177.75 |              1.9 |       93.12 |      -84.63 | -47.6118143459915612
 DIS | 1973-01-15 |        214.50 |              2.0 |      108.00 |     -106.50 | -49.6503496503496503
 DIS | 1986-03-05 |        142.63 |              4.0 |       35.75 |     -106.88 | -74.9351468835448363
 DIS | 1987-10-16 |         64.87 |              1.4 |       46.00 |      -18.87 | -29.0889471250192693
 DIS | 1992-05-15 |        152.88 |              3.9 |       39.63 |     -113.25 | -74.0777080062794349
 DIS | 1998-07-09 |        111.00 |              2.9 |       38.13 |      -72.87 | -65.6486486486486486
 IBM | 1966-05-17 |        516.50 |              1.5 |      353.00 |     -163.50 | -31.6553727008712488
 IBM | 1968-04-22 |        636.00 |              1.9 |      327.50 |     -308.50 | -48.5062893081761006
 IBM | 1973-05-25 |        408.00 |              1.3 |      323.50 |      -84.50 | -20.7107843137254902
 IBM | 1979-05-31 |        304.00 |              4.0 |       76.25 |     -227.75 | -74.9177631578947368
 IBM | 1987-10-16 |        135.00 |              1.3 |      103.25 |      -31.75 | -23.5185185185185185
 IBM | 1997-05-27 |        179.25 |              2.0 |       90.12 |      -89.13 | -49.7238493723849372
 IBM | 1999-05-26 |        236.25 |              2.0 |      116.00 |     -120.25 | -50.8994708994708995
 KO  | 1965-02-18 |        148.75 |              2.0 |       74.50 |      -74.25 | -49.9159663865546218
 KO  | 1968-05-31 |        147.50 |              2.0 |       74.00 |      -73.50 | -49.8305084745762712
 KO  | 1977-05-31 |         71.87 |              1.9 |       37.25 |      -34.62 | -48.1703074996521497
 KO  | 1986-06-30 |        125.50 |              2.9 |       42.75 |      -82.75 | -65.9362549800796813
 KO  | 1987-10-16 |         40.50 |              1.3 |       30.50 |      -10.00 | -24.6913580246913580
 KO  | 1990-05-11 |         83.25 |              2.0 |       42.00 |      -41.25 | -49.5495495495495495
 KO  | 1992-05-11 |         85.37 |              2.0 |       42.88 |      -42.49 | -49.7715825231345906
 KO  | 1996-05-10 |         83.87 |              2.0 |       43.00 |      -40.87 | -48.7301776558960296
 KO  | 2012-08-10 |         78.79 |              2.0 |       39.30 |      -39.49 | -50.1205736768625460
(26 rows)

dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 

A bit of detective work reveals that all of the dates listed above were split dates except 1987-10-16, which is an outlier that I want my algorithms to learn about:

https://www.google.com/search?q=stock+market+crash+october+1987

After I locate split dates for stocks I enter them into a simple flat file. I have not yet automated this process so I do it by hand. The file looks like this:


# ~/sa/tkr_split_dates.csv
# This file is a demo of a file I use to collect split dates of tickers.
# Once this file is up to date,
# I use a script to read this file and issue a series of 
# UPDATE commands to make closing_price in ydata suitable for
# creating moving average calculations.

# Originally I wanted to separate the three columns with comma.
# But now I like plain white space.

# Some of the data below was obtained from ~/sa/qry_abrupt.sql
# Some was obtained from google of 'stock split dates.'

# Format:
# ticker  date  old_price/new_price

DD   1979-06-28  3
DD   1990-01-19  3
DD   1997-06-12  2 

DIS  1962-12-17  103/100
DIS  1967-11-15  2
DIS  1971-02-26  2
DIS  1973-01-15  2
DIS  1986-03-05  4
DIS  1992-05-15  4
DIS  1998-07-09  3

IBM  1964-05-15  5/4
IBM  1966-05-17  3/2
IBM  1968-04-22  2
IBM  1973-05-25  5/4
IBM  1979-05-31  4 
IBM  1997-05-27  2
IBM  1999-05-26  2

KO   1965-02-18  2  
KO   1968-05-31  2  
KO   1977-05-31  2
KO   1986-06-30  3
KO   1990-05-11  2
KO   1992-05-11  2
KO   1996-05-10  2
KO   2012-08-10  2

Next, I use a script to copy data out of the above flat file into a SQL script:


#!/bin/bash

# ~/sa/cr_upd_cp.bash

# I use this script to create an UPDATE closing_price script 
# from tkr_split_dates.csv

echo "This file created by $0" > update_closing_price_awk.txt
grep 1 tkr_split_dates.csv|awk '{print "UPDATE ydata SET closing_price=closing_price*"$3" WHERE tkr=:"$1": AND ydate>:"$2":;"}' >> update_closing_price_awk.txt

head -3 update_closing_price_awk.txt
tail -3 update_closing_price_awk.txt
echo "-- This script created by $0" > update_closing_price.sql
grep 1 update_closing_price_awk.txt | sed '1,$s/:/'"'"'/g' >> update_closing_price.sql
echo 'COMMIT;' >> update_closing_price.sql

head -3 update_closing_price.sql
tail -3 update_closing_price.sql

echo I should be able to run
echo update_closing_price.sql
echo now.

echo Oracle:
echo 'sqlplus trade/t @update_closing_price.sql'
echo Postgres:
echo 'psql -f update_closing_price.sql'

exit

A screen dump of me running the above script is displayed below:


dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ ./cr_upd_cp.bash 
This file created by ./cr_upd_cp.bash
UPDATE ydata SET closing_price=closing_price*3 WHERE tkr=:DD: AND ydate>:1979-06-28:;
UPDATE ydata SET closing_price=closing_price*3 WHERE tkr=:DD: AND ydate>:1990-01-19:;
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr=:KO: AND ydate>:1992-05-11:;
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr=:KO: AND ydate>:1996-05-10:;
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr=:KO: AND ydate>:2012-08-10:;
-- This script created by ./cr_upd_cp.bash
UPDATE ydata SET closing_price=closing_price*3 WHERE tkr='DD' AND ydate>'1979-06-28';
UPDATE ydata SET closing_price=closing_price*3 WHERE tkr='DD' AND ydate>'1990-01-19';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'1996-05-10';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'2012-08-10';
COMMIT;
I should be able to run
update_closing_price.sql
now.
Oracle:
sqlplus trade/t @update_closing_price.sql
Postgres:
psql -f update_closing_price.sql
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 



dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ cat update_closing_price.sql
-- This script created by ./cr_upd_cp.bash
UPDATE ydata SET closing_price=closing_price*3 WHERE tkr='DD' AND ydate>'1979-06-28';
UPDATE ydata SET closing_price=closing_price*3 WHERE tkr='DD' AND ydate>'1990-01-19';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='DD' AND ydate>'1997-06-12';
UPDATE ydata SET closing_price=closing_price*103/100 WHERE tkr='DIS' AND ydate>'1962-12-17';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='DIS' AND ydate>'1967-11-15';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='DIS' AND ydate>'1971-02-26';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='DIS' AND ydate>'1973-01-15';
UPDATE ydata SET closing_price=closing_price*4 WHERE tkr='DIS' AND ydate>'1986-03-05';
UPDATE ydata SET closing_price=closing_price*4 WHERE tkr='DIS' AND ydate>'1992-05-15';
UPDATE ydata SET closing_price=closing_price*3 WHERE tkr='DIS' AND ydate>'1998-07-09';
UPDATE ydata SET closing_price=closing_price*5/4 WHERE tkr='IBM' AND ydate>'1964-05-15';
UPDATE ydata SET closing_price=closing_price*3/2 WHERE tkr='IBM' AND ydate>'1966-05-17';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='IBM' AND ydate>'1968-04-22';
UPDATE ydata SET closing_price=closing_price*5/4 WHERE tkr='IBM' AND ydate>'1973-05-25';
UPDATE ydata SET closing_price=closing_price*4 WHERE tkr='IBM' AND ydate>'1979-05-31';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='IBM' AND ydate>'1997-05-27';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='IBM' AND ydate>'1999-05-26';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'1965-02-18';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'1968-05-31';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'1977-05-31';
UPDATE ydata SET closing_price=closing_price*3 WHERE tkr='KO' AND ydate>'1986-06-30';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'1990-05-11';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'1992-05-11';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'1996-05-10';
UPDATE ydata SET closing_price=closing_price*2 WHERE tkr='KO' AND ydate>'2012-08-10';
COMMIT;
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 

A screen dump of me running the generated script is displayed below:


dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ psql -P pager=no -f update_closing_price.sql 
UPDATE 8683
UPDATE 6013
UPDATE 4143
UPDATE 12825
UPDATE 11587
UPDATE 10788
UPDATE 10313
UPDATE 6994
UPDATE 5426
UPDATE 3873
UPDATE 12470
UPDATE 11965
UPDATE 11481
UPDATE 10222
UPDATE 8703
UPDATE 4155
UPDATE 3651
UPDATE 12278
UPDATE 11453
UPDATE 9208
UPDATE 6913
UPDATE 5935
UPDATE 5430
UPDATE 4418
UPDATE 326
psql:update_closing_price.sql:27: WARNING:  there is no transaction in progress
COMMIT
dan@ub97hp_scikit ~/sa $ 
dan@ub97hp_scikit ~/sa $ 

At this point I considered the closing_price column in ydata to contain split-adjusted prices suitable for the moving-average calculations needed by my machine learning algorithms.
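
To make the split adjustment concrete, here is a minimal Python sketch with made-up prices (not my actual ydata rows). A 2-for-1 split halves the quoted price, so multiplying every post-split closing price by the split ratio, exactly like the generated UPDATE statements, removes the artificial drop:

# split_adjust_sketch.py
# A sketch with hypothetical prices showing the effect of the UPDATE statements:
# multiply closing prices dated after the split by the split ratio.

raw = [
    ('2012-08-09', 80.0),   # day before a 2-for-1 split
    ('2012-08-10', 80.5),   # split date
    ('2012-08-13', 40.4),   # first post-split quote, roughly half
    ('2012-08-14', 40.9),
]

split_date  = '2012-08-10'
split_ratio = 2

# Same rule as the SQL: closing_price * ratio WHERE ydate > split_date
adjusted = [(d, p * split_ratio if d > split_date else p) for d, p in raw]

for d, p in adjusted:
    print(d, p)
# The artificial 50% drop disappears, so the price series stays continuous
# and moving averages stay meaningful.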




Page Top

Stock Market Backtest with PostgreSQL 9.2 and LibSVM



The aim of this tech-tip is to help you operate the LibSVM machine learning software.

LibSVM is open source software written by Chih-Chung Chang and Chih-Jen Lin at National Taiwan University.

The software can be used to implement the SVM Algorithm:

http://en.wikipedia.org/wiki/Support_vector_machine

I intend to make this tech-tip interesting by showing how I fed stock market data to an instance of LibSVM.

I started my study of LibSVM by noting the LibSVM website:

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

I book-marked the download URL:

http://www.csie.ntu.edu.tw/~cjlin/libsvm/libsvm-3.17.tar.gz

I downloaded the software to the ~/libsvm directory on an Ubuntu 12.04 host:

dan@libsvm_hp ~/libsvm $
dan@libsvm_hp ~/libsvm $ wget http://www.csie.ntu.edu.tw/~cjlin/libsvm/libsvm-3.17.tar.gz
--2013-11-14 20:26:55--  http://www.csie.ntu.edu.tw/~cjlin/libsvm/libsvm-3.17.tar.gz
Resolving www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)... 140.112.30.28
Connecting to www.csie.ntu.edu.tw (www.csie.ntu.edu.tw)|140.112.30.28|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607892 (594K) [application/x-gzip]
Saving to: `libsvm-3.17.tar.gz'
100%[======================================>] 607,892      205K/s   in 2.9s
2013-11-14 20:26:58 (205 KB/s) - `libsvm-3.17.tar.gz' saved [607892/607892]
dan@libsvm_hp ~/libsvm $
dan@libsvm_hp ~/libsvm $
dan@libsvm_hp ~/libsvm $


I unpacked it in a tmp directory:

dan@libsvm_hp ~/libsvm $
dan@libsvm_hp ~/libsvm $ cd tmp
dan@libsvm_hp ~/libsvm/tmp $
dan@libsvm_hp ~/libsvm/tmp $ tar zxf ../libsvm-3.17.tar.gz
dan@libsvm_hp ~/libsvm/tmp $
dan@libsvm_hp ~/libsvm/tmp $


I looked at it:

dan@libsvm_hp ~/libsvm/tmp $
dan@libsvm_hp ~/libsvm/tmp $ ll
total 12
drwxrwxr-x 3 dan dan 4096 Nov 14 20:28 ./
drwxrwxr-x 8 dan dan 4096 Nov 14 20:26 ../
drwxr-xr-x 8 dan dan 4096 Mar 31  2013 libsvm-3.17/
dan@libsvm_hp ~/libsvm/tmp $
dan@libsvm_hp ~/libsvm/tmp $


I moved the unpacked software out of the tmp directory:

dan@libsvm_hp ~/libsvm/tmp $
dan@libsvm_hp ~/libsvm/tmp $ mv libsvm-3.17/ ..
dan@libsvm_hp ~/libsvm/tmp $


I took a closer look at the software:

dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ ll
total 276
drwxr-xr-x 8 dan dan  4096 Mar 31  2013 ./
drwxrwxr-x 9 dan dan  4096 Nov 14 20:30 ../
-rw-r--r-- 1 dan dan  1497 Mar 31  2013 COPYRIGHT
-rw-r--r-- 1 dan dan 73850 Mar 31  2013 FAQ.html
-rw-r--r-- 1 dan dan 27670 Mar 31  2013 heart_scale
drwxr-xr-x 3 dan dan  4096 Mar 31  2013 java/
-rw-r--r-- 1 dan dan   732 Mar 31  2013 Makefile
-rw-r--r-- 1 dan dan  1087 Mar 31  2013 Makefile.win
drwxr-xr-x 2 dan dan  4096 Mar 31  2013 matlab/
drwxr-xr-x 2 dan dan  4096 Mar 31  2013 python/
-rw-r--r-- 1 dan dan 28271 Mar 31  2013 README
-rw-r--r-- 1 dan dan 64575 Mar 31  2013 svm.cpp
-rw-r--r-- 1 dan dan   477 Mar 31  2013 svm.def
-rw-r--r-- 1 dan dan  3382 Mar 31  2013 svm.h
-rw-r--r-- 1 dan dan  5536 Mar 31  2013 svm-predict.c
-rw-r--r-- 1 dan dan  7891 Mar 31  2013 svm-scale.c
drwxr-xr-x 5 dan dan  4096 Mar 31  2013 svm-toy/
-rw-r--r-- 1 dan dan  8978 Mar 31  2013 svm-train.c
drwxr-xr-x 2 dan dan  4096 Mar 31  2013 tools/
drwxr-xr-x 2 dan dan  4096 Mar 31  2013 windows/
dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $


The README file gave one line of terse installation instructions:

"On Unix systems, type make to build the svm-train and svm-predict programs."

I looked for more information on Google:

https://www.google.com/search?q=How+I+install+libsvm+on+ubuntu

This looked useful:

http://stackoverflow.com/questions/10975836/how-do-i-run-libsvm-on-linux

It suggested I install the 'build-essential' Ubuntu package.

I installed it with this shell command:

sudo apt-get install build-essential

I then used the make shell command to compile the LibSVM executables:

dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ make
g++ -Wall -Wconversion -O3 -fPIC -c svm.cpp
svm.cpp: In function 'svm_model* svm_load_model(const char*)':
svm.cpp:2778:24: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2782:25: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2807:25: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2831:33: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2833:33: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2835:33: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2837:36: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2839:29: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2845:36: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2852:37: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2859:38: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2866:38: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm.cpp:2873:35: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
g++ -Wall -Wconversion -O3 -fPIC svm-train.c svm.o -o svm-train -lm
g++ -Wall -Wconversion -O3 -fPIC svm-predict.c svm.o -o svm-predict -lm
g++ -Wall -Wconversion -O3 -fPIC svm-scale.c -o svm-scale
svm-scale.c: In function 'int main(int, char**)':
svm-scale.c:221:55: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm-scale.c:222:51: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
svm-scale.c:230:51: warning: ignoring return value of 'int fscanf(FILE*, const char*, ...)'
dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $


I looked for the new executables:

dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ ll svm-*
-rwxrwxr-x 1 dan dan 74020 Nov 14 20:39 svm-predict*
-rw-r--r-- 1 dan dan  5536 Mar 31  2013 svm-predict.c
-rwxrwxr-x 1 dan dan 18362 Nov 14 20:39 svm-scale*
-rw-r--r-- 1 dan dan  7891 Mar 31  2013 svm-scale.c
-rwxrwxr-x 1 dan dan 78407 Nov 14 20:39 svm-train*
-rw-r--r-- 1 dan dan  8978 Mar 31  2013 svm-train.c
dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $
dan@libsvm_hp ~/libsvm/libsvm-3.17 $


I verified they would talk to me:


dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ ./svm-predict
Usage: svm-predict [options] test_file model_file output_file
options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); for one-class SVM only 0 is supported
-q : quiet mode (no outputs)
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 



dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ ./svm-scale
Usage: svm-scale [options] data_filename
options:
-l lower : x scaling lower limit (default -1)
-u upper : x scaling upper limit (default +1)
-y y_lower y_upper : y scaling limits (default: no y scaling)
-s save_filename : save scaling parameters to save_filename
-r restore_filename : restore scaling parameters from restore_filename
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 



dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ ./svm-train
Usage: svm-train [options] training_set_file [model_file]
options:
-s svm_type : set type of SVM (default 0)
	0 -- C-SVC		(multi-class classification)
	1 -- nu-SVC		(multi-class classification)
	2 -- one-class SVM
	3 -- epsilon-SVR	(regression)
	4 -- nu-SVR		(regression)
-t kernel_type : set type of kernel function (default 2)
	0 -- linear: u'*v
	1 -- polynomial: (gamma*u'*v + coef0)^degree
	2 -- radial basis function: exp(-gamma*|u-v|^2)
	3 -- sigmoid: tanh(gamma*u'*v + coef0)
	4 -- precomputed kernel (kernel values in training_set_file)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-v n: n-fold cross validation mode
-q : quiet mode (no outputs)
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 
dan@libsvm_hp ~/libsvm/libsvm-3.17 $ 


Next, I worked towards the goal of pulling interesting stock market data from Yahoo.com.

I am interested in an ETF named SPY.

This ETF is like a stock; I can buy and sell it from my brokerage account.

SPY is interesting because its price closely mirrors the S&P 500 stock market index:

http://www.google.com/finance?q=SPY

http://finance.yahoo.com/q?s=SPY

Here is a screen dump of me getting daily closing prices of SPY:


dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ pwd
/tmp
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ wget http://ichart.finance.yahoo.com/table.csv?s=SPY --output-document=temp.csv
--2013-11-14 21:54:47--  http://ichart.finance.yahoo.com/table.csv?s=SPY
Resolving ichart.finance.yahoo.com (ichart.finance.yahoo.com)... 67.195.146.181
Connecting to ichart.finance.yahoo.com (ichart.finance.yahoo.com)|67.195.146.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: `temp.csv'

    [     <=>                               ] 276,737      312K/s   in 0.9s    

2013-11-14 21:54:48 (312 KB/s) - `temp.csv' saved [276737]

dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ head -3 temp.csv
Date,Open,High,Low,Close,Volume,Adj Close
2013-11-13,176.09,178.43,176.09,178.38,102909700,178.38
2013-11-12,176.94,177.36,176.37,176.96,83793500,176.96
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ grep -v Date temp.csv > SPY_prices.csv
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ head -3 SPY_prices.csv
2013-11-13,176.09,178.43,176.09,178.38,102909700,178.38
2013-11-12,176.94,177.36,176.37,176.96,83793500,176.96
2013-11-11,177.12,177.53,176.91,177.32,66267300,177.32
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ tail -3 SPY_prices.csv
1993-02-02,44.22,44.38,44.13,44.34,201300,30.22
1993-02-01,43.97,44.25,43.97,44.25,480500,30.15
1993-01-29,43.97,43.97,43.75,43.94,1003200,29.94
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 


Then, I loaded the CSV data into a Postgres database table.

If you want to use Postgres, I have a demo of installing it at the URL listed below:

http://bikle.com/techtips/two#postgres924

Here is the syntax I used to create the table:


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ cat ~/libsvm/pg/cr_tkr_date_prices.sql
--
-- ~/libsvm/pg/cr_tkr_date_prices.sql
--

-- I use this script to create a table in Postgres named tkr_date_prices,
-- if it does not already exist.

DROP   TABLE IF EXISTS     tkr_date_prices;
CREATE TABLE IF NOT EXISTS tkr_date_prices
(
tkr       VARCHAR(7)
,ydate    DATE
,opn      DECIMAL
,hhigh    DECIMAL
,llow     DECIMAL
,closing_price DECIMAL
,volume   DECIMAL
,adjclose DECIMAL
)
;
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


Here is a screen capture of me running the above script:

dan@libsvm_hp ~/libsvm/pg $
dan@libsvm_hp ~/libsvm/pg $
dan@libsvm_hp ~/libsvm/pg $ psql -P pager=no -f ~/libsvm/pg/cr_tkr_date_prices.sql
DROP TABLE
CREATE TABLE
dan@libsvm_hp ~/libsvm/pg $
dan@libsvm_hp ~/libsvm/pg $
dan@libsvm_hp ~/libsvm/pg $


Next, I copied data from the CSV file into the table:


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ ll /tmp/SPY_prices.csv 
-rw-rw-r-- 1 dan dan 276695 Nov 14 21:55 /tmp/SPY_prices.csv
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ cp /tmp/SPY_prices.csv /tmp/tmp.csv
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ cat ~/libsvm/pg/copy_spy2table.sql
--
-- ~/libsvm/pg/copy_spy2table.sql
--

-- I use this script to copy data from a CSV file into a 
-- Postgres table, tkr_date_prices.

-- Empty the table; I want no duplicates:
TRUNCATE TABLE tkr_date_prices;

-- The Postgres server process does the work.
-- It is running relative to another directory.
-- I need to give it the full path to the CSV file.
COPY tkr_date_prices (
ydate     
,opn      
,hhigh    
,llow     
,closing_price
,volume   
,adjclose
) FROM '/tmp/tmp.csv' WITH csv
;

-- Someday, tkr_date_prices may contain prices from
-- other ETFs like QQQ or GLD:
UPDATE tkr_date_prices
SET tkr = 'SPY'
;

-- rpt
SELECT COUNT(tkr)
FROM tkr_date_prices
WHERE tkr = 'SPY'
;

dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 




dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ psql -P pager=no -f ~/libsvm/pg/copy_spy2table.sql
TRUNCATE TABLE
COPY 5238
UPDATE 5238
 count 
-------
  5238
(1 row)

dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


At this point I had 5238 rows of SPY prices, from 1993 up to the date of the download, loaded into a Postgres table named tkr_date_prices.

Next, I built a SQL script to generate some vector data for LibSVM. The idea behind the vectors is to leverage the predictive nature of the RSI (Relative Strength Index) technical indicator.

Some URLs which have RSI discussion are listed below:

https://www.google.com/search?q=How+I+use+RSI+to+predict+stocks

http://en.wikipedia.org/wiki/Relative_strength_index

http://www.tradingmarkets.com/recent/why_rsi_may_be_one_of_the_best_short-term_indicators_-673367.html
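
Before walking through the SQL, here is a rough Python sketch of the arithmetic the script performs, using hypothetical closing prices. It mirrors the simple-moving-average style RSI built in the SQL (ignoring the small fudge constants the script adds) and the ng2 label, the normalized two-day forward gain:

# rsi_ng2_sketch.py
# Rough Python analog of table2rsi_vector.sql: RSI from simple moving averages
# of daily gains and losses, plus the ng2 label. Prices below are made up.

closes = [44.3, 44.1, 44.6, 45.0, 44.8, 45.2, 45.9, 45.7, 46.1, 46.0,
          46.4, 46.2, 46.6, 47.0, 46.8]

def rsi(closes, n):
    # RSI over (up to) the last n daily changes, using plain averages.
    gains, losses = [], []
    for prev, cur in zip(closes, closes[1:]):
        d = cur - prev
        gains.append(max(d, 0.0))
        losses.append(max(-d, 0.0))
    window_gain = gains[-n:]
    window_loss = losses[-n:]
    avg_gain = sum(window_gain) / len(window_gain)
    avg_loss = sum(window_loss) / len(window_loss)
    if avg_loss == 0.0:
        return 100.0
    return 100.0 * (1.0 - 1.0 / (1.0 + avg_gain / avg_loss))

def ng(closes, i, k):
    # Normalized k-day forward gain, like ng1..ng5 in the SQL.
    return (closes[i + k] - closes[i]) / closes[i]

i = 9                                    # a row with at least 2 more days after it
features = [rsi(closes[:i + 1], 5), rsi(closes[:i + 1], 9), rsi(closes[:i + 1], 14)]
label = 1 if ng(closes, i, 2) > 0 else -1    # yval fed to LibSVM
print(label, [round(f, 1) for f in features])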



dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ cat table2rsi_vector.sql 
--
-- ~/libsvm/pg/table2rsi_vector.sql
--

-- I use this script to create vector-element values for LibSVM.
-- This script is only used for SPY.

DROP   TABLE IF EXISTS t2v10;
CREATE TABLE t2v10 AS
SELECT
tkr
,ydate
,closing_price
,closing_price cp
FROM tkr_date_prices
WHERE tkr = 'SPY'
;

-- rpt
SELECT COUNT(tkr) FROM t2v10;


DROP   TABLE IF EXISTS t2v12;
CREATE TABLE t2v12 AS
SELECT
tkr
,ydate
,closing_price
,cp
-- copy the previous row price to this row
,LAG(cp,1,NULL)OVER(PARTITION BY tkr ORDER BY ydate)  price_before
-- Capture normalized gains for 1,2,3,4,5 day holding periods:
,(LEAD(cp,1,NULL) OVER (PARTITION BY tkr ORDER BY ydate) - cp)/cp ng1
,(LEAD(cp,2,NULL) OVER (PARTITION BY tkr ORDER BY ydate) - cp)/cp ng2
,(LEAD(cp,3,NULL) OVER (PARTITION BY tkr ORDER BY ydate) - cp)/cp ng3
,(LEAD(cp,4,NULL) OVER (PARTITION BY tkr ORDER BY ydate) - cp)/cp ng4
,(LEAD(cp,5,NULL) OVER (PARTITION BY tkr ORDER BY ydate) - cp)/cp ng5
FROM t2v10
ORDER BY ydate
;

-- rpt
SELECT
ydate
,cp
,ng1
,ng3
FROM t2v12
WHERE 10 + ydate > (SELECT MAX(ydate) FROM t2v12) 
;

DROP   TABLE IF EXISTS t2v14;
CREATE TABLE t2v14 AS
SELECT
tkr
,ydate
,closing_price
,cp
,price_before
,ng1
,ng2
,ng3
,ng4
,ng5
,(closing_price - price_before) d01
FROM t2v12
ORDER BY tkr,ydate
;

DROP   TABLE IF EXISTS t2v16;
CREATE TABLE t2v16 AS
SELECT
tkr
,ydate
,closing_price
,cp
,ng1
,ng2
,ng3
,ng4
,ng5
,CASE WHEN d01 < 0 THEN -d01 ELSE 0 END dayloss
,CASE WHEN d01 > 0 THEN d01 ELSE 0 END  daygain
FROM t2v14
ORDER BY tkr,ydate
;

-- rpt
SELECT
ydate
,cp
,dayloss
,daygain
FROM t2v16
WHERE 10 + ydate > (SELECT MAX(ydate) FROM t2v12) 
ORDER BY tkr,ydate
;

SELECT
COUNT(dayloss)
,AVG(dayloss)
,STDDEV(dayloss)
,MAX(dayloss)
FROM t2v16
;

SELECT
COUNT(daygain)
,AVG(daygain)
,STDDEV(daygain)
,MAX(daygain)
FROM t2v16
;

DROP   TABLE IF EXISTS t2v18;
CREATE TABLE t2v18 AS
SELECT
tkr
,ydate
,closing_price
,cp
,ng1
,ng2
,ng3
,ng4
,ng5
,AVG(dayloss)OVER(PARTITION BY tkr ORDER BY ydate ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)mv_avg_loss5
,AVG(daygain)OVER(PARTITION BY tkr ORDER BY ydate ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)mv_avg_gain5
,AVG(dayloss)OVER(PARTITION BY tkr ORDER BY ydate ROWS BETWEEN 8 PRECEDING AND CURRENT ROW)mv_avg_loss9
,AVG(daygain)OVER(PARTITION BY tkr ORDER BY ydate ROWS BETWEEN 8 PRECEDING AND CURRENT ROW)mv_avg_gain9
,AVG(dayloss)OVER(PARTITION BY tkr ORDER BY ydate ROWS BETWEEN 13 PRECEDING AND CURRENT ROW)mv_avg_loss14
,AVG(daygain)OVER(PARTITION BY tkr ORDER BY ydate ROWS BETWEEN 13 PRECEDING AND CURRENT ROW)mv_avg_gain14
FROM t2v16
ORDER BY tkr,ydate
;

-- rpt
SELECT
COUNT(mv_avg_loss9)
,AVG(mv_avg_loss9)
,STDDEV(mv_avg_loss9)
,MAX(mv_avg_loss9)
FROM t2v18
;

SELECT
COUNT(mv_avg_gain9)
,AVG(mv_avg_gain9)
,STDDEV(mv_avg_gain9)
,MAX(mv_avg_gain9)
FROM t2v18
;


DROP   TABLE IF EXISTS t2v20;
CREATE TABLE t2v20 AS
SELECT
tkr
,ydate
,closing_price
,cp
,ng1
,ng2
,ng3
,ng4
,ng5
,CASE WHEN mv_avg_loss5=0 THEN 100.0
 ELSE 0.1 + 100.0 * ( 1 - 1/(1 + mv_avg_gain5/(mv_avg_loss5+0.0001)))END rsi5
,CASE WHEN mv_avg_loss9=0 THEN 100.0
 ELSE 0.1 + 100.0 * ( 1 - 1/(1 + mv_avg_gain9/(mv_avg_loss9+0.0001)))END rsi9
,CASE WHEN mv_avg_loss14=0 THEN 100.0
 ELSE 0.1 + 100.0 * ( 1 - 1/(1 + mv_avg_gain14/(mv_avg_loss14+0.0001)))END rsi14
-- 200 day moving avg is useful:
,AVG(cp)OVER(PARTITION BY tkr ORDER BY ydate ROWS BETWEEN 200 PRECEDING AND 0 PRECEDING) ma200
FROM t2v18
ORDER BY tkr,ydate
;

-- rpt
SELECT
COUNT(rsi9)
,MIN(rsi9)
,AVG(rsi9)
,STDDEV(rsi9)
,MAX(rsi9)
FROM t2v20
;

SELECT
COUNT(rsi9)
,CORR(rsi9,ng2)
FROM t2v20
;

-- Display some data from a popular heuristic:
SELECT AVG(ng2) FROM t2v20 ;
SELECT AVG(ng2) FROM t2v20 WHERE rsi9 < 30;
SELECT AVG(ng2) FROM t2v20 WHERE rsi9 < 30 AND cp > ma200;
SELECT AVG(ng2) FROM t2v20 WHERE rsi9 > 80;
SELECT AVG(ng2) FROM t2v20 WHERE rsi9 > 80 AND cp < ma200;

DROP   TABLE IF EXISTS t2v22;
CREATE TABLE t2v22 AS
SELECT
tkr
,ydate
,closing_price
,cp
,ng1
,ng2
,ng3
,ng4
,ng5
,rsi5 / 100 rsi5r
,rsi9 / 100 rsi9r
,rsi14 / 100 rsi14r
,(cp - ma200) / cp cp_ma200
FROM t2v20
ORDER BY tkr,ydate
;

DROP   TABLE IF EXISTS t2v24;
CREATE TABLE t2v24 AS
SELECT
CASE WHEN ng2>0 THEN 1 ELSE -1 END yval
,rsi5r
,rsi9r
,rsi14r
,cp_ma200
FROM t2v22
WHERE ydate BETWEEN '1993-07-01' AND '2008-01-01'
ORDER BY tkr,ydate
;

-- Copy vectors to a CSV:

COPY t2v24 TO '/tmp/in_sample_ng2.csv' csv;

DROP   TABLE IF EXISTS t2v24;
CREATE TABLE t2v24 AS
SELECT
CASE WHEN ng2>0 THEN 1 ELSE -1 END yval
,rsi5r
,rsi9r
,rsi14r
,cp_ma200
FROM t2v22
WHERE ydate > '2008-01-01'
AND ng2 IS NOT NULL
ORDER BY tkr,ydate
;

-- Copy vectors to a CSV:

COPY t2v24 TO '/tmp/out_of_sample_ng2.csv' csv;

--
-- end of script
--


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ psql -P pager=no -f table2rsi_vector.sql
DROP TABLE
SELECT 5238
 count 
-------
  5238
(1 row)

DROP TABLE
SELECT 5238
   ydate    |   cp   |           ng1           |           ng3           
------------+--------+-------------------------+-------------------------
 2013-11-04 | 176.83 | -0.00316688344737883843 | -0.01074478312503534468
 2013-11-05 | 176.27 |  0.00510580359675497816 |  0.00578657740965564191
 2013-11-06 | 177.17 | -0.01264322402212564204 |  0.00084664446576734210
 2013-11-07 | 174.93 |  0.01349111073000628823 |  0.01160464185674269708
 2013-11-08 | 177.29 |  0.00016921428168537425 |  0.00614811890123526426
 2013-11-11 | 177.32 | -0.00203022783667944958 |                        
 2013-11-12 | 176.96 |  0.00802441229656419530 |                        
 2013-11-13 | 178.38 |                         |                        
(8 rows)

DROP TABLE
SELECT 5238
DROP TABLE
SELECT 5238
   ydate    |   cp   | dayloss | daygain 
------------+--------+---------+---------
 2013-11-04 | 176.83 |       0 |    0.62
 2013-11-05 | 176.27 |    0.56 |       0
 2013-11-06 | 177.17 |       0 |    0.90
 2013-11-07 | 174.93 |    2.24 |       0
 2013-11-08 | 177.29 |       0 |    2.36
 2013-11-11 | 177.32 |       0 |    0.03
 2013-11-12 | 176.96 |    0.36 |       0
 2013-11-13 | 178.38 |       0 |    1.42
(8 rows)

 count |          avg           |         stddev         | max  
-------+------------------------+------------------------+------
  5238 | 0.43659221076746849943 | 0.85295600425251373462 | 9.83
(1 row)

 count |          avg           |         stddev         |  max  
-------+------------------------+------------------------+-------
  5238 | 0.46225849560901107293 | 0.80984964103940004276 | 12.85
(1 row)

DROP TABLE
SELECT 5238
 count |          avg           |                   stddev                   |        max         
-------+------------------------+--------------------------------------------+--------------------
  5238 | 0.43632633472123735585 | 0.3977431839663206934343532905878737616456 | 3.8700000000000000
(1 row)

 count |          avg           |                   stddev                   |        max         
-------+------------------------+--------------------------------------------+--------------------
  5238 | 0.46179449020285217316 | 0.3150903772440734557435298103867424698816 | 2.5833333333333333
(1 row)

DROP TABLE
SELECT 5238
 count |           min           |           avg            |                    stddev                     |  max  
-------+-------------------------+--------------------------+-----------------------------------------------+-------
  5238 | 0.100000000000000000000 | 54.866962282074033121749 | 19.556781708631022948289436787315253649557792 | 100.0
(1 row)

 count |        corr         
-------+---------------------
  5238 | -0.0637442085899942
(1 row)

            avg             
----------------------------
 0.000671887777474778190173
(1 row)

            avg             
----------------------------
 0.003448639404403284933361
(1 row)

          avg           
------------------------
 0.00248941246854572014
(1 row)

             avg             
-----------------------------
 -0.000543632042663437650169
(1 row)

             avg             
-----------------------------
 -0.002875340096588139004705
(1 row)

DROP TABLE
SELECT 5238
DROP TABLE
SELECT 3653
COPY 3653
DROP TABLE
SELECT 1477
COPY 1477

dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


Then, I transformed the CSV files into a format suitable for LibSVM:


dan@libsvm_hp ~/libsvm $ 
dan@libsvm_hp ~/libsvm $ 
dan@libsvm_hp ~/libsvm $ cat ~/libsvm/pg/rsi_vectors2libsvm_format_ng2.bash
#!/bin/bash

# ~/libsvm/pg/rsi_vectors2libsvm_format_ng2.bash

# Note:
# ~/libsvm/phraug/csv2libsvm.py
# is available from github:
# https://github.com/zygmuntz/phraug

cd ~/libsvm/

python ~/libsvm/phraug/csv2libsvm.py /tmp/in_sample_ng2.csv     /tmp/in_sample_ng2_svm.txt
python ~/libsvm/phraug/csv2libsvm.py /tmp/out_of_sample_ng2.csv /tmp/out_of_sample_ng2_svm.txt

exit
dan@libsvm_hp ~/libsvm $ 
dan@libsvm_hp ~/libsvm $ 
dan@libsvm_hp ~/libsvm $ 

dan@libsvm_hp ~/libsvm $ 
dan@libsvm_hp ~/libsvm $ ~/libsvm/pg/rsi_vectors2libsvm_format_ng2.bash
dan@libsvm_hp ~/libsvm $ 
dan@libsvm_hp ~/libsvm $ cd /tmp/
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ ll in_sample_ng2.csv
-rw-r--r-- 1 postgres postgres 356856 Nov 15 01:35 in_sample_ng2.csv
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ ll in_sample_ng2_svm.txt 
-rw-rw-r-- 1 dan dan 386080 Nov 15 01:38 in_sample_ng2_svm.txt
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ ll out_of_sample_ng2*
-rw-r--r-- 1 postgres postgres 144386 Nov 15 01:35 out_of_sample_ng2.csv
-rw-rw-r-- 1 dan      dan      156202 Nov 15 01:38 out_of_sample_ng2_svm.txt
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 
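
For context, the LibSVM input format is just "label index:value index:value ..." per line, so a CSV row like 1,0.55,0.48,0.51,0.02 becomes 1 1:0.55 2:0.48 3:0.51 4:0.02. The real conversion is done by phraug's csv2libsvm.py; a minimal hand-rolled sketch (assuming the label sits in the first CSV column and there is no header row, as in my files) would look roughly like this:

# csv2libsvm_sketch.py
# Minimal sketch of the CSV -> LibSVM conversion. Assumes the label is in
# column 1 and there is no header row. The real work above is done by
# phraug's csv2libsvm.py.

import csv, sys

def convert(csv_path, svm_path):
    with open(csv_path) as src, open(svm_path, 'w') as dst:
        for row in csv.reader(src):
            label, features = row[0], row[1:]
            pairs = ['%d:%s' % (i + 1, v) for i, v in enumerate(features)]
            dst.write(label + ' ' + ' '.join(pairs) + '\n')

if __name__ == '__main__':
    convert(sys.argv[1], sys.argv[2])    # e.g. in_sample_ng2.csv in_sample_ng2_svm.txt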


I used grid.py to find good values for C and g:


dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ python ~/libsvm/libsvm317/tools/grid.py /tmp/in_sample_ng2_svm.txt
[local] 5 -7 54.0925 (best c=32.0, g=0.0078125, rate=54.0925)
[local] -1 -7 54.0925 (best c=0.5, g=0.0078125, rate=54.0925)
[local] 5 -1 54.0652 (best c=0.5, g=0.0078125, rate=54.0925)
[local] -1 -1 54.0925 (best c=0.5, g=0.0078125, rate=54.0925)
[local] 11 -7 54.0925 (best c=0.5, g=0.0078125, rate=54.0925)
[local] 11 -1 53.0523 (best c=0.5, g=0.0078125, rate=54.0925)
[local] 5 -13 54.0925 (best c=0.5, g=0.0078125, rate=54.0925)
[local] -1 -13 54.0925 (best c=0.5, g=0.0078125, rate=54.0925)
[local] 11 -13 54.0925 (best c=0.5, g=0.0078125, rate=54.0925)
[local] -3 -7 54.0925 (best c=0.125, g=0.0078125, rate=54.0925)
[local] -3 -1 54.0925 (best c=0.125, g=0.0078125, rate=54.0925)
[local] -3 -13 54.0925 (best c=0.125, g=0.0078125, rate=54.0925)
[local] 5 1 53.8188 (best c=0.125, g=0.0078125, rate=54.0925)
[local] -1 1 54.0925 (best c=0.125, g=0.0078125, rate=54.0925)
[local] 11 1 54.8864 (best c=2048.0, g=2.0, rate=54.8864)
[local] -3 1 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 -7 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 -1 53.2165 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 -13 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 1 54.6126 (best c=2048.0, g=2.0, rate=54.8864)
[local] 5 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -1 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 11 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -3 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 -7 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 -1 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 -13 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 1 53.5177 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 5 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -1 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 11 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -3 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 -7 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 -1 53.7367 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 -13 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 1 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 -5 54.0378 (best c=2048.0, g=2.0, rate=54.8864)
[local] 5 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -1 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 11 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -3 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 -7 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 -1 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 -13 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 1 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 5 3 54.3936 (best c=2048.0, g=2.0, rate=54.8864)
[local] -1 3 54.6126 (best c=2048.0, g=2.0, rate=54.8864)
[local] 11 3 53.5724 (best c=2048.0, g=2.0, rate=54.8864)
[local] -3 3 54.0652 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 3 53.6272 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 3 54.5305 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 3 54.3115 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 -7 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 -1 53.5998 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 -13 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 1 54.2568 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 3 53.8188 (best c=2048.0, g=2.0, rate=54.8864)
[local] 5 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -1 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 11 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -3 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 -7 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 -1 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 -13 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 1 53.8188 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 -5 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 3 54.4758 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 5 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -1 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 11 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] -3 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 9 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 3 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 15 -3 53.0797 (best c=2048.0, g=2.0, rate=54.8864)
[local] -5 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 7 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 1 -3 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 -7 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 -1 53.5998 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 -13 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 1 54.5853 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 -11 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 -5 54.1199 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 -15 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 3 54.5305 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 -9 54.0925 (best c=2048.0, g=2.0, rate=54.8864)
[local] 13 -3 53.545 (best c=2048.0, g=2.0, rate=54.8864)
2048.0 2.0 54.8864
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 


Note that grid.py gave me an interesting contour plot which shows the effectiveness of different combinations of C and g.

The plot told me the optimal values of C and g for the in-sample data: C = 2^11, which is 2048, and g = 2^1, which is 2.

Also, I could see that the best cross-validation accuracy of SVM on subsets of the in-sample data was about 54.9%.

I assumed that flipping a coin would give me predictions which are 50% accurate. So, it appeared that SVM was more predictive than flipping a coin (on subsets of the in-sample data).

And I noticed that the winning C and g values correspond to a green contour in the lower right corner of the plot.
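
Under the hood, grid.py just runs 5-fold cross validation for each (C, g) pair and remembers the best pair. If you build LibSVM's Python bindings (the python/ directory seen earlier), a rough sketch of that inner loop looks like the following; the file path and the exponent ranges here are examples, not necessarily exactly what grid.py uses:

# grid_sketch.py
# Sketch of the grid search grid.py performs: 5-fold cross validation over a
# grid of (C, gamma) values. Requires LibSVM's Python bindings on PYTHONPATH.

from svmutil import svm_read_problem, svm_train

y, x = svm_read_problem('/tmp/in_sample_ng2_svm.txt')

best = (None, None, 0.0)
for log2c in range(-5, 16, 2):               # C     = 2^-5 .. 2^15
    for log2g in range(-15, 4, 2):           # gamma = 2^-15 .. 2^3
        c, g = 2.0 ** log2c, 2.0 ** log2g
        # With -v, svm_train returns cross-validation accuracy, not a model.
        acc = svm_train(y, x, '-c %g -g %g -v 5 -q' % (c, g))
        if acc > best[2]:
            best = (c, g, acc)

print('best c=%g g=%g rate=%g' % best)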

Now that I had optimal values for C and g, I used the svm-train executable to learn from the in-sample data and build an SVM model:


dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ 
dan@libsvm_hp /tmp $ ~/libsvm/libsvm317/svm-train -b 1 -c 2048 -g 2.0 /tmp/in_sample_ng2_svm.txt
................................................................................................
optimization finished, #iter = 882348
nu = 0.894245
obj = -5290577.009233, rho = -4.382630
nSV = 2660, nBSV = 2571
Total nSV = 2660
.................................................................................................
optimization finished, #iter = 1008301
nu = 0.882616
obj = -5227110.649814, rho = -7.297140
nSV = 2627, nBSV = 2531
Total nSV = 2627
.................................................................................................
optimization finished, #iter = 834846
nu = 0.893346
obj = -5302859.367009, rho = -5.821379
nSV = 2659, nBSV = 2570
Total nSV = 2659
...................................................................................................
optimization finished, #iter = 1129112
nu = 0.884791
obj = -5235183.409801, rho = -9.764653
nSV = 2627, nBSV = 2535
Total nSV = 2627
....................................................................................................
optimization finished, #iter = 1055838
nu = 0.883575
obj = -5232717.985927, rho = -7.887029
nSV = 2626, nBSV = 2534
Total nSV = 2626
.....................................................................................................
optimization finished, #iter = 1231362
nu = 0.892937
obj = -6616892.218509, rho = -5.417116
nSV = 3310, nBSV = 3211
Total nSV = 3310
dan@libsvm_hp /tmp $ 


If you study the output from table2rsi_vector.sql above, you will notice that the in-sample data corresponds to SPY prices before 2008, and the out-of-sample data comes from prices after Jan 1, 2008.

With my SVM model trained, I used it to generate predictions for 1477 rows of out-of-sample data:

dan@libsvm_hp /tmp $
dan@libsvm_hp /tmp $   ~/libsvm/libsvm317/svm-predict -b 1 /tmp/out_of_sample_ng2_svm.txt /tmp/in_sample_ng2_svm.txt.model /tmp/out_of_sample_ng2_predictions.txt
Accuracy = 54.7055% (808/1477) (classification)
dan@libsvm_hp /tmp $
dan@libsvm_hp /tmp $
dan@libsvm_hp /tmp $


I noticed that the accuracy was almost identical to what SVM achieved while predicting subsets of the in-sample data.

So it appeared that SVM was almost five percentage points more accurate than flipping a coin.

When I saw this result I asked, "Is this good enough to get rich?"


I decided to load the predictions into Postgres and do some digging.


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ head -3 /tmp/out_of_sample_ng2_predictions.txt
labels 1 -1
1 0.54394 0.45606
1 0.55949 0.44051
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 



dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ /usr/bin/R -f number_ng2_predictions.r

R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> #
> # number_ng2_predictions.r
> #
> 
> # Demo:
> # /usr/bin/r -f number_ng2_predictions.r
> 
> predictions = read.csv('/tmp/out_of_sample_ng2_predictions.txt')
> write.table(predictions, file='/tmp/out_of_sample_ng2_predictions.csv', sep=',',col.names=FALSE)
> 
> # end
> 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ head -3 /tmp/out_of_sample_ng2_predictions.csv
"1","1 0.54394 0.45606"
"2","1 0.55949 0.44051"
"3","1 0.560521 0.439479"
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 

dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ sed -i '1,$s/"//g' /tmp/out_of_sample_ng2_predictions.csv 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ head -3 /tmp/out_of_sample_ng2_predictions.csv
1,1 0.54394 0.45606
2,1 0.55949 0.44051
3,1 0.560521 0.439479
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 



dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ sed -i '1,$s/ /,/g' /tmp/out_of_sample_ng2_predictions.csv 

dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ head -3 /tmp/out_of_sample_ng2_predictions.csv
1,1,0.54394,0.45606
2,1,0.55949,0.44051
3,1,0.560521,0.439479
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
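
As an aside, the R script plus the two sed commands above could be replaced by a short Python sketch that does the same job: skip the "labels ..." header line, number the remaining prediction rows, and write comma separated output for COPY. The paths here are the same ones used above:

# number_predictions_sketch.py
# Sketch of an alternative to the R + sed steps: number each prediction row
# and emit CSV suitable for the Postgres COPY command.

src = '/tmp/out_of_sample_ng2_predictions.txt'
dst = '/tmp/out_of_sample_ng2_predictions.csv'

with open(src) as f_in, open(dst, 'w') as f_out:
    rows = [line.split() for line in f_in if not line.startswith('labels')]
    for num, fields in enumerate(rows, start=1):
        f_out.write(','.join([str(num)] + fields) + '\n')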


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ cat ~/libsvm/pg/cr_predictions.sql
--
-- ~/libsvm/pg/cr_predictions.sql
--

-- I use this script to fill a table from a CSV file full of
-- numbered predictions.

DROP TABLE IF EXISTS predictions;
CREATE TABLE predictions (
pnum        INTEGER
,yval       INTEGER
,probup     DECIMAL
,probdown   DECIMAL
)
;

COPY predictions (
pnum
,yval
,probup
,probdown
) FROM '/tmp/out_of_sample_ng2_predictions.csv' WITH csv
;

-- rpt
SELECT 
yval
,COUNT(yval)
,MIN(pnum)
,MAX(pnum)
,MIN(probup)
,MAX(probup)
,MIN(probdown)
,MAX(probdown)
FROM predictions
GROUP BY yval ORDER BY yval 
;
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ psql -P pager=no -f ~/libsvm/pg/cr_predictions.sql
DROP TABLE
CREATE TABLE
COPY 1477
 yval | count | min | max  |   min    |   max    |    min    |   max    
------+-------+-----+------+----------+----------+-----------+----------
   -1 |    37 |  20 | 1429 | 0.317438 | 0.494547 |  0.505453 | 0.682562
    1 |  1440 |   1 | 1477 |      0.5 | 0.966614 | 0.0333862 |      0.5
(2 rows)

dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


To generate the reports I wanted I joined the prediction data with the out of sample data:


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ cat ~/libsvm/pg/qry_predictions_j_oos.sql
--
-- ~/libsvm/pg/qry_predictions_j_oos.sql
--

-- I use this script to join predictions with out of sample data
-- and then run interesting reports off the joined data.

-- The most convenient place to get the out of sample data
-- is from a table named t2v22 which was created by the
-- script: 
-- ~/libsvm/pg/table2rsi_vector.sql

DROP   TABLE IF EXISTS oosd10;
CREATE TABLE oosd10 AS
SELECT
ydate
,ng2
-- I need to number the rows so I can then join with predictions table
-- which also has numbered rows:
,row_number() OVER (PARTITION BY tkr ORDER BY ydate) onum
FROM t2v22
WHERE ydate > '2008-01-01'
AND ng2 IS NOT NULL
ORDER BY tkr,ydate
;

-- rpt
SELECT
MIN(onum)
,MAX(onum)
,COUNT(onum)
FROM oosd10
;

SELECT
MIN(pnum)
,MAX(pnum)
,COUNT(pnum)
FROM predictions
;

DROP   TABLE IF EXISTS oosd12;
CREATE TABLE oosd12 AS
SELECT
ydate
,ng2
,yval
,probup
,probdown
FROM predictions p, oosd10 o
WHERE pnum = onum
ORDER BY ydate
;

-- Now I can query values for the confusion matrix.

-- Count true-positives:
SELECT
COUNT(ydate) true_positives
FROM oosd12
WHERE yval = 1 AND ng2 > 0
;

-- Count true-negatives:
SELECT
COUNT(ydate) true_negatives
FROM oosd12
WHERE yval = -1 AND ng2 <= 0
;

-- Count false-positives:
SELECT
COUNT(ydate) false_positives
FROM oosd12
WHERE yval = 1 AND ng2 < 0
;

-- Count false-negatives:
SELECT
COUNT(ydate) false_negatives
FROM oosd12
WHERE yval = -1 AND ng2 > 0
;

-- Calculate the Accuracy:

SELECT
100 * SUM(true_positives+true_negatives)/SUM(true_positives+true_negatives+false_positives+false_negatives) accuracy
FROM
(
SELECT COUNT(ydate) true_positives,0 true_negatives,0 false_positives,0 false_negatives FROM oosd12 WHERE yval=1 AND ng2>0
UNION
SELECT 0 true_positives,COUNT(ydate) true_negatives,0 false_positives,0 false_negatives FROM oosd12 WHERE yval=-1 AND ng2<0
UNION
SELECT 0 true_positives, 0 true_negatives,COUNT(ydate) false_positives,0 false_negatives FROM oosd12 WHERE yval=1 AND ng2<0
UNION
SELECT 0 true_positives, 0 true_negatives,0 false_positives,COUNT(ydate) false_negatives FROM oosd12 WHERE yval=-1 AND ng2>0
) subq
;


-- Look at the distribution of prediction probabilities:

SELECT
yval
,COUNT(yval)
,MIN(probdown)
,AVG(probdown)
,MAX(probdown)
,MIN(probup)
,AVG(probup)
,MAX(probup)
FROM predictions
GROUP BY yval ORDER BY yval
;

-- Look for corr-tween probdown/probup and ng2
SELECT
-- I want to see neg-corr here:
CORR(probdown,ng2) corr_tween_probdown_ng2
-- I want to see positive-corr here:
,CORR(probup,ng2) corr_tween_probup_ng2
FROM oosd12
;

-- Look at avg ng2:
SELECT AVG(ng2) avg_ng2 FROM oosd12;
SELECT AVG(ng2) avg_ng2m1 FROM oosd12 WHERE yval = -1;

SELECT AVG(ng2) avg_ng2m1gt_avg
FROM oosd12
WHERE yval = -1
AND probdown >
  (SELECT AVG(probdown) FROM oosd12 WHERE yval = -1)
;


SELECT AVG(ng2) avg_ng2 FROM oosd12;
SELECT AVG(ng2) avg_ng2p1 FROM oosd12 WHERE yval = 1;

SELECT AVG(ng2) avg_ng2p1gt_avg
FROM oosd12
WHERE yval = 1
AND probup >
  (SELECT AVG(probup) FROM oosd12 WHERE yval = 1)
;

-- end

dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ psql -P pager=no -f ~/libsvm/pg/qry_predictions_j_oos.sql
DROP TABLE
SELECT 1477
 min | max  | count 
-----+------+-------
   1 | 1477 |  1477
(1 row)

 min | max  | count 
-----+------+-------
   1 | 1477 |  1477
(1 row)

DROP TABLE
SELECT 1477
 true_positives 
----------------
            796
(1 row)

 true_negatives 
----------------
             12
(1 row)

 false_positives 
-----------------
             642
(1 row)

 false_negatives 
-----------------
              25
(1 row)

      accuracy       
---------------------
 54.7489823609226594
(1 row)

 yval | count |    min    |          avg           |   max    |   min    |          avg           |   max    
------+-------+-----------+------------------------+----------+----------+------------------------+----------
   -1 |    37 |  0.505453 | 0.52261708108108108108 | 0.682562 | 0.317438 | 0.47738291891891891892 | 0.494547
    1 |  1440 | 0.0333862 | 0.44372265937500000000 |      0.5 |      0.5 | 0.55627734027777777778 | 0.966614
(2 rows)

 corr_tween_probdown_ng2 | corr_tween_probup_ng2 
-------------------------+-----------------------
      -0.116583632519959 |     0.116583619903836
(1 row)

          avg_ng2           
----------------------------
 0.000493447518968012233578
(1 row)

         avg_ng2m1          
----------------------------
 0.003437072559516228248204
(1 row)

     avg_ng2m1gt_avg     
-------------------------
 -0.00625108855798108744
(1 row)

          avg_ng2           
----------------------------
 0.000493447518968012233578
(1 row)

         avg_ng2p1          
----------------------------
 0.000417812708898370572091
(1 row)

       avg_ng2p1gt_avg       
-----------------------------
 0.001592327425966886587819
(1 row)

dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 
dan@libsvm_hp ~/libsvm/pg $ 


Results Discussion

I start the discussion by directing your attention to the query displayed below:


SELECT 
yval
,COUNT(yval)
,MIN(pnum)
,MAX(pnum)
,MIN(probup)
,MAX(probup)
,MIN(probdown)
,MAX(probdown)
FROM predictions
GROUP BY yval ORDER BY yval 
;

 yval | count | min | max  |   min    |   max    |    min    |   max    
------+-------+-----+------+----------+----------+-----------+----------
   -1 |    37 |  20 | 1429 | 0.317438 | 0.494547 |  0.505453 | 0.682562
    1 |  1440 |   1 | 1477 |      0.5 | 0.966614 | 0.0333862 |      0.5
(2 rows)


It's obvious that SVM had a strong tendency to predict that the two-day gain would be positive. I'm not sure why that is. If I were to use the model which generated these predictions, I would raise the confidence threshold required before acting on a negative prediction.

The queries I wrote to fill the confusion matrix drive this point home:


SELECT
COUNT(ydate) true_positives
FROM oosd12
WHERE yval = 1 AND ng2 > 0
;

 true_positives 
----------------
            796

SELECT
COUNT(ydate) true_negatives
FROM oosd12
WHERE yval = -1 AND ng2 <= 0
;

 true_negatives 
----------------
             12

SELECT
COUNT(ydate) false_positives
FROM oosd12
WHERE yval = 1 AND ng2 < 0
;
  
 false_positives 
-----------------
             642

SELECT
COUNT(ydate) false_negatives
FROM oosd12
WHERE yval = -1 AND ng2 > 0
;
 false_negatives 
-----------------
                25
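
Those four counts are enough to recompute the headline accuracy, and they also expose precision and recall, which the single accuracy number hides. Here is a small sketch using the counts above (the tiny difference from the 54.7489% reported by the big UNION query appears to come from whether days with ng2 = 0 are counted as true negatives):

# confusion_matrix_sketch.py
# Derive accuracy, precision and recall for the "up" class from the four
# counts produced by the queries above.

tp, tn, fp, fn = 796, 12, 642, 25        # counts from oosd12

accuracy  = (tp + tn) / float(tp + tn + fp + fn)
precision = tp / float(tp + fp)          # how often an up-prediction was right
recall    = tp / float(tp + fn)          # how many up-days were caught

print('accuracy  %.4f' % accuracy)       # ~0.548
print('precision %.4f' % precision)      # ~0.554
print('recall    %.4f' % recall)         # ~0.970, SVM almost always says "up"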


Next, I called a convenient Postgres function named CORR():

http://www.postgresql.org/docs/9.2/static/functions-aggregate.html


-- Look for corr-tween probdown/probup and ng2
SELECT
-- I want to see neg-corr here:
CORR(probdown,ng2) corr_tween_probdown_ng2
-- I want to see positive-corr here:
,CORR(probup,ng2) corr_tween_probup_ng2
FROM oosd12
;


corr_tween_probdown_ng2 | corr_tween_probup_ng2 
------------------------+-----------------------
     -0.116583632519959 |     0.116583619903836


Above, the query calculates the correlation coefficient between the SVM prediction probabilities and the two-day gain. I noticed a negative correlation between probdown values and ng2, and a positive correlation between probup values and ng2. The correlations were weak, but they signified that SVM was exhibiting some predictive power with this data.
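
If you want to sanity check CORR() outside the database, the same Pearson correlation is easy to compute by hand. The pairs in this sketch are made up; the real query runs over the 1477 rows of oosd12:

# corr_sketch.py
# Pearson correlation computed by hand; the same quantity Postgres CORR() returns.

import math

pairs = [(0.54, 0.004), (0.56, -0.002), (0.51, 0.001),
         (0.62, 0.009), (0.49, -0.006), (0.55, 0.003)]   # (probup, ng2), made up

xs = [p for p, _ in pairs]
ys = [g for _, g in pairs]
n  = len(pairs)
mx = sum(xs) / n
my = sum(ys) / n

cov = sum((x - mx) * (y - my) for x, y in pairs) / n
sx  = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy  = math.sqrt(sum((y - my) ** 2 for y in ys) / n)

print('corr = %.4f' % (cov / (sx * sy)))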

Next, I compared AVG(ng2), the average of the normalized two-day gain, to AVG(ng2) where the joined prediction was -1.


SELECT AVG(ng2) avg_ng2 FROM oosd12;

          avg_ng2           
----------------------------
 0.000493447518968012233578


SELECT AVG(ng2) avg_ng2m1 FROM oosd12 WHERE yval = -1;

         avg_ng2m1          
----------------------------
 0.003437072559516228248204

SELECT AVG(ng2) avg_ng2m1gt_avg
FROM oosd12
WHERE yval = -1
AND probdown >
  (SELECT AVG(probdown) FROM oosd12 WHERE yval = -1)
;

     avg_ng2m1gt_avg     
-------------------------
 -0.00625108855798108744


In the first query, I noticed that AVG(ng2) was near zero: 0.000493, which, in percentage terms, is about 0.05%.

From my perspective, a gain of 0.05% is equivalent to a gain of 5 cents on a $100 stock.

That might not seem like much, but it can add up over a period of years.

Also, this gain of 0.05% is a yardstick by which mutual funds and money managers are measured; most mutual funds and money managers return less than SPY.

In the second query, I noticed that AVG(ng2) WHERE yval = -1 was 0.00343, which is about 7 times higher than AVG(ng2).

This is BAD news. It is evidence of SVM issuing severe false-negative predictions. Fortunately, it issued only a small number of negative predictions: of all 1477 predictions, only 37 were negative.

In the third query, I simulate how I, the data scientist, would react to false-negative predictions: I would only follow a negative prediction if the corresponding value of probdown was greater than average.

The result was -0.00625.

I see that as a drop of 62.5 cents on a $100 stock. If I acted on those more confident negative predictions, that move is roughly 12.5 times the size of the average ng2 (62.5 cents versus 5 cents on a $100 stock), so I see it as SVM offering some predictive power with this data.
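
The multiples quoted above are easy to check from the query results; a couple of divisions is all it takes:

# ratio_check.py
# Quick check of the multiples quoted above, using the averages returned
# by the queries.

avg_ng2       =  0.000493447518968012    # all out-of-sample rows
avg_ng2_m1    =  0.003437072559516228    # rows where SVM predicted -1
avg_ng2_m1_gt = -0.00625108855798109     # -1 rows with above-average probdown

print(avg_ng2_m1 / avg_ng2)              # ~7.0  : the false-negative problem
print(abs(avg_ng2_m1_gt) / avg_ng2)      # ~12.7 : move after confident -1 calls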

Next, I compared AVG(ng2) to AVG(ng2) where the joined prediction was +1.


SELECT AVG(ng2) avg_ng2p1 FROM oosd12 WHERE yval = 1;
  
         avg_ng2p1          
----------------------------
 0.000417812708898370572091

SELECT AVG(ng2) avg_ng2p1gt_avg
FROM oosd12
WHERE yval = 1
AND probup >
  (SELECT AVG(probup) FROM oosd12 WHERE yval = 1)
;

       avg_ng2p1gt_avg       
-----------------------------
 0.001592327425966886587819


The first query tells me that up-predictions are not very predictive; the result was roughly the same as the overall average.

The second query, however, gave me confirmation that if I followed only the predictions which SVM labeled as more confident than average, my value of ng2 would be 0.00159.

This value of 0.00159 is about 3.2 times higher than average.

So, I've seen results from two queries which have convinced me that SVM has significant predictive power when I follow the predictions it has flagged as more confident than average.

I'm not sure this information is powerful enough to get rich from, but it looks good enough to easily beat the average ng2 of SPY.

Summary




Page Top