Finding the sort order of an array in R or Ruby

Suppose you have an array that you’d like to sort by another array. A common use case might be a set of arrays of somethings and for each something you generate a score in say [0,1]. Now you’d like to sort your somethings by their scores.

Concretely, say you have an array of scores:

scores: [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
indices: [0, 1, 2, 3, 4]

and you want the indices of the sorted scores, ie

scores_sorted: [0.3347867, 0.4391635, 0.7133011, 0.8376249, 0.9069004]
indices_sorted: [0, 2, 4, 3, 1]

in R, you can always use order, as in

$ R
> scores <- runif(5)
> perm <- order(scores)
> data.frame(score=scores, order=perm)
      score order
1 0.3347867     1
2 0.9069004     3
3 0.4391635     5
4 0.8376249     4
5 0.7133011     2
>
> # and just to check
> scores[ perm ]
[1] 0.3347867 0.4391635 0.7133011 0.8376249 0.9069004

You can do something similar in ruby:

irb > scores = [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
=> [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
irb > scores.zip( (1..scores.length).to_a )
=> [[0.3347867, 1], [0.9069004, 2], [0.4391635, 3], [0.8376249, 4], [0.7133011, 5]]
irb >
irb > scores.zip( (1..scores.length).to_a ).sort_by{ |e| e.first }
=> [[0.3347867, 1], [0.4391635, 3], [0.7133011, 5], [0.8376249, 4], [0.9069004, 2]]
irb >
irb > perm = scores.zip( (1..scores.length).to_a ).sort_by{ |e| e.first }.map{ |e| e[1] - 1 }
=> [0, 2, 4, 3, 1]
irb >
irb > scores.values_at(*perm)
=> [0.3347867, 0.4391635, 0.7133011, 0.8376249, 0.9069004]

And finally in C++, you can leverage qsort_r; this function was designed to be a reentrant / threadsafe qsort so you’re given a void* to pass a block of memory into your comparison function. You can use this to sort the indices array by the scores:

#include

// [...]
// utility fns that join an array of u (unsigned int) or f (double) into a string
char* vsprintf_u(char* buff, unsigned int* array, unsigned int len){
	char* orig = buff; buff += sprintf(buff, "["); for (int i=0; i < len; i++) buff += sprintf(buff, "%3u, ", array[i]); sprintf(buff, "]");
	return orig;
}
char* vsprintf_f(char* buff, double* array, unsigned int len){
	char* orig = buff;
	buff += sprintf(buff, "["); for (int i=0; i < len; i++) buff += sprintf(buff, "%1.4f, ", array[i]); sprintf(buff, "]");
	return orig;
}

/**
 * qsort_r comparison fn: sort array indices by scores
 */
int score_comparator(void* scoresv, const void* leftv, const void* rightv){
	unsigned int* left = (unsigned int*)leftv;
	unsigned int* right = (unsigned int*)rightv;
	double* scores = (double*)scoresv;

	if (scores[ *left ] < scores[ *right ])
		return -1;
	else if (scores[ *left ] == scores[ *right ])
		return 0;
	return 1;
}

// [...]

printf("test code:\n");
double scores[5] = {0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011};
unsigned int perm[] = {0, 1, 2, 3, 4};
char buff[4192];

printf("presort:  %s\n", vsprintf_f(buff, scores, 5));
printf("presort:  %s\n", vsprintf_u(buff, perm, 5));

qsort_r(perm, 5, sizeof(unsigned int), scores, &score_comparator);

printf("postsort: %s\n", vsprintf_u(buff, perm, 5));
printf("postsort: [");
for (unsigned int i=0; i < 5u; i++)
	printf("%1.4f, ", scores[ perm[ i ] ]);
printf("]\n");

which produces when run

$ ./a.out
presort:  [0.3348, 0.9069, 0.4392, 0.8376, 0.7133, ]
presort:  [  0,   1,   2,   3,   4, ]
postsort: [  0,   2,   4,   3,   1, ]
postsort: [0.3348, 0.4392, 0.7133, 0.8376, 0.9069, ]

Note the canonical way to sort somethings by a float in c++ is to bang everything into a struct or class and leverage qsort on the structs/classes directly. However, this is often pretty inconvenient, and if you have a lot of whatever you want to sort, it's too memory intensive to put everything into structs/classes with the sole addition of your score field.

I think it's obvious why I prefer to program in R.

NB: I am developing for OS X; if you are targeting linux you'll have to figure out how to link qsort_r yourself. I think someone also decided to permute the argument order. Sigh.

Posted in Programming, Programming Languages Suck, R, R Tip | Leave a comment

Getting the value of a variable from a string in R

It’s often convenient to use reflection to get the value of a variable from the name as a string. In R, you can use the get function to do this.

In R :

blog $ R
> x = 3
> get('x')
[1] 3
>

In Ruby:

blog $ irb
irb(main):001:0> x = 3
=> 3
irb(main):002:0> eval 'x'
=> 3

though Ruby’s eval is more general, and is equivalent to eval in R allowing you to evaluate arbitrary code in a string.

Posted in R, R Tip | Leave a comment

Windows still sucks, can’t read exfat formatted on a mac

Now that OS X Snow Leopard supports exfat / fat64 and Microsoft Windows theoretically does, you might naively assume that you can share external drives between Vista and Snow Leopard. This is true with one giant caveat: you can’t format the drive on your Mac. You have to format it first on your windows box due to some unknown issue from the morons from Redmond. Posted to hopefully save you the two hours I wasted debugging this. Of course, if you say, already have 300+GB of data on your external drive that you originally formatted on a mac, this knowledge isn’t the most helpful thing ever. Sigh.

Posted in Suck | Leave a comment

Prepping the Reuters 21578 classification sample dataset

I’ve been playing around with some topic models and decided to look at the Reuters 21578 dataset. For your convenience, this dataset is stored as xml split between 20 files or so. And invalid xml at that. I prefer to work with flat text files, so this bit of ruby turns the xml into a single file, slightly cleaned of trailing periods and commas, with one line per input document. Hopefully this will save someone some time, whereas I got to spend an hour or more trying to figure out how to strip invalid UTF-8. This works in ruby 1.9 with the obvious gems installed and the reuters dataset as of today.

# dump the reuters dataset from http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
# into a single file, one line per article

require 'hpricot'
require 'iconv'

# config
directory = 'reuters21578'
do_downcase = true
strip_reuter = true				# remove the reuter if it's the last elem of the array
strip_all_reuter = false		# remove all words that match reuter
strip_trailing_comma = true
strip_trailing_period = true

output = File.new('all.txt', 'w')
iconv = Iconv.new('UTF-8//IGNORE', 'UTF-8') # used to turn invalid utf-8 into valid

Dir.entries(directory).select{ |i| i=~ /sgm$/}.each do |filename|
	file = File.new("#{ directory }/#{ filename }", 'r').read
	xml = Hpricot(file)
	articles = xml.search('/REUTERS/*/BODY')

	puts "reading #{filename} : #{ articles.length}"

	articles.each{ |article|
		a = iconv.iconv(article.innerHTML)
		a = a.split(/\s/).map{ |i| i.chomp }.select{ |i| nil == (i =~ /^&/) }

		a.map!{ |i| i.downcase } if do_downcase
		a = a[0..-2] if strip_reuter and a.last =~ /reuter/i
		a.select!{ |i| nil == (i =~ /reuter/ ) } if strip_all_reuter

		a.map!{ |i| i.sub( /,$/, '') } if strip_trailing_comma
		a.map!{ |i| i.sub( /\.$/, '') } if strip_trailing_period
		a = a.select{ |i| i != '' }.join(' ')
		output.puts(a)
	}
end

output.close
Posted in Classifiers, Programming, Programming Languages Suck | Leave a comment

Thousands Separator in printf in C++

I’ve unfortunately been writing some C++. It’s the crappiest language in the world. I just wasted 90 perfectly good minutes attempting to put thousands separators in numbers that I’m printf ing. If you naively read the man pages, it looks easy — just include ‘ in your format specifier. Unfortunately, the hidden requirement is that you call this earlier in your code:

#include<locale.h>
[...]
setlocale(LC_ALL, "");

// then
unsigned int x = 12345u;
printf("%'8u\n", x);  // -> 12,345 as desired

Edit: I was developing under OS X with XCode 3.2.6.

$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.6.7
BuildVersion:	10J4138

I also verified this works under linux, or at least centos. The man page for printf doesn’t even mention apostrophe as an option, but it still works as long as you call setlocale.

$ uname -r
2.6.18-164.6.1.el5.centos.plus
$ cat /proc/version
Linux version 2.6.18-164.6.1.el5.centos.plus (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Wed Nov 4 09:31:39 EST 2009
Posted in Programming, Programming Languages Suck | Leave a comment

Watching Lecture Videos on Your Computer

I’ve recently been watching some of the lecture videos available on videolectures.net. The site is a great resource, but often the lecturers speak too slowly. I really prefer to watch lecture videos at a higher speed, otherwise I lose focus, try to multitask and browse the internet in the background, and end up retaining neither the lecture material nor the browsing.

Instead, you can use VLC or MPlayer, both available for OS X. In VLC, you can use

 apple key plus +/-

to speed or slow down video playback. You may have to go to preferences -> audio -> all -> enable time stretching audio so that the audio is resampled to preserve pitch so the speakers don’t sound like chipmunks.

In MPlayer,

 apple key + [/]

speeds or slows video playback.

Unfortunately, both of these techniques require downloading the video, and some sites, like video lectures, don’t directly allow you to do so. Fortunately, both VLC and MPlayer allow you to do so, though with some caveats, and both are a bit finicky. For example, both will die if your internet connection goes out and force you to restart download.

Nonetheless, in VLC, you can use the following shell script (a tip I found somewhere I can’t remember on the internet):

$ cat getV.sh
/Applications/VLC.app/Contents/MacOS/VLC -I rc $1 --sout $2 vlc://quit;
$
$ /get.sh "mms://[snip] " gradient01.wmv

whence you will see a bunch of spew ala

[0x10472e368] [rc] lua interface: Listening on host "*console".
VLC media player 1.1.9 The Luggage
Remote control interface initialized. Type `help' for help.
> [0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: cannot connect to server
[0x10473ca88] access_mms access error: cannot read data 2
[0x10473c7c8] mux_avi mux: stream[0] duration:3520 totalsize:316578836 frames:88018 fps:25.000000 KiB/s:702
[0x10473c7c8] mux_avi mux: stream[1] duration:3520 totalsize:28180137 frames:18951 fps:5.383371 KiB/s:62
[0x105f1ce18] dummy demux: command `quit'

and in mplayer

$ cat getM.sh
  mplayer -dumpstream -dumpfile $2 "$1"

eg

$ ./getM.sh "[snip mms url here]" test.wmv

Finally, in order to use these to download lectures from videolectures, go to the page of a lecture you wish to watch, view source, and find the link that starts with “mms://”. I’ve had better luck using VLC than mplayer.

Posted in Uncategorized | Leave a comment

Horizontal Paging of Greenplum or Postgres Queries

When using gpsql or pgsql to query greenplum or postgres respectively, query results which exceed the width of your term will wrap in a very annoying fashion. To get horizontal paging, set the environmental variable PAGER:

export PAGER='less -RSFX'

then either in your psql or gpsql session, or in your .psqlrc file,

\pset pager always

Note that if you just want the pager for long output, not wide, you can omit the always from the pset command.

Posted in Data Munging | Tagged , | Leave a comment

Interactive Plotting in R

There are many ways to compare univariate distributions; one of my favorites is violin plots. However, if you are only comparing two distributions, then the best solution is often a scatter plot. To that end, I’ve build some code that creates an interactive scatter plot of two distributions and allows you to interactively print arbitrary strings on the graph when you select / deselect points. This creates a slightly kludgy but very handy tool for hand comparing distributions.

Unfortunately, truly interactive plotting isn’t really a part of R and you are thus forced to lean on external tools. I picked JGR, the java gui for R. This is best used by getting the JGR launch tool.

Basically, I have data with multiple tests; a single line shows the results for one item across several tests. I wish to compare the distributions.

> head(age)
    name    default      test1      test2
1 item 1 0.02110710 0.01900870 0.02030870
2 item 2 0.03160770 0.02926650 0.03345660
3 item 3 0.03909570 0.03702500 0.04016650
4 item 4 0.00262195 0.00225917 0.00302822
5 item 5 0.01668860 0.01555010 0.01783400
6 item 6 0.04223370 0.03904630 0.04123270

test data

You can use this function to throw up a window, and allow you to draw a box around items to see their information displayed in the upper left.

  library('iplots')

  visCompare <- function(dat, xname, yname){  

    # override this to display your preferred text
    makeDispString <- function(row){
      sprintf('%s : %s = %0.3f; %s = %0.3f; diff = %0.3f', row$name,
        xname, row[[xname]], yname, row[[yname]], row[[xname]] - row[[yname]])
    }

    ypoint <- 0.05 + max(dat[[yname]])

    iplot(x=dat[[xname]], y=dat[[yname]], xlab=xname, ylab=yname,
      ylim=c(0, ypoint + 0.05), xlim=c(0, max(dat[[xname]])), lwd=2)

    iabline(coef=c(0,1))
    d <- iplot.data()
    cat('Select break from the menu to exit loop')

    txtObj <- NULL

    while (!is.null(ievent.wait())){
      if (iset.sel.changed()){
        cat("sel changed\n")
        s <- iset.selected()

        if (length(s) >= 1){
          if (!is.null(txtObj) ){
            iobj.rm( txtObj )
          }

          aa <- paste( makeDispString(dat[s[1:min(3, length(s))],]), collapse="\n")
          cat(paste(aa, "\n"))
          txtObj <- itext(x=0, y=ypoint, labels=aa)
        }

      } else {
        if ( !is.null(txtObj)){

          cat(paste('removing ', txtObj, "\n"))
          iobj.rm( txtObj )
          txtObj <- NULL
        }
      }

    }
  }

To test, you can use these two bits of code:

if (F){
	read.csv(file='iplot.test.csv.txt', header=T, sep=',')
	visCompare(age, 'default', 'test2')
}
if (F){
	read.csv(file='iplot.test.csv.txt', header=T, sep=',')
	myDispFn <- function(a){ return(paste(a$name, 'blah blah', sep=' : ') }
	visCompare(age, 'default', 'test2', myDispFn)
}

And here are the results: first, a visual check via scatterplot of the differences of the two distributions:

and with the ability to highlight points and see what you're looking at:

Posted in Data Munging, Plotting, R, R Tip, Visualization | Leave a comment

Querying Postgres or Greenplum From R on a Mac, Installation Instructions

NB: this works on 64b versions of R; I tested it with the R64 app with R version 2.10.1 on Snow Leopard

Step by step instructions for talking to Postgres or Greenplum:

  1. install macports
  2. install postgres; I used 8.4
    sudo port install postgresql84
    

  3. in a shell, create an environmental variable PG_CONFIG pointing to the pg_config binary installed by postgres. In my installation, this is something like
    export PG_CONFIG=/opt/local/lib/postgresql84/bin/pg_config
    

  4. in the same shell, tell R to install the RPostgreSQL package *from source*, ie
    $ R
    > install.packages('RPostgreSQL', type='source')
    

  5. test the installation works:
    > library('RPostgreSQL')
    Loading required package: DBI
    > drv <- dbDriver('PostgreSQL')
    > db <- dbConnect(drv, host='greenplum.ip', user='earl', dbname='dbname')
    > dbGetQuery(db, 'select 1')
    ?column?
    1       1
    

Diagnosing error messages / problems:

  • If R says
    Warning message:
    In install.packages("RPostgreSQL") : package ‘RPostgreSQL’ is not available
    


    you must specify to install the package from source, as above with type=’source’

  • If you get compilation errors when installing the package that mention libpq-fe.h, then R can’t find pg_config
  • if the package installs but when loading it you get errors involving missing symbol _PQbackendPID then you are mixing 32 and 64 bit software.

Follow the links for instructions to fix your problems.

Posted in Data Munging, R, R Tip | Tagged , , | Leave a comment

Querying Databases From R on a Mac

I use a mac, currently running OS 10.6 / Snow Leopard, and I’d like to query our greenplum / postgres database from R. This used to work with R 2.9, but I unfortunately had to upgrade R, and R 2.10 on the mac is a 64 bit app. So, I want to use either RODBC or RPostgreSQL packages under 64 bit R on a mac to query postgres / greenplum.

First, I tried just installing RPostgreSQL as before. Unfortunately, I started getting weird errors when I attempted to load the package:

>library('RPostgreSQL')
Loading required package: DBI
Error in dyn.load(file, DLLpath = DLLpath, ...) :
  unable to load shared library '/Library/Frameworks/R.framework/Resources/library/RPostgreSQL/libs/x86_64/RPostgreSQL.so':
  dlopen(/Library/Frameworks/R.framework/Resources/library/RPostgreSQL/libs/x86_64/RPostgreSQL.so, 6): Symbol not found: _PQbackendPID
  Referenced from: /Library/Frameworks/R.framework/Resources/library/RPostgreSQL/libs/x86_64/RPostgreSQL.so
  Expected in: flat namespace
 in /Library/Frameworks/R.framework/Resources/library/RPostgreSQL/libs/x86_64/RPostgreSQL.so
Error: package/namespace load failed for 'RPostgreSQL'

The key bit of the error message is the missing symbol: _PQbackendPID. Some googling suggested this could be caused by mixing 32 and 64 bit libs. I used file to check and yes, indeed, I had a 32 bit version of Postgres that was refusing to talk to a 64 bit version on R. Suck.

In brief, the solution is to use ports to install postgres — in this case, postgres 8.4 as such:

sudo port install postgres84

you can use the file command to see what architecture your installed postgres is configured as:

laptop:src earl$ file `echo $PG_CONFIG`
/opt/local/lib/postgresql84/bin/pg_config: Mach-O 64-bit executable x86_64

checking, my previous postgres 8.4 install, from the Postgres Plus prebuild package, produces

file /Library/PostgresPlus/8.4SS/bin/pg_config
/Library/PostgresPlus/8.4SS/bin/pg_config: Mach-O universal binary with 2 architectures
/Library/PostgresPlus/8.4SS/bin/pg_config (for architecture ppc):	Mach-O executable ppc
/Library/PostgresPlus/8.4SS/bin/pg_config (for architecture i386):	Mach-O executable i386


Notice the lack of any 64bit support.

Then open a terminal, set the PG_CONFIG environmental variable to point to the right location, then run R from the terminal and install the package.

laptop: work earl$ export PG_CONFIG=/opt/local/lib/postgresql84/bin/pg_config

laptop: work earl$ R64
install.packages('RPostgreSQL', type='source')

If you have misconfigured the pg_config, this is the relevant bit of the compilation error message you will receive:

checking for "/libpq-fe.h"... no
configure: error: File libpq-fe.h not in ; installation may be broken.
ERROR: configuration failed for package ‘RPostgreSQL’
* removing ‘/Library/Frameworks/R.framework/Versions/2.10/Resources/library/RPostgreSQL’

Otherwise, RPostgreSQL will compile and install. Seriously, though, there *must* be a better way of distributing software on macs.

Posted in Data Munging, R, R Tip | Tagged , , , | Leave a comment