<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Stochastic Nonsense &#187; data frame</title>
	<atom:link href="http://blog.earlh.com/index.php/tag/data-frame/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.earlh.com</link>
	<description></description>
	<lastBuildDate>Mon, 19 Sep 2011 03:30:21 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Querying Databases in R</title>
		<link>http://blog.earlh.com/index.php/2009/08/querying-databases-in-r/</link>
		<comments>http://blog.earlh.com/index.php/2009/08/querying-databases-in-r/#comments</comments>
		<pubDate>Fri, 14 Aug 2009 16:00:36 +0000</pubDate>
		<dc:creator>earl</dc:creator>
				<category><![CDATA[Data Munging]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[R Tip]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[greenplum]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[postgres]]></category>
		<category><![CDATA[R and Databases]]></category>
		<category><![CDATA[R Tips]]></category>

		<guid isPermaLink="false">http://blog.earlh.com/?p=449</guid>
		<description><![CDATA[One of the first things you&#8217;ll want to do in R is set it up to talk to databases. The easiest way to do this is using ODBC, via the RODBC package. To get the package, run > install.packages('RODBC') Once you &#8230; <a href="http://blog.earlh.com/index.php/2009/08/querying-databases-in-r/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>One of the first things you&#8217;ll want to do in R is set it up to talk to databases.  The easiest way to do this is using ODBC, via the RODBC package.</p>
<p>To get the package, run<br />
<code>
<pre class="brush:text;">
> install.packages('RODBC')
</pre>
<p></code></p>
<p>Once you have RODBC installed, using it is very simple: a bit of setup, then sqlQuery will run your SQL and return the results in a data frame.<br />
<code>
<pre class="brush: text;">
library(RODBC)

db <- odbcConnect( dsn='your dsn name' )
sql <- 'select page_id, count(*) as cnt
           from document_ads
           group by page_id
           having count(*) > 1'

results <- sqlQuery(db, sql, errors=TRUE, rows_at_time=1024)
str(results)
'data.frame':	282432 obs. of  2 variables:
 $ page_id: int  17646774 17115332 17606022 15899428 17099174 17283774 8604200 16315025 17259751 17283270 ...
 $ cnt    : int  489 1119 132 113 148 200 112 121 1135 633 ...
</pre>
<p></code></p>
<p>On Windows, you set up DSNs in ODBC Data Sources inside the Control Panel; on MacOS, MySQL includes a program called ODBC Administrator; on Linux, you'll have to install <a href="http://www.easysoft.com/developer/interfaces/odbc/linux.html">unixODBC</a>.</p>
<p>Also, it's often convenient to write code that caches your query results, particularly if the query takes a while.  I've found that the easiest thing to do is write the results into a data file and check whether that file exists, like so:<br />
<code>
<pre class="brush:text;">
filename <- 'query cache.RData'
if (!file.exists(filename)){
   # don't have a cached copy so run the query
   library(RODBC)
   [snip]
   query1 <- sqlQuery(db, sql, errors=TRUE, rows_at_time=1024)

   # save the query results for the future
   save(list=c('query1', 'sql'), file=filename)
   rm(list=c('query1', 'sql') )
}
# load the cached results; this restores both query1 and sql
load(file=filename)
</pre>
<p></code></p>
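<p>The caching pattern above can also be wrapped in a small helper.  This is only a minimal sketch: cached_result is a hypothetical name (it is not part of RODBC), and it uses saveRDS and readRDS instead of save and load so the cached object comes back as a return value.  The slow function below is a stand-in for a real query.</p>
<pre class="brush:text;">
# hypothetical helper: compute a result once, then reuse the copy on disk
cached_result <- function(filename, run) {
   if (!file.exists(filename)) {
      # don't have a cached copy, so compute the result and save it
      saveRDS(run(), file=filename)
   }
   readRDS(filename)
}

# stand-in for a slow query; counts how often it really runs
calls <- 0
slow <- function() {
   calls <<- calls + 1
   data.frame(page_id=1:3, cnt=c(2L, 5L, 3L))
}

cache_file <- tempfile(fileext='.rds')
r1 <- cached_result(cache_file, slow)   # runs slow() and writes the cache
r2 <- cached_result(cache_file, slow)   # read from the cache; slow() is not called again
</pre>
<p>With RODBC, run would be a function that calls sqlQuery, for example function() sqlQuery(db, sql, errors=TRUE, rows_at_time=1024).  Delete the cache file whenever you want the query to run again.</p>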
]]></content:encoded>
			<wfw:commentRss>http://blog.earlh.com/index.php/2009/08/querying-databases-in-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Examining Data Frames &#8212; head and tail</title>
		<link>http://blog.earlh.com/index.php/2009/08/examining-data-frames-head-and-tail/</link>
		<comments>http://blog.earlh.com/index.php/2009/08/examining-data-frames-head-and-tail/#comments</comments>
		<pubDate>Sun, 02 Aug 2009 07:30:39 +0000</pubDate>
		<dc:creator>earl</dc:creator>
				<category><![CDATA[Data Munging]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[R Tip]]></category>
		<category><![CDATA[data frame]]></category>

		<guid isPermaLink="false">http://blog.earlh.com/?p=367</guid>
		<description><![CDATA[head and tail, named after the Unix command-line utilities, are two very handy functions for looking at data frames. Along with str, which displays the structure of a data frame, they help you look at your data: > &#8230; <a href="http://blog.earlh.com/index.php/2009/08/examining-data-frames-head-and-tail/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>head and tail, named after the Unix command-line utilities, are two very handy functions for looking at data frames.  Along with str, which displays the structure of a data frame, they help you look at your data:</p>
<p><code>
<pre class="brush:text;">
> d <- data.frame(mean=rep(1:10,5), val = rnorm(n=50, mean=rep(1:10,5)))
> d <- d[ order(d$mean), ]
>
> str(d)
'data.frame':	50 obs. of  2 variables:
 $ mean: int  1 1 1 1 1 2 2 2 2 2 ...
 $ val : num  2.303 2.222 -1.153 1.795 -0.232 ...
>
> head(d)
   mean        val
1     1  2.3026422
11    1  2.2216371
21    1 -1.1533163
31    1  1.7945563
41    1 -0.2318763
2     2 -0.4994239
>
> tail(d)
   mean       val
49    9  8.462525
10   10 10.437314
20   10 10.815264
30   10 10.218853
40   10  9.754245
50   10  9.596825
>
</pre>
<p></code> </p>
<p>If you are familiar with data frames, you&#8217;ll know that head(d) is no different from displaying the first 6 rows via subsetting, e.g. d[1:6, ], but it saves some typing.  tail, on the other hand, saves us from either first asking how many rows a data frame has with nrow, or typing a mess: d[ (nrow(d)-5):nrow(d), ]<br />
<code>
<pre class="brush:text;">
> nrow(d)
[1] 50
> d[45:50,]
   mean       val
49    9  8.462525
10   10 10.437314
20   10 10.815264
30   10 10.218853
40   10  9.754245
50   10  9.596825
> d[ (nrow(d)-5):nrow(d), ]
   mean       val
49    9  8.462525
10   10 10.437314
20   10 10.815264
30   10 10.218853
40   10  9.754245
50   10  9.596825
>
</pre>
<p></code></p>
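<p>Both functions also take an n argument when you want something other than 6 rows.  A short sketch, rebuilding d as above; the set.seed call is an addition here so the random values are reproducible:</p>
<pre class="brush:text;">
set.seed(1)   # assumption: fixed seed, not in the original example
d <- data.frame(mean=rep(1:10,5), val = rnorm(n=50, mean=rep(1:10,5)))
d <- d[ order(d$mean), ]

first3 <- head(d, n=3)          # the first 3 rows
last3 <- tail(d, n=3)           # the last 3 rows
all_but_last3 <- head(d, n=-3)  # a negative n drops rows from the end instead
</pre>
<p>A negative n is handy when you want everything except a few trailing rows, without computing nrow(d) yourself.</p>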
]]></content:encoded>
			<wfw:commentRss>http://blog.earlh.com/index.php/2009/08/examining-data-frames-head-and-tail/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

