Saturday, December 04, 2010

A more efficient piece of code

CMIE's firm databases use a fine-grained product code to identify each product. Each firm is also allocated to a product code based on its predominant activities. I like to reconstruct a coarse classification out of this that suits my tastes. I do this using this R function:

cmie.14.industries <- function(s) {
  values.8 <- c("Food","Textiles",
                 "Chemicals","NonMetalMin",
                "Metals","Machinery",
                 "TransportEq","MiscManuf",
                "Diversified","Serv.IT")
  names(values.8) <- c("01010101", "01010102",
                       "01010103", "01010104",
                       "01010105", "01010106",
                       "01010107", "01010108",
                       "01010109","01010408")
  values.6 <- c("Serv.Construction","Serv.Other",
                 "Mining","Electricity")
  names(values.6) <- c("010106","010104","010102",
                       "010103")

  if (is.na(s)) {return(NA)}

  leading8 <- substr(s, 1, 8)
  attempt <- values.8[leading8]
  if (!is.na(attempt)) {return(attempt)}

  leading6 <- substr(s, 1, 6)
  attempt <- values.6[leading6]
  if (!is.na(attempt)) {return(attempt)}

  leading4 <- substr(s, 1, 4)
  if (leading4 == "0102") {return("Serv.Finance")}

  return("MISTAKE")
}

This maps each firm into one of 14 coarse categories. Here are some examples of this in action:

> cmie.14.industries("0102090000000000")
"Serv.Finance"
> cmie.14.industries("0101041502000000")
"Serv.Other" 
> cmie.14.industries("0101010601010000")
"Machinery"

So in short, the function cmie.14.industries() maps a string like "0101010601010000" into a set of 14 broad industry names such as "Machinery".

Faced with a file with roughly 48,000 firm-years, at first blush, it seems that this function has to be run 48,000 times. For a given firm, this classification could change over time, so it isn't just a matter of doing this once for each firm. Here is one simple way to do it:

badway <- function(task) {
  result <- rep("", length(task))
  for (i in 1:length(task)) {
    result[i] <- cmie.14.industries(task[i])
  }
  result
}

This is just a loop that runs over everything in the supplied vector and calls cmie.14.industries() for each element. The only concession to efficiency is that the empty vector `result' is allocated ahead of time.

This proves to be quite slow. None of the standard R vectorisation ideas offer much relief.

The key idea for obtaining a leap in performance was that while I had to run through 48,000 firm-years, the industry codes actually attain only a modest list of possibilities. This makes possible a table lookup:

goodway <- function(task) {
  possibilities <- unique(task)
  values <- rep("", length(possibilities))
  for (i in 1:length(possibilities)) {
    values[i] <- cmie.14.industries(possibilities[i])
  }
  names(values) <- possibilities
  values[task]
}

For a problem of size 1000, this works out to be 13.5 times faster:

> load("task.rda")
> length(task)
[1] 1000
> system.time(res1 <- badway(task))
   user  system elapsed 
  0.030   0.000   0.031 
> system.time(res2 <- goodway(task))
   user  system elapsed 
  0.002   0.000   0.002 

This is just a demo with a 1000-sized task. In my production situation, the performance difference is even greater, since badway() calls cmie.14.industries() 48,000 times while goodway() only calls it a few hundred times.

2 comments:

  1. Why was it important to put in all that effort to reduce the time of the program from 0.31 to 0.02 seconds? Perhaps this is a subroutine in a larger program.

    ReplyDelete
  2. It's from 0.03 to 0.002 seconds for a 1000-sized problem. In production I'm doing 48,000 firm-years so it's a bit bigger than that.

    It was a neat idea, one that actually occurs more often than we think. We want to compute f(x) where x is a vector, but many values in x are repeated so there's no need to evaluate f() at every element of x. Hence I thought it's useful to write about it.

    ReplyDelete

Please note: Comments are moderated; I will delete comments that misbehave. The rules are as follows. Only civilised conversation is permitted on this blog. Criticising me is perfectly okay; uncivilised language is not. I delete any comment which is spam, has personal attacks against anyone, or uses foul language.