bigmemory apparently cannot handle files of 2 GiB or larger on a 32-bit OS

2013-06-23

I can't say for certain that this holds for every 32-bit OS, but I suspect it applies to most of them.
The test environment was a Sakura VPS 2 GB plan with CentOS 6 i386 installed through the custom OS install option.

# uname -a
Linux www10111uj.sakura.ne.jp 2.6.32-358.11.1.el6.i686 #1 SMP Wed Jun 12 01:01:27 UTC 2013 i686 i686 i386 GNU/Linux
# R --version
R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: i686-redhat-linux-gnu (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

Let's try building a matrix from airline.csv (about 11 GiB) with bigmemory.

# R -q
> packageVersion("bigmemory")
[1] '4.4.3'
> library(bigmemory)
Loading required package: bigmemory.sri
Loading required package: BH

bigmemory >= 4.0 is a major revision since 3.1.2; please see packages
biganalytics and and bigtabulate and http://www.bigmemory.org for more information.

> file.info("./data/airline.csv")
                          size isdir mode               mtime
./data/airline.csv 11626601878 FALSE  644 2013-06-22 23:38:45
                                 ctime               atime uid gid uname grname
./data/airline.csv 2013-06-23 02:11:05 2013-06-23 14:36:11   0   0  root   root
> x <- read.big.matrix("./data/airline.csv", type = "integer", header = TRUE,
+  backingpath = "./data", backingfile = "airline.bin", descriptorfile = "airline.desc", extraCols = "Age")
Warning: stack imbalance in '.Call', 24 then 25
Warning: stack imbalance in '-', 23 then 24
Warning: stack imbalance in '-', 22 then 23
Warning: stack imbalance in '<-', 20 then 21
Error in filebacked.big.matrix(nrow = nrow, ncol = ncol, type = type,  :
  A big.matrix must have at least one row and one column
> traceback()
5: stop("A big.matrix must have at least one row and one column")
4: filebacked.big.matrix(nrow = nrow, ncol = ncol, type = type,
       init = init, dimnames = dimnames, separated = separated,
       backingfile = backingfile, backingpath = backingpath, descriptorfile = descriptorfile)
3: big.matrix(nrow = numRows, ncol = createCols, type = type, dimnames = list(rowNames,
       colNames), init = NULL, separated = separated, backingfile = backingfile,
       backingpath = backingpath, descriptorfile = descriptorfile,
       shared = TRUE)
2: read.big.matrix("./data/airline.csv", type = "integer", header = TRUE,
       backingpath = "./data", backingfile = "airline.bin", descriptorfile = "airline.desc",
       extraCols = "Age")
1: read.big.matrix("./data/airline.csv", type = "integer", header = TRUE,
       backingpath = "./data", backingfile = "airline.bin", descriptorfile = "airline.desc",
       extraCols = "Age")

Some odd warnings show up, and filebacked.big.matrix() fails with an error.
The warnings apparently mean that a PROTECT was executed without a matching UNPROTECT, and the function responsible seems to be this one.

SEXP CCountLines(SEXP fileName)
{ 
  FILE *FP;
  double lineCount = 0;
  char readChar;
  FP = fopen(STRING_VALUE(fileName), "r");
  SEXP ret = PROTECT(NEW_NUMERIC(1));
  NUMERIC_DATA(ret)[0] = -1;                   
  if (FP == NULL) return(ret);
  do {
    readChar = fgetc(FP);
    if ('\n' == readChar) ++lineCount;
  } while( readChar != EOF );
  fclose(FP);
  NUMERIC_DATA(ret)[0] = lineCount; 
  UNPROTECT(1);                  
  return(ret);
}

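UNPROTECT is never executed when FP is NULL, so the stack imbalance warnings tell us that fopen is failing. In that case the function returns its initial value of -1 as the line count, which is presumably why filebacked.big.matrix() then complains about needing at least one row.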
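As a side note, here is a minimal sketch of the same counting logic without the R API, so it can be compiled on its own. The name count_lines is my own and is not part of bigmemory. It returns -1 when the file cannot be opened, like the original, and it reads into an int rather than a char so that EOF cannot be confused with a data byte. In the R version, the early return would additionally need an UNPROTECT(1) before it to keep the protection stack balanced.

```c
#include <stdio.h>

/* Hypothetical standalone counterpart of CCountLines (not bigmemory code).
 * Returns the number of '\n' characters in the file,
 * or -1 if the file cannot be opened. */
double count_lines(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        return -1;  /* same sentinel value the original uses */
    }

    double lines = 0;
    int c;  /* fgetc returns int; a char could not represent EOF reliably */
    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            ++lines;
        }
    }

    fclose(fp);
    return lines;
}
```

A caller would check for a negative result before using the value as a row count.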
Let's check which error fopen actually reports with the following code.

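// open.c
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 2;
    }
    // Try to open the file and report why fopen failed, if it did.
    FILE *fp = fopen(argv[1], "r");
    if (fp == NULL) {
        printf("Error: %s\n", strerror(errno));
        return 1;
    }
    fclose(fp);
    return 0;
}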

Compile it and run it.

# gcc -o open open.c
# ./open ./data/airline.csv
Error: Value too large for defined data type

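This error (EOVERFLOW) apparently occurs on a 32-bit OS when a program built without large file support tries to open a file of 2 GiB or more, because the file size no longer fits in a 32-bit off_t. A file of exactly 2 GiB is enough to trigger it.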

# dd if=/dev/zero of=2gib count=$((2*1024*1024)) bs=1024
2097152+0 records in
2097152+0 records out
2147483648 bytes (2.1 GB) copied, 6.44595 s, 333 MB/s
# ./open 2gib
Error: Value too large for defined data type
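To see why exactly 2 GiB is already too much, here is a small sketch that reports how wide off_t is under the current compiler settings and whether a given file size fits into it. The function names off_t_bits and fits_in_off_t are made up for this illustration, and the code assumes off_t is a signed integer of at most 64 bits. With a 32-bit off_t the largest representable size is 2^31 - 1 = 2147483647 bytes, one byte less than the 2147483648-byte file created above.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

/* Width of off_t in bits for the current compilation settings. */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}

/* True if a file of `size` bytes can be described by a signed off_t.
 * Assumes off_t is no wider than 64 bits. */
bool fits_in_off_t(uint64_t size)
{
    uint64_t max = (UINT64_C(1) << (off_t_bits() - 1)) - 1;
    return size <= max;
}
```

Calling fits_in_off_t(11626601878) with the size of airline.csv should return false whenever off_t_bits() is 32.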

To deal with this, it seems you can enable LFS (Large File Support), for example by adding compile options like this.

# gcc -o open open.c $(getconf LFS_CFLAGS)
# ./open 2gib
#
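On 32-bit glibc systems, getconf LFS_CFLAGS typically expands to -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64, so the same effect can be had by defining the macro in the source before any system header is included. The following is a sketch under that assumption. The helper file_size_bytes is a name I made up. It uses stat() rather than fopen(), which without LFS can fail with the same EOVERFLOW error on large files.

```c
/* Must be defined before the first system header is included. */
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64
#endif

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: returns the file size in bytes,
 * or -1 after printing the reason to stderr. */
long long file_size_bytes(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0) {
        fprintf(stderr, "stat(%s): %s\n", path, strerror(errno));
        return -1;
    }
    return (long long)st.st_size;
}
```

Passing the flags on the command line, as done above, has the advantage that no source file needs to be touched, which matters when patching someone else's package.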

So it looks like adding the same compile options to bigmemory should do the trick.

# Rscript -e 'options(repos = "http://cran.md.tsukuba.ac.jp"); download.packages("bigmemory", ".")'
trying URL 'http://cran.md.tsukuba.ac.jp/src/contrib/bigmemory_4.4.3.tar.gz'
Content type 'application/x-gzip' length 186848 bytes (182 Kb)
opened URL
==================================================
downloaded 182 Kb

     [,1]        [,2]
[1,] "bigmemory" "./bigmemory_4.4.3.tar.gz"
# tar xf bigmemory_4.4.3.tar.gz
# sed -i -e 's/DLINUX/DLINUX $(getconf LFS_CFLAGS)/' bigmemory/configure
# R CMD INSTALL bigmemory

Let's run it again.

# R -q
> library(bigmemory)
Loading required package: bigmemory.sri
Loading required package: BH

bigmemory >= 4.0 is a major revision since 3.1.2; please see packages
biganalytics and and bigtabulate and http://www.bigmemory.org for more information.

> x <- read.big.matrix("./data/airline.csv", type = "integer", header = TRUE,
+  backingpath = "./data", backingfile = "airline.bin", descriptorfile = "airline.desc", extraCols = "Age")
Error in filebacked.big.matrix(nrow = nrow, ncol = ncol, type = type,  :
  Problem creating filebacked matrix.

Hmm... still no luck. Well, there's probably no need to support 32-bit OSes at this point anyway.
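I did not track down the cause of this second error. One possible explanation, which is only a guess on my part, is that a file-backed big.matrix is memory-mapped, and a 32-bit process cannot map more bytes than its address space can hold. The sketch below checks the necessary condition that rows * cols * element size is representable as a size_t. The function name mapping_size_fits and its parameters are hypothetical and have nothing to do with bigmemory's actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if rows * cols * elem_size can be expressed as a size_t,
 * i.e. could in principle be covered by one memory mapping.
 * This is necessary but not sufficient: the usable address space of a
 * 32-bit process is smaller than SIZE_MAX. */
bool mapping_size_fits(uint64_t rows, uint64_t cols, uint64_t elem_size)
{
    if (rows == 0 || cols == 0 || elem_size == 0) {
        return true;  /* empty mapping */
    }
    if (rows > UINT64_MAX / cols) {
        return false;  /* rows * cols would overflow 64 bits */
    }
    uint64_t cells = rows * cols;
    if (cells > UINT64_MAX / elem_size) {
        return false;  /* total byte count would overflow 64 bits */
    }
    return cells * elem_size <= (uint64_t)SIZE_MAX;
}
```

For a 4-byte integer matrix, a 32-bit build would reject anything beyond roughly one billion cells, so a data set in the 11 GiB range is unlikely to fit no matter how the file is opened.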