BugTrack-R Memo/53

Writing R Extensions (Version 2.7.1 (2008-06-23)): 5 System and foreign language interfaces

  • Posted by: みゅ
  • Category: none
  • Priority: normal
  • Status: completed
  • Date: 2010-01-10 15:16:01

Content

  • This is a rough translation; refer to it at your own risk.

5.1 Operating system access

5.2 Interface functions .C and .Fortran

  • These two functions provide a standard interface to compiled code that has been linked into R, either at build time or via dyn.load (see Section 5.3 [dyn.load and dyn.unload], page 62). They are primarily intended for compiled C and FORTRAN 77 code respectively, but the .C function can be used with other languages which can generate C interfaces, for example C++ (see Section 5.6 [Interfacing C++ code], page 66).
  • The first argument to each function is a character string giving the symbol name as known to C or FORTRAN, that is the function or subroutine name. (That the symbol is loaded can be tested by, for example, is.loaded("cg"): it is no longer necessary nor correct to use symbol.For, which is defunct as from R 2.5.0.) (Note that the underscore is not a valid character in a FORTRAN 77 subprogram name, and on versions of R prior to 2.4.0 .Fortran may not correctly translate names containing underscores.)
  • There can be up to 65 further arguments giving R objects to be passed to compiled code. Normally these are copied before being passed in, and copied again to an R list object when the compiled code returns. If the arguments are given names, these are used as names for the components in the returned list object (but not passed to the compiled code).
  • The following table gives the mapping between the modes of R vectors and the types of arguments to a C function or FORTRAN subroutine.
R storage mode    C type             FORTRAN type
logical           int *              INTEGER
integer           int *              INTEGER
double            double *           DOUBLE PRECISION
complex           Rcomplex *         DOUBLE COMPLEX
character         char **            CHARACTER*255
raw               unsigned char *    none
  • Do please note the first two. On the 64-bit Unix/Linux platforms, long is 64-bit whereas int and INTEGER are 32-bit. Code ported from S-PLUS (which uses long * for logical and integer) will not work on all 64-bit platforms (although it may appear to work on some). Note also that if your compiled code is a mixture of C functions and FORTRAN subprograms the argument types must match as given in the table above.
  • C type Rcomplex is a structure with double members r and i defined in the header file ‘R_ext/Complex.h’ included by ‘R.h’. (On most platforms which have it, this is compatible with the C99 double complex type.) Only a single character string can be passed to or from FORTRAN, and the success of this is compiler-dependent. Other R objects can be passed to .C, but it is better to use one of the other interfaces. An exception is passing an R function for use with call_R, when the object can be handled as void * en route to call_R, but even there .Call is to be preferred. Similarly, passing an R list as an argument to a C routine should be done using the .Call interface. If one does use the .C function to pass a list as an argument, it is visible to the routine as an array in C of SEXP types (i.e., SEXP *). The elements of the array correspond directly to the elements of the R list. However, this array must be treated as read-only and one must not assign values to its elements within the C routine: doing so bypasses R’s memory management facilities and will corrupt the object and the R session.
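  • As a side illustration of the Rcomplex layout described above, here is a minimal standalone sketch. The struct my_rcomplex and the function from_c99 are made-up stand-ins (not part of R) so that the code compiles without the R headers; it assumes a C99 compiler and relies on the C99 rule that a complex value is stored as a real part followed by an imaginary part.

```c
#include <assert.h>
#include <complex.h>
#include <string.h>

/* Stand-in for R's Rcomplex (declared in R_ext/Complex.h); redefined
   here only so the sketch compiles without the R headers. */
typedef struct {
    double r;
    double i;
} my_rcomplex;

/* Copy a C99 double complex into the struct form, relying on the C99
   guarantee that a complex value is laid out as {real, imaginary}. */
my_rcomplex from_c99(double complex z)
{
    my_rcomplex out;
    assert(sizeof(my_rcomplex) == sizeof(double complex));
    memcpy(&out, &z, sizeof out);
    return out;
}
```

  • Copying with memcpy rather than casting pointers keeps the sketch free of aliasing problems; whether the two layouts really coincide on a given platform is what the assertion checks.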
  • It is possible to pass numeric vectors of storage mode double to C as float * or to FORTRAN as REAL by setting the attribute Csingle, most conveniently by using the R functions as.single, single or mode. This is intended only to be used to aid interfacing to existing C or FORTRAN code.
  • Unless formal argument NAOK is true, all the other arguments are checked for missing values NA and for the IEEE special values NaN, Inf and -Inf, and the presence of any of these generates an error. If it is true, these values are passed unchecked.
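  • The kind of check described above can be pictured with the following standalone sketch. It is not R's actual implementation; all_finite is a hypothetical helper that only shows how NaN, Inf and -Inf can be detected in a double array with the standard isfinite macro.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Return 1 if every element of x[0..n-1] is finite, 0 otherwise.
   NaN (and therefore a missing value stored as a NaN), Inf and -Inf
   all count as non-finite. */
int all_finite(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    for (size_t k = 0; k < n; k++)
        if (!isfinite(x[k]))
            return 0;
    return 1;
}
```

  • Compiled code that is called with NAOK = TRUE receives such values unchecked, so a guard of this sort would have to live in the compiled code itself if it is needed.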
  • Argument DUP can be used to suppress copying. It is dangerous: see the on-line help for arguments against its use. It is not possible to pass numeric vectors as float * or REAL if DUP=FALSE, and character vectors cannot be used.
  • Argument PACKAGE confines the search for the symbol name to a specific shared object (or use "base" for code compiled into R). Its use is highly desirable, as there is no way to avoid two package writers using the same symbol name, and such name clashes are normally sufficient to cause R to crash. (If it is not present and the call is from the body of a function defined in a package with a name space, the shared object loaded by the first (if any) useDynLib directive will be used.)
  • For .C only you can specify an ENCODING argument: this requests that (unless DUP = FALSE) character vectors be re-encoded to the requested encoding before being passed in, and re-encoded from the requested encoding when passed back. Note that encoding names are not standardized, and not all R builds support re-encoding. (The argument is ignored with a warning if re-encoding is not supported at all: R code can test for this via capabilities("iconv").) But this can be useful to allow code to work in a UTF-8 locale by specifying ENCODING = "latin1".
  • Note that the compiled code should not return anything except through its arguments: C functions should be of type void and FORTRAN subprograms should be subroutines.
  • To fix ideas, let us consider a very simple example which convolves two finite sequences. (This is hard to do fast in interpreted R code, but easy in C code.) We could do this using .C by
void convolve(double *a, int *na, double *b, int *nb, double *ab)
{
    int i, j, nab = *na + *nb - 1;
    for(i = 0; i < nab; i++)
        ab[i] = 0.0;
    for(i = 0; i < *na; i++)
        for(j = 0; j < *nb; j++)
            ab[i + j] += a[i] * b[j];
}
  • called from R by
conv <- function(a, b)
    .C("convolve",
       as.double(a),
       as.integer(length(a)),
       as.double(b),
       as.integer(length(b)),
       ab = double(length(a) + length(b) - 1))$ab
  • Note that we take care to coerce all the arguments to the correct R storage mode before calling .C; mistakes in matching the types can lead to wrong results or hard-to-catch errors.
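  • To see the pointer-only calling convention outside R, the convolve routine can be exercised from plain C. The sketch below repeats the routine from the text and adds convolve_alloc, a hypothetical driver (not part of the manual) that plays the role of the R wrapper by allocating the result of length na + nb - 1.

```c
#include <assert.h>
#include <stdlib.h>

/* Same routine as in the text: every argument, scalars included,
   arrives as a pointer, as the .C interface requires. */
void convolve(double *a, int *na, double *b, int *nb, double *ab)
{
    int i, j, nab = *na + *nb - 1;
    for (i = 0; i < nab; i++)
        ab[i] = 0.0;
    for (i = 0; i < *na; i++)
        for (j = 0; j < *nb; j++)
            ab[i + j] += a[i] * b[j];
}

/* Hypothetical driver standing in for the R wrapper: allocates the
   result of length na + nb - 1 and returns it (the caller frees it). */
double *convolve_alloc(double *a, int na, double *b, int nb)
{
    assert(na > 0 && nb > 0);
    double *ab = malloc((size_t)(na + nb - 1) * sizeof *ab);
    if (ab != NULL)
        convolve(a, &na, b, &nb, ab);
    return ab;
}
```

  • For example, convolving {1, 2, 3} with {0, 1, 0.5} fills a result of length 5. Passing a length as a plain int instead of int * would not match what .C hands to the routine.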
  • Special care is needed in handling character vector arguments in C (or C++). Since only DUP = TRUE is allowed, on entry the contents of the elements are duplicated and assigned to the elements of a char ** array, and on exit the elements of the C array are copied to create new elements of a character vector. This means that the contents of the character strings of the char ** array can be changed, including to \0 to shorten the string, but the strings cannot be lengthened. It is possible to allocate a new string via R_alloc and replace an entry in the char ** array by the new string. However, when character vectors are used other than in a read-only way, the .Call interface is much to be preferred.
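  • The rule that strings may be shortened but not lengthened can be sketched as follows. truncate_strings is a hypothetical routine written in the style a .C call would require (all arguments are pointers); it only ever writes a terminating '\0' inside the buffers it was given.

```c
#include <assert.h>
#include <string.h>

/* Truncate each of the *n strings in x to at most *maxlen characters
   by writing a terminating '\0' in place. Strings are only shortened,
   never lengthened, so the existing buffers are always large enough. */
void truncate_strings(char **x, int *n, int *maxlen)
{
    assert(x != NULL && *maxlen >= 0);
    for (int k = 0; k < *n; k++)
        if (strlen(x[k]) > (size_t)*maxlen)
            x[k][*maxlen] = '\0';
}
```

  • Making a string longer than its original buffer would need a newly allocated string, which is the case where the text recommends the .Call interface instead.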
  • Passing character strings to FORTRAN code needs even more care, and should be avoided where possible. Only the first element of the character vector is passed in, as a fixed-length (255) character array. Up to 255 characters are passed back to a length-one character vector. How well this works (or even if it works at all) depends on the C and FORTRAN compilers on each platform.

5.3 dyn.load and dyn.unload

  • Compiled code to be used with R is loaded as a shared object (Unix and MacOS X, see Section 5.5 [Creating shared objects], page 65 for more information) or DLL (Windows).
  • The shared object/DLL is loaded by dyn.load and unloaded by dyn.unload. Unloading is not normally necessary, but it is needed to allow the DLL to be re-built on some platforms, including Windows.
  • The first argument to both functions is a character string giving the path to the object. Programmers should not assume a specific file extension for the object/DLL (such as ‘.so’) but use a construction like
file.path(path1, path2, paste("mylib", .Platform$dynlib.ext, sep=""))
  • for platform independence. On Unix-alike systems the path supplied to dyn.load can be an absolute path, one relative to the current directory or, if it starts with ‘~’, relative to the user’s home directory.
  • Loading is most often done via a call to library.dynam in the .First.lib function of a package. This has the form
library.dynam("libname", package, lib.loc)
  • where libname is the object/DLL name with the extension omitted. Note that the first argument, chname, should not be package since this will not work if the package is installed under another name (as it will be with a versioned install).
  • Under some Unix-alike systems there is a choice of how the symbols are resolved when the object is loaded, governed by the arguments local and now. Only use these if really necessary: in particular using now=FALSE and then calling an unresolved symbol will terminate R unceremoniously.
  • R provides a way of executing some code automatically when an object/DLL is either loaded or unloaded. This can be used, for example, to register native routines with R’s dynamic symbol mechanism, initialize some data in the native code, or initialize a third party library. On loading a DLL, R will look for a routine within that DLL named R_init_lib where lib is the name of the DLL file with the extension removed. For example, in the command
library.dynam("mylib", package, lib.loc)
  • R looks for the symbol named R_init_mylib. Similarly, when unloading the object, R looks for a routine named R_unload_lib, e.g., R_unload_mylib. In either case, if the routine is present, R will invoke it and pass it a single argument describing the DLL. This is a value of type DllInfo which is defined in the ‘Rdynload.h’ file in the ‘R_ext’ directory.
  • The following example shows templates for the initialization and unload routines for the mylib DLL.
#include <R.h>
#include <Rinternals.h>
#include <R_ext/Rdynload.h>
void
R_init_mylib(DllInfo *info)
{
    /* Register routines, allocate resources. */
}
void
R_unload_mylib(DllInfo *info)
{
    /* Release resources. */
}
  • If a shared object/DLL is loaded more than once the most recent version is used. More generally, if the same symbol name appears in several libraries, the most recently loaded occurrence is used. The PACKAGE argument provides a good way to avoid any ambiguity in which occurrence is meant.

5.4 Registering native routines

  • By ‘native’ routine, we mean an entry point in compiled code.
  • In calls to .C, .Call, .Fortran and .External, R must locate the specified native routine by looking in the appropriate shared object/DLL. By default, R uses the operating system-specific dynamic loader to look up the symbol. Alternatively, the author of the DLL can explicitly register routines with R and use a single, platform-independent mechanism for finding the routines in the DLL. One can use this registration mechanism to provide additional information about a routine, including the number and type of the arguments, and also make it available to R programmers under a different name. In the future, registration may be used to implement a form of “secure” or limited native access.
  • To register routines with R, one calls the C routine R_registerRoutines. This is typically done when the DLL is first loaded within the initialization routine R_init_<dll name> described in Section 5.3 [dyn.load and dyn.unload], page 62. R_registerRoutines takes 5 arguments. The first is the DllInfo object passed by R to the initialization routine. This is where R stores the information about the methods. The remaining 4 arguments are arrays describing the routines for each of the 4 different interfaces: .C, .Call, .Fortran and .External. Each argument is a NULL-terminated array of the element types given in the following table:
.C           R_CMethodDef
.Call        R_CallMethodDef
.Fortran     R_FortranMethodDef
.External    R_ExternalMethodDef
  • Currently, the R_ExternalMethodDef is the same as R_CallMethodDef type and contains fields for the name of the routine by which it can be accessed in R, a pointer to the actual native symbol (i.e., the routine itself), and the number of arguments the routine expects. For routines with a variable number of arguments invoked via the .External interface, one specifies -1 for the number of arguments which tells R not to check the actual number passed. For example, if we had a routine named myCall defined as
SEXP myCall(SEXP a, SEXP b, SEXP c);
  • we would describe this as
R_CallMethodDef callMethods[] = {
    {"myCall", (DL_FUNC) &myCall, 3},
    {NULL, NULL, 0}
};
  • along with any other routines for the .Call interface.
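  • How a NULL-terminated table of this shape can be searched is sketched below. This is a standalone model, not R's implementation: method_def, generic_fn, find_method, my_call and call_methods are all made-up names that mirror the roles of R_CallMethodDef, DL_FUNC and the callMethods array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic function-pointer type, playing the role of R's DL_FUNC. */
typedef void (*generic_fn)(void);

/* Hypothetical stand-in for R_CallMethodDef: name, routine, arity. */
typedef struct {
    const char *name;
    generic_fn  fun;
    int         num_args;   /* -1 would mean "do not check" */
} method_def;

/* Example routine with three arguments. */
int my_call(int a, int b, int c) { return a + b + c; }

/* Registration table; the all-NULL entry marks the end. */
const method_def call_methods[] = {
    {"myCall", (generic_fn) my_call, 3},
    {NULL, NULL, 0}
};

/* Walk the table until the NULL name and return the matching entry,
   or NULL if no routine of that name was registered. */
const method_def *find_method(const method_def *table, const char *name)
{
    assert(table != NULL && name != NULL);
    for (; table->name != NULL; table++)
        if (strcmp(table->name, name) == 0)
            return table;
    return NULL;
}
```

  • The terminating entry is what lets the search stop without a separate length argument, which is why every registration array must end with one.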
  • Routines for use with the .C and .Fortran interfaces are described with similar data structures, but which have two additional fields for describing the type and “style” of each argument. Each of these can be omitted. However, if specified, each should be an array with the same number of elements as the number of parameters for the routine. The types array should contain the SEXP types describing the expected type of the argument. (Technically, the elements of the types array are of type R_NativePrimitiveArgType which is just an unsigned integer.) The R types and corresponding type identifiers are provided in the following table:
R type       Type identifier
numeric      REALSXP
integer      INTSXP
logical      LGLSXP
single       SINGLESXP
character    STRSXP
list         VECSXP
  • Consider a C routine, myC, declared as
void myC(double *x, int *n, char **names, int *status);
  • We would register it as
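/* The types field is a pointer, so the array is defined separately. */
static R_NativePrimitiveArgType myC_types[] = {
    REALSXP, INTSXP, STRSXP, LGLSXP
};
R_CMethodDef cMethods[] = {
    {"myC", (DL_FUNC) &myC, 4, myC_types},
    {NULL, NULL, 0, NULL}
};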
  • One can also specify whether each argument is used simply as input, or as output, or as both input and output. The style field in the description of a method is used for this. The purpose is to allow R to transfer values more efficiently across the R-C/FORTRAN interface by avoiding copying values when it is not necessary. Typically, one omits this information in the registration data.
  • Having created the arrays describing each routine, the last step is to actually register them with R. We do this by calling R_registerRoutines. For example, if we have the descriptions above for the routines accessed by the .C and .Call we would use the following code:
void
R_init_myLib(DllInfo *info)
{
    R_registerRoutines(info, cMethods, callMethods, NULL, NULL);
}
  • This routine will be invoked when R loads the shared object/DLL named myLib. The last two arguments in the call to R_registerRoutines are for the routines accessed by .Fortran and .External interfaces. In our example, these are given as NULL since we have no routines of these types.
  • When R unloads a shared object/DLL, any registered routines are automatically removed. There is no (direct) facility for unregistering a symbol.
  • Examples of registering routines can be found in the different packages in the R source tree (e.g., stats). Also, there is a brief, high-level introduction in R News (volume 1/3, September 2001, pages 20-23).
  • In addition to registering C routines to be called by R, it can at times be useful for one package to make some of its C routines available to be called by C code in another package. An interface to support this has been provided since R 2.4.0. The interface consists of two routines declared as
void R_RegisterCCallable(const char *package, const char *name, DL_FUNC fptr);
DL_FUNC R_GetCCallable(const char *package, const char *name);
  • A package packA that wants to make a C routine myCfun available to C code in other packages would include the call
R_RegisterCCallable("packA", "myCfun", myCfun);
  • in its initialization function R_init_packA. A package packB that wants to use this routine would retrieve the function pointer with a call of the form
p_myCfun = R_GetCCallable("packA", "myCfun");
  • The author of packB is responsible for ensuring that p_myCfun has an appropriate declaration. In the future R may provide some automated tools to simplify exporting larger numbers of routines.
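  • What an "appropriate declaration" means can be seen in a standalone model of the two routines. Everything here is hypothetical (register_callable, get_callable, generic_fn, myCfun_t and the body of myCfun); in particular, returning NULL for an unknown routine is a simplification of this model and says nothing about how R itself reports that case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*generic_fn)(void);   /* plays the role of DL_FUNC */

#define MAX_CALLABLES 16
struct callable { const char *package, *name; generic_fn fun; };
struct callable callables[MAX_CALLABLES];
int n_callables = 0;

/* Model of R_RegisterCCallable: remember fun under (package, name). */
void register_callable(const char *package, const char *name, generic_fn fun)
{
    assert(n_callables < MAX_CALLABLES);
    callables[n_callables].package = package;
    callables[n_callables].name = name;
    callables[n_callables].fun = fun;
    n_callables++;
}

/* Model of R_GetCCallable: return the routine, or NULL if unknown. */
generic_fn get_callable(const char *package, const char *name)
{
    for (int k = 0; k < n_callables; k++)
        if (strcmp(callables[k].package, package) == 0 &&
            strcmp(callables[k].name, name) == 0)
            return callables[k].fun;
    return NULL;
}

/* "packA" side: the routine being exported (made-up body). */
double myCfun(double x) { return 2.0 * x; }

/* "packB" side: the typed pointer must match myCfun's real signature,
   because the registry only stores an untyped function pointer. */
typedef double (*myCfun_t)(double);
```

  • On the packB side the retrieved pointer is cast back, as in p_myCfun = (myCfun_t) get_callable("packA", "myCfun"). If myCfun_t disagreed with the real signature, the compiler could not detect it, which is why the responsibility lies with the author of packB.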
  • A package that wishes to make use of header files in other packages needs to declare them as a comma-separated list in the field LinkingTo in the ‘DESCRIPTION’ file. For example
Depends: link2, link3
LinkingTo: link2, link3
  • It should also ‘Depend’ on those packages for they have to be installed prior to this one, and loaded prior to this one (so the path to their compiled code can be found).
  • This then arranges that the ‘include’ directories in the installed linked-to packages are added to the include paths for C and C++ code.

5.5 Creating shared objects

5.6 Interfacing C++ code

5.7 Fortran I/O

5.8 Handling R objects in C

5.8.1 Handling the effects of garbage collection

5.8.2 Allocating storage

5.8.3 Details of R types

  • Users of the ‘Rinternals.h’ macros will need to know how the R types are known internally: if the ‘Rdefines.h’ macros are used then S4-compatible names are used.
  • The different R data types are represented in C by SEXPTYPE. Some of these are familiar from R and some are internal data types. The usual R object modes are given in the table.
SEXPTYPE    R equivalent
REALSXP     numeric with storage mode double
INTSXP      integer
CPLXSXP     complex
LGLSXP      logical
STRSXP      character
VECSXP      list (generic vector)
LISTSXP     “dotted-pair” list
DOTSXP      a ‘...’ object
NILSXP      NULL
SYMSXP      name/symbol
CLOSXP      function or function closure
ENVSXP      environment
  • Among the important internal SEXPTYPEs are LANGSXP, CHARSXP, PROMSXP, etc. (Note: although it is possible to return objects of internal types, it is unsafe to do so as assumptions are made about how they are handled which may be violated at user-level evaluation.) More details are given in section “R Internal Structures” in R Internals.
  • Unless you are very sure about the type of the arguments, the code should check the data types. Sometimes it may also be necessary to check data types of objects created by evaluating an R expression in the C code. You can use functions like isReal, isInteger and isString to do type checking. See the header file ‘Rinternals.h’ for definitions of other such functions. All of these take a SEXP as argument and return 1 or 0 to indicate TRUE or FALSE. Once again there are two ways to do this, and ‘Rdefines.h’ has macros such as IS_NUMERIC.
  • What happens if the SEXP is not of the correct type? Sometimes you have no other option except to generate an error. You can use the function error for this. It is usually better to coerce the object to the correct type. For example, if you find that an SEXP is of the type INTEGER, but you need a REAL object, you can change the type by using, equivalently,
PROTECT(newSexp = coerceVector(oldSexp, REALSXP));
  • or
PROTECT(newSexp = AS_NUMERIC(oldSexp));
  • Protection is needed as a new object is created; the object formerly pointed to by the SEXP is still protected but now unused.
  • All the coercion functions do their own error-checking, and generate NAs with a warning or stop with an error as appropriate.
  • Note that these coercion functions are not the same as calling as.numeric (and so on) in R code, as they do not dispatch on the class of the object. Thus it is normally preferable to do the coercion in the calling R code.
  • So far we have only seen how to create and coerce R objects from C code, and how to extract the numeric data from numeric R vectors. These can suffice to take us a long way in interfacing R objects to numerical algorithms, but we may need to know a little more to create useful return objects.

5.8.4 Attributes

5.8.5 Classes

5.8.6 Handling lists

5.8.8 Finding and setting variables

  • It will be usual that all the R objects needed in our C computations are passed as arguments to .Call or .External, but it is possible to find the values of R objects from within the C given their names. The following code is the equivalent of get(name, envir = rho).
SEXP getvar(SEXP name, SEXP rho)
{
  SEXP ans;
  if(!isString(name) || length(name) != 1)
    error("name is not a single string");
  if(!isEnvironment(rho))
    error("rho should be an environment");
  ans = findVar(install(CHAR(STRING_ELT(name, 0))), rho);
  printf("first value is %f\n", REAL(ans)[0]);
  return(R_NilValue);
}
  • The main work is done by findVar, but to use it we need to install name as a name in the symbol table. As we wanted the value for internal use, we return NULL.
  • Similar functions with syntax
void defineVar(SEXP symbol, SEXP value, SEXP rho)
void setVar(SEXP symbol, SEXP value, SEXP rho)
  • can be used to assign values to R variables. defineVar creates a new binding or changes the value of an existing binding in the specified environment frame; it is the analogue of assign(symbol, value, envir = rho, inherits = FALSE), but unlike assign, defineVar does not make a copy of the object value. setVar searches for an existing binding for symbol in rho or its enclosing environments. If a binding is found, its value is changed to value. Otherwise, a new binding with the specified value is created in the global environment. This corresponds to assign(symbol, value, envir = rho, inherits = TRUE).
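  • The difference between the two lookup rules can be modelled with a toy environment chain. This sketch does not use the R API: env, frame_index, define_var and set_var are hypothetical names, values are plain doubles, and the outermost environment of the chain is assumed to play the role of the global environment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_VARS 8

/* Toy environment: a frame of name/value bindings plus an enclosure.
   The outermost environment (enclos == NULL) plays the global role. */
typedef struct env {
    const char *names[MAX_VARS];
    double      values[MAX_VARS];
    int         n;
    struct env *enclos;
} env;

/* Index of name in this frame only, or -1 if it is not bound there. */
int frame_index(const env *rho, const char *name)
{
    for (int k = 0; k < rho->n; k++)
        if (strcmp(rho->names[k], name) == 0)
            return k;
    return -1;
}

/* Like defineVar: bind in rho's own frame, creating the binding if
   it does not exist yet. Enclosing environments are never consulted. */
void define_var(env *rho, const char *name, double value)
{
    int k = frame_index(rho, name);
    if (k < 0) {
        assert(rho->n < MAX_VARS);
        k = rho->n++;
        rho->names[k] = name;
    }
    rho->values[k] = value;
}

/* Like setVar: update the first binding found in rho or its
   enclosures; if none exists, bind in the outermost environment. */
void set_var(env *rho, const char *name, double value)
{
    env *e = rho;
    for (;;) {
        int k = frame_index(e, name);
        if (k >= 0) { e->values[k] = value; return; }
        if (e->enclos == NULL) break;
        e = e->enclos;
    }
    define_var(e, name, value);
}
```

  • With a local environment enclosed by a global one, set_var on the local environment changes an x that exists only globally, whereas define_var creates a separate local x and leaves the global one alone.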

5.8.9 Some convenience functions

5.8.10 Named objects and copying

5.9 Interface functions .Call and .External

  • In this section we consider the details of the R/C interfaces.
  • These two interfaces have almost the same functionality. .Call is based on the interface of the same name in S version 4, and .External is based on .Internal. .External is more complex but allows a variable number of arguments.

5.9.1 Calling .Call

  • Let us convert our finite convolution example to use .Call, first using the ‘Rdefines.h’ macros. The calling function in R is
conv <- function(a, b) .Call("convolve2", a, b)
  • which could hardly be simpler, but as we shall see all the type checking must be transferred to the C code, which is
#include <R.h>
#include <Rdefines.h>
SEXP convolve2(SEXP a, SEXP b)
{
  int i, j, na, nb, nab;
  double *xa, *xb, *xab;
  SEXP ab;
  PROTECT(a = AS_NUMERIC(a));
  PROTECT(b = AS_NUMERIC(b));
  na = LENGTH(a); nb = LENGTH(b); nab = na + nb - 1;
  PROTECT(ab = NEW_NUMERIC(nab));
  xa = NUMERIC_POINTER(a); xb = NUMERIC_POINTER(b);
  xab = NUMERIC_POINTER(ab);
  for(i = 0; i < nab; i++) xab[i] = 0.0;
  for(i = 0; i < na; i++)
    for(j = 0; j < nb; j++) xab[i + j] += xa[i] * xb[j];
  UNPROTECT(3);
  return(ab);
}
  • Note that unlike the macros in S version 4, the R versions of these macros do check that coercion can be done and raise an error if it fails. They will raise warnings if missing values are introduced by coercion. Although we illustrate doing the coercion in the C code here, it often is simpler to do the necessary coercions in the R code.
  • Now for the version in R-internal style. Only the C code changes.
#include <R.h>
#include <Rinternals.h>
SEXP convolve2(SEXP a, SEXP b)
{
  R_len_t i, j, na, nb, nab;
  double *xa, *xb, *xab;
  SEXP ab;
  PROTECT(a = coerceVector(a, REALSXP));
  PROTECT(b = coerceVector(b, REALSXP));
  na = length(a); nb = length(b); nab = na + nb - 1;
  PROTECT(ab = allocVector(REALSXP, nab));
  xa = REAL(a); xb = REAL(b);
  xab = REAL(ab);
  for(i = 0; i < nab; i++) xab[i] = 0.0;
  for(i = 0; i < na; i++)
    for(j = 0; j < nb; j++) xab[i + j] += xa[i] * xb[j];
  UNPROTECT(3);
  return(ab);
}
  • This is called in exactly the same way.

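5.9.2 Calling .External

  • We can use the same example to illustrate .External. The R code changes only by replacing .Call by .External
conv <- function(a, b) .External("convolveE", a, b)
  • but the main change is how the arguments are passed to the C code, this time as a single SEXP. The only change to the C code is how we handle the arguments.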
#include <R.h>
#include <Rinternals.h>
SEXP convolveE(SEXP args)
{
  int i, j, na, nb, nab;
  double *xa, *xb, *xab;
  SEXP a, b, ab;
  PROTECT(a = coerceVector(CADR(args), REALSXP));
  PROTECT(b = coerceVector(CADDR(args), REALSXP));
  ...
}
  • Once again we do not need to protect the arguments, as in the R side of the interface they are objects that are already in use. The macros
first = CADR(args);
second = CADDR(args);
third = CADDDR(args);
fourth = CAD4R(args);
  • provide convenient ways to access the first four arguments. More generally we can use the CDR and CAR macros as in
args = CDR(args); a = CAR(args);
args = CDR(args); b = CAR(args);
  • which clearly allows us to extract an unlimited number of arguments (whereas .Call has a limit, albeit at 65 not a small one).
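  • The CDR/CAR walk can be pictured with a toy linked list. This is a model only: cell and sum_args are hypothetical, values are plain doubles rather than SEXPs, and the end of the list is marked by a C NULL pointer here, whereas a real pairlist ends in R_NilValue.

```c
#include <assert.h>
#include <stddef.h>

/* Toy cons cell standing in for one node of a pairlist. */
typedef struct cell {
    double       car;   /* the value held in this node */
    struct cell *cdr;   /* the rest of the list, NULL at the end */
} cell;

/* Sum all arguments after the first cell, which (as with .External)
   holds the routine name and is skipped. The number of arguments
   visited is stored in *nargs. */
double sum_args(const cell *args, int *nargs)
{
    double total = 0.0;
    assert(args != NULL && nargs != NULL);
    *nargs = 0;
    for (args = args->cdr; args != NULL; args = args->cdr) {
        total += args->car;
        (*nargs)++;
    }
    return total;
}
```

  • Because the loop simply follows cdr until the list ends, it works for any number of arguments without the routine declaring them in advance.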
  • More usefully, the .External interface provides an easy way to handle calls with a variable number of arguments, as length(args) will give the number of arguments supplied (of which the first is ignored). We may need to know the names (‘tags’) given to the actual arguments, which we can do by using the TAG macro and something like the following example, which prints the names and the first value of its arguments if they are vector types.
#include <R.h>
#include <Rinternals.h>
#include <R_ext/PrtUtil.h>
SEXP showArgs(SEXP args)
{
  int i;
  Rcomplex cpl;
  const char *name;
  args = CDR(args); /* skip 'name' */
  /* The loop header already advances args, so it is not advanced again
     in the body (doing so would skip every other argument). */
  for(i = 0; args != R_NilValue; i++, args = CDR(args)) {
    /* Unnamed arguments have a NULL tag, so fall back to "". */
    name = isNull(TAG(args)) ? "" : CHAR(PRINTNAME(TAG(args)));
    switch(TYPEOF(CAR(args))) {
    case REALSXP:
      Rprintf("[%d] '%s' %f\n", i+1, name, REAL(CAR(args))[0]);
      break;
    case LGLSXP:
    case INTSXP:
      Rprintf("[%d] '%s' %d\n", i+1, name, INTEGER(CAR(args))[0]);
      break;
    case CPLXSXP:
      cpl = COMPLEX(CAR(args))[0];
      Rprintf("[%d] '%s' %f + %fi\n", i+1, name, cpl.r, cpl.i);
      break;
    case STRSXP:
      Rprintf("[%d] '%s' %s\n", i+1, name,
              CHAR(STRING_ELT(CAR(args), 0)));
      break;
    default:
      Rprintf("[%d] '%s' R type\n", i+1, name);
    }
  }
  return(R_NilValue);
}
  • This can be called by the wrapper function
showArgs <- function(...) .External("showArgs", ...)
  • Note that this style of programming is convenient but not necessary, as an alternative style is
showArgs1 <- function(...) .Call("showArgs1", list(...))
  • The (very similar) C code is in the scripts.

5.9.3 Missing and special values

5.10 Evaluating R expressions from C

5.10.1 Zero-finding

5.10.2 Calculating numerical derivatives

5.11 Parsing R code from C

5.12 External pointers and weak references

5.13 Vector accessor functions

5.14 Character encoding issues
