Developing Business Cases for New Services in Research Libraries

Environmental Scan: Data Management and Curation Services

Over the last several years, a host of reports, articles, and presentations have emerged that demonstrate how core library skill sets, processes, and services may be updated and translated to meet the need for data management and curation, particularly in the area of large-scale, data-driven scholarly research, popularly known as e-science or, as the same tools expand into the social sciences and humanities, e-research. Libraries have recognized expertise in the ingestion, organization, preservation, and accessibility of information resources, all of which are necessary components of making sense of data management. As one involved scientist put it succinctly, "Researchers need help with things librarians are good at" (Berman 2008). In addition to these skills, libraries work across academic and organizational boundaries; data management and curation is not scalable in siloed environments (Luce 2008). Also, libraries may be perceived as credible and competent information brokers who act with long-term commitment (Corson-Rikert and McCue 2007).

Libraries bring order and stability to the information environment by developing and sharing standardized tools, taxonomies, architectures, and data transfer and linking protocols, and by creating policies to exploit these effectively. To jumpstart libraries' ability to bring these valuable skills to the data management and curation arena, the National Science Foundation (NSF) established the DataNet program in 2007; the NSF has since mandated data management and sharing plans from potential grantees. The $100 million DataNet funding program was developed to create up to five "exemplar national and global data research infrastructure organizations (dubbed DataNet Partners) that provide unique opportunities to communities of researchers to advance science and/or engineering research and learning. . . . By demonstrating feasibility, identifying best practices, establishing viable models for long term technical and economic sustainability, and incorporating frontier research, these exemplar organizations can serve as the basis for rational investment in digital preservation and access by diverse sectors of society at the local, regional, national, and international levels, paving the way for a robust and resilient national and global digital data framework" (NSF 2007). Since the inception of DataNet, DataNet Partners and other actors have made impressive headway in the areas of education, policy, research, and services (Gold 2010). Despite prevailing global economic issues, a significant number of research libraries are engaged in creating, participating in, or consulting on data management and curation activities. "The fact that investments in e-science activities are being made even during difficult budget times demonstrates that this is a priority for libraries" (Soehner, Steeves, and Ward 2010).

Building capacity for data management services may also advance related or broader research library goals. One of these is enabling a conceptual shift away from viewing libraries as primarily collections-oriented repositories of information toward viewing them as service providers that actively support the exchange of ideas and knowledge across the disciplines (Gold 2010). Institutional repositories, many of which have been underpopulated to date, may be reconceived as data warehouses (Gold 2010). There is a fear that publishers or information vendors may step in to provide data services and then charge universities high fees to access the data generated (Soehner, Steeves, and Ward 2010). Demonstrating institutional data management that mitigates such costs parallels and reinforces related open access goals. Finally, movement in this area may create new opportunities to build close working relationships between library staff and researchers (Gold 2010). This would redirect library futures away from some of the more marginalized scenarios that might be imagined from Ithaka's faculty surveys (Schonfeld and Housewright 2010) and OCLC's Research Libraries, Risk and Systemic Change (Michalko, Malpas, and Arcolio 2010).

Reports on developing data management services also reveal the shoals, troughs, and currents to be encountered in navigating the sea of research data. The research enabled by the cyberinfrastructure can be massive in scale, conducted by researchers spread across multiple institutions, and require grid computing capabilities. Creating order and stability across this ecosystem will not be accomplished by individual libraries. It is likely that a small number of research libraries, in concert with government bodies, professional organizations, and industry, will have roles in developing strategies and economic models for supporting large-scale networked research (Gold 2010). These are likely to be the libraries involved in the DataNet program. The size of the DataNet awards, $20 million each over a five-year period, recognizes the scope of the investment needed to embed storage and curatorial services and tools into the cyberinfrastructure. NSF expectations are high: "Proposals must cross more than one domain, involve diverse content, and address issues of sustainability. They should integrate library and archival science, computer science, and other scientific domains. Proposals should push the frontiers. Key issues include who are the users now and who might they be in the future. . . . Projects must provide for full life-cycle management, including data deposition and acquisition; data curation; metadata management; privacy and security; data discovery, access and dissemination; data interoperability and integration; data evaluation and visualization; related research; and education and training activities. Community and user input must be collected and assessed" (Spengler 2009). Funded projects must accomplish all of the goals of successful data management and curation in support of big, networked research. Their success will lay the groundwork for wider involvement. Most libraries will build on this work in support of data curation needs. It is equally clear that this will need to be done collaboratively; the scale of the issues is too great in most cases to promote institutional solutions.

The specific elements of data management and curation articulated by Spengler offer their own separate challenges. For example, the 2010 Association of Research Libraries (ARL) study of activities in ARL libraries noted that very little work is currently underway in libraries to support digital lab notebooks, and that most libraries are unaware of whether this is happening anywhere on their campuses or if there are archival services for lab notebooks (Soehner, Steeves, and Ward 2010). The scope of data deposition and acquisition is not clear. Flexible and extensible metadata schemes, such as the DataCite Metadata Kernel, are still in early development (Brase and Farquhar 2011) and are just beginning to build communities of practice. Data publication standards are not as advanced as text publication standards (Hense and Quadt 2011). Questions of ownership, sharing, retention, differing funding body requirements, and interoperability are all pressure points (Gold 2010; Jones 2008). While most of these issues will be worked through over time, they make it difficult to do effective service planning in the present environment.

The greatest challenges in data management and curation may be those of education, acceptance, organization, and sustainability. The technical skills needed to work comfortably in discipline-based informatics, which most closely meet data support needs, are extensive (Henty 2008). These are not broadly represented in libraries currently and will require additional education and training (Soehner, Steeves, and Ward 2010). The divide between researchers' and library staff's skill sets may limit acceptance of library involvement in data management and curation. As a respondent to the ARL study said, "No matter the educational background, one important aspect of working with faculty is becoming a trusted member of their team, and this pressure will remain until enough evidence of successful faculty-library projects have been completed" (Soehner, Steeves, and Ward 2010). Even though libraries are responding to a real need, their efforts will have little impact if there is no desire on the part of faculty and researchers to turn to them for these services (Soehner, Steeves, and Ward 2010). Decentralization and lack of unifying direction in research universities, where research groups often work in happy isolation from one another, militate against the organization needed to make services effective and efficient (Soehner, Steeves, and Ward 2010). Collaboration is critical in supporting e-research, and libraries are working across institutional boundaries to build the needed infrastructures to do so (Gold 2010). If such last-mile connections are necessary, how can the work be accomplished in the absence of first-mile cooperation?

Finally, funding and sustainability are still open questions that must be answered. DataNet goals assume that grantees will create and demonstrate sustainable funding streams, without which robust services cannot be built for the long term. As the grants are still in place, it is not currently possible to know how potential models may be developing. This is a clear concern in research libraries, where funding is fragile and already stretched to capacity (Gold 2010). The Data Conservancy, one large and visible DataNet Partner, plans on depending upon "a portfolio of funding streams," including leveraging partners' sustainability or funding (Lynch 2009). But this has yet to be put to the test. The recommendations put forward by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (2010) certainly respond to data management topics, but they will require broad conversation and significant commitment at institutional and national levels, changes that do not take place overnight.

