3,430 Matching Annotations
  1. Jan 2014
    1. Data represent important products of the scientific enterprise that are, in many cases, of equivalent or greater value than the publications that are originally derived from the research process. For example, addressing many of the grand challenge scientific questions increasingly requires collaborative research and the reuse, integration, and synthesis of data.

      Who else might care about this other than Grand Challenge Question researchers?

    2. Journals and sponsors want you to share your data

      What is the sharing standard? What are the consequences of not sharing? What is the enforcement mechanism?

      There are three primary sharing mechanisms I can think of today: email, usb stick, and dropbox (née ftp).

      Dropbox is supplanting ftp, which comes from another era but still fills an important niche for larger data sets and for higher-volume or anonymous traffic.

      Dropbox, email and usb are all easily accessible parts of the day-to-day consumer workflow; they are all trivial to set up without institutional support or, importantly, permission.

      An email account is already provisioned by default for everyone or, if the institutional email offerings are not sufficient, a person may easily set up a 3rd-party email account with no permission or hassle.

      Data management alternatives to these three options will see slow or no adoption until the barriers to access and use are as low as email; the cost of entry needs to be no more than "a web browser, an email address, and no special permission required".

    3. An effective data management program would enable a user 20 years or longer in the future to discover, access, understand, and use particular data [3]. This primer summarizes the elements of a data management program that would satisfy this 20-year rule and are necessary to prevent data entropy.

      Who cares most about the 20-year rule? This is an ideal that appeals to some, but in practice even the most zealous adherents can't picture what it looks like in any concrete way, except in the most traditional forms: physical paper journals in libraries are tangible examples of the 20-year rule.

      Until we have a digital equivalent for data, I don't blame people looking for tenure or jobs for not caring about this ideal if we can't provide a clear picture of how to achieve it widely at an institutional level. For digital materials, I think the picture people have in their minds is of tape backup. Maybe this is generational? Perhaps only when new generations arrive who were not widely exposed to cassette tapes, DVDs, and the other physical media that "old people" remember will it be possible to have a new ideal that people can see in their mind's eye.

    4. A key component of data management is the comprehensive description of the data and contextual information that future researchers need to understand and use the data. This description is particularly important because the natural tendency is for the information content of a data set or database to undergo entropy over time (i.e. data entropy), ultimately becoming meaningless to scientists and others [2].

      I agree with the key component mentioned here, but I feel the term data entropy is an unhelpful crutch.

    5. data entropy: Normal degradation in information content associated with data and metadata over time (paraphrased from [2]).

      I'm not sure what this really means, and I don't think data entropy is a helpful term. Poor practices certainly lead to disorganized collections of data, but I think this notion comes from a time when people were very concerned about degradation of the physical media on which data is stored. That is, of course, still a concern, but I think the term data entropy lends itself as an excuse for people who don't use good practices to manage data. It is a cover for the real problem, which is a kind of data illiteracy, in much the same way that we face computational illiteracy widely in the sciences. Managing data really is hard, but let's not mask it with fanciful notions like data entropy.

    6. Although data management plans may differ in format and content, several basic elements are central to managing data effectively.

      What are the "several basic elements?"

    7. By documenting your data and recommending appropriate ways to cite your data, you can be sure to get credit for your data products and their use

      Citation is an incentive. An answer to the question "What's in it for me?"

    8. This primer describes a few fundamental data management practices that will enable you to develop a data management plan, as well as how to effectively create, organize, manage, describe, preserve and share data

      Data management practices:

      • create
      • organize
      • manage
      • describe
      • preserve
      • share
    9. The goal of data management is to produce self-describing data sets. If you give your data to a scientist or colleague who has not been involved with your project, will they be able to make sense of it? Will they be able to use it effectively and properly?
    1. data practices of researchers – data accessibility, discovery, re-use, preservation and, particularly, data sharing
      • data accessibility
      • discovery
      • re-use
      • preservation
      • data sharing
    1. One respondent noted that NSF doesn't have an enforcement policy. This is presumably true of other mandate sources as well, and brings up the related and perhaps more significant problem that mandates are not always (if they are ever) accompanied by the funding required to satisfy them. Another respondent wrote that funding agencies expect universities to contribute to long-term data storage.
    2. Data management activities, grouped. The data management activities mentioned by the survey can be grouped into five broader categories: "storage" (comprising backup or archival data storage, identifying appropriate data repositories, day-to-day data storage, and interacting with data repositories); "more information" (comprising obtaining more information about curation best practices and identifying appropriate data registries and search portals); "metadata" (comprising assigning permanent identifiers to data, creating and publishing descriptions of data, and capturing computational provenance); "funding" (identifying funding sources for curation support); and "planning" (creating data management plans at proposal time). When the survey results are thus categorized, the dominance of storage is clear, with over 80% of respondents requesting some type of storage-related help. (This number may also reflect a general equating of curation with storage on the part of respondents.) Slightly fewer than 50% of respondents requested help related to metadata, a result explored in more detail below.

      Categories of data management activities:

      • storage
        • backup/archival data storage
        • identifying appropriate data repositories
        • day-to-day data storage
        • interacting with data repositories
      • more information
        • obtaining more information about curation best practices
        • identifying appropriate data registries
        • search portals
      • metadata
        • assigning permanent identifiers to data
        • creating/publishing descriptions of data
        • capturing computational provenance
      • funding
        • identifying funding sources for curation support
      • planning
        • creating data management plans at proposal time
    3. Data management activities, grouped. The data management activities mentioned by the survey can be grouped into five broader categories: "storage" (comprising backup or archival data storage, identifying appropriate data repositories, day-to-day data storage, and interacting with data repositories); "more information" (comprising obtaining more information about curation best practices and identifying appropriate data registries and search portals); "metadata" (comprising assigning permanent identifiers to data, creating and publishing descriptions of data, and capturing computational provenance); "funding" (identifying funding sources for curation support); and "planning" (creating data management plans at proposal time). When the survey results are thus categorized, the dominance of storage is clear, with over 80% of respondents requesting some type of storage-related help. (This number may also reflect a general equating of curation with storage on the part of respondents.) Slightly fewer than 50% of respondents requested help related to metadata, a result explored in more detail below.

      Storage is a broad topic and is mentioned very frequently in all of the University-run surveys.

      http://www.alexandria.ucsb.edu/~gjanee/dc@ucsb/survey/plots/q4.2.png

      Highlight by Chris during today's discussion.

    4. Distribution of departments with respect to responsibility spheres. Ignoring the "Myself" choice, consider clustering the parties potentially responsible for curation mentioned in the survey into three "responsibility spheres": "local" (comprising lab manager, lab research staff, and department); "campus" (comprising campus library and campus IT); and "external" (comprising external data repository, external research partner, funding agency, and the UC Curation Center). Departments can then be positioned on a tri-plot of these responsibility spheres, according to the average of their respondents' answers. For example, all responses from FeministStds (Feminist Studies) were in the campus sphere, and thus it is positioned directly at that vertex. If a vertex represents a 100% share of responsibility, then the dashed line opposite a vertex represents a reduction of that share to 20%. For example, only 20% of ECE's (Electrical and Computer Engineering's) responses were in the campus sphere, while the remaining 80% of responses were evenly split between the local and external spheres, and thus it is positioned at the 20% line opposite the campus sphere and midway between the local and external spheres. Such a plot reveals that departments exhibit different characteristics with respect to curatorial responsibility, and look to different types of curation solutions.

      This section contains an interesting diagram showing the distribution of departments with respect to responsibility spheres:

      http://www.alexandria.ucsb.edu/~gjanee/dc@ucsb/survey/plots/q2.5.png
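
      To make the positioning rule in the highlight concrete, here is a minimal sketch of how a department's shares could be turned into a point on such a tri-plot. The vertex layout, the `VERTICES` table, and the `triplot_position` helper are my own assumptions for illustration; the actual figure may be oriented differently and was not necessarily produced this way.

```python
import math

# Assumed layout of an equilateral tri-plot (hypothetical; the real figure
# may place the three responsibility spheres at different corners).
VERTICES = {
    "local": (0.0, 0.0),
    "external": (1.0, 0.0),
    "campus": (0.5, math.sqrt(3) / 2),
}


def triplot_position(shares):
    """Return the (x, y) point for a dict of shares keyed by sphere name.

    The shares are normalized to sum to 1 and used as weights on the
    triangle's vertices (barycentric coordinates).
    """
    total = sum(shares[name] for name in VERTICES)
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    x = sum(shares[name] / total * VERTICES[name][0] for name in VERTICES)
    y = sum(shares[name] / total * VERTICES[name][1] for name in VERTICES)
    return x, y


# The two examples from the highlight: Feminist Studies (all campus) and
# ECE (20% campus, remainder split evenly between local and external).
feminist_studies = triplot_position({"local": 0.0, "campus": 1.0, "external": 0.0})
ece = triplot_position({"local": 0.4, "campus": 0.2, "external": 0.4})
```

      Under this layout, Feminist Studies lands exactly on the campus vertex, and ECE sits midway between the local and external vertices at one fifth of the triangle's height, which matches the description of the 20% line.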

    5. In the course of your research or teaching, do you produce digital data that merits curation? 225 of 292 (77%) of respondents answered "yes" to this first question, which corresponds to 25% of the estimated population of 900 faculty and researchers who received the survey.

      For those who do not feel they have data that merits curation, I would at least like to hear a description of the kinds of data they have and why they feel it does not need to be curated.

      Some people may already be using well-curated data sets. Others may feel their data is not useful to anyone outside their own research group, so they see no need to curate it for anyone else. Even so, under some definition of "curation" there may be important unmet internal-use curation needs that are visible only to the grad students or researchers who work with the data hands-on every day.

      UPDATE: My question is essentially answered here: https://hypothes.is/a/xBpqzIGTRaGCSmc_GaCsrw
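
      As a quick check of the two percentages quoted in the highlight, the arithmetic can be sketched as follows. The counts come from the quoted text; the variable names are mine.

```python
# Counts taken from the quoted survey text.
respondents_yes = 225
respondents_total = 292
estimated_population = 900

# Share of survey respondents who answered "yes".
share_of_respondents = respondents_yes / respondents_total

# Share of the estimated surveyed population that those "yes" answers represent.
share_of_population = respondents_yes / estimated_population

print(f"{share_of_respondents:.0%} of respondents")
print(f"{share_of_population:.0%} of the estimated population")
```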

    6. Responsibility, myself versus others. It may appear that responses to the question of responsibility are bifurcated between "Myself" and all other parties combined. However, respondents who identified themselves as being responsible were more likely than not to identify additional parties that share that responsibility. Thus, curatorial responsibility is seen as a collaborative effort. (The "Nobody" category is a slight misnomer here as it also includes non-responses to this question.)

      This answers my previous question about this survey item:

      https://hypothes.is/a/QrDAnmV8Tm-EkDuHuknS2A

    7. Awareness of data and commitment to its preservation are two key preconditions for successful data curation.

      Great observation!

    8. Which parties do you believe have primary responsibility for the curation of your data? Almost all respondents identified themselves as being personally responsible.

      Would those who identify themselves as personally responsible also say that they (or their group) are the only ones responsible for the data? Or is there a belief that the institution should share that responsibility in some way?

    9. Availability of the raw survey data is subject to the approval of the UCSB Human Subjects Committee.
    10. Survey design The survey was intended to capture as broad and complete a view of data production activities and curation concerns on campus as possible, at the expense of gaining more in-depth knowledge.

      Summary of the survey design

    11. Researchers may be underestimating the need for help using archival storage systems and dealing with attendant metadata issues.

      In my mind this is a key challenge. Even if people can describe what they need for themselves (in itself a very hard problem), it is not obvious how to implement infrastructure services that aid the individual researcher and also aid collaboration across individuals in the same domain, as well as across domains and institutions, in a long-term sustainable way.

      In essence... how do we translate needs that we don't yet fully understand into infrastructure with low barrier to adoption, use, and collaboration?

    12. Researchers view curation as a collaborative activity and collective responsibility.
    13. To summarize the survey's findings:
      • Curation of digital data is a concern for a significant proportion of UCSB faculty and researchers.
      • Curation of digital data is a concern for almost every department and unit on campus.
      • Researchers almost universally view themselves as personally responsible for the curation of their data.
      • Researchers view curation as a collaborative activity and collective responsibility.
      • Departments have different curation requirements, and therefore may require different amounts and types of campus support.
      • Researchers desire help with all data management activities related to curation, predominantly storage.
      • Researchers may be underestimating the need for help using archival storage systems and dealing with attendant metadata issues.
      • There are many sources of curation mandates, and researchers are increasingly under mandate to curate their data.
      • Researchers under curation mandate are more likely to collaborate with other parties in curating their data, including with their local labs and departments.
      • Researchers under curation mandate request more help with all curation-related activities; put another way, curation mandates are an effective means of raising curation awareness.
      • The survey reflects the concerns of a broad cross-section of campus.

      Summary of survey findings.

    14. In 2012 the Data Curation @ UCSB Project surveyed UCSB campus faculty and researchers on the subject of data curation, with the goals of 1) better understanding the scope of the digital curation problem and the curation services that are needed, and 2) characterizing the role that the UCSB Library might play in supporting curation of campus research outputs.

      1) better understanding the scope of the digital curation problem and the curation services that are needed

      2) characterizing the role that the UCSB Library might play in supporting curation of campus research outputs.

    1. The project will develop an analysis package in the open-source language R and complement it with a step-by-step hands-on manual to make tools available to a broad, international user community that includes academics, scientists working for governments and non-governmental organizations, and professionals directly engaged in conservation practice and land management. The software package will be made publicly available under http://www.clfs.umd.edu/biology/faganlab/movement/.

      Output of the project:

      • analysis package written in R
      • step-by-step hands-on manual
      • make tools available to a broad, international community
      • software made publicly available

      Question: What software license will be used? The Apache License is potentially a good choice here: it is a permissive open source license supported by a wide range of communities, and it places few obligations or barriers on access and use, which supports the goal of reaching a broad international audience.

      Question: Will the data be made available under a license, as well? Maybe a CC license of some sort?

    2. These species represent not only different types of movement (on land, in air, in water) but also different types of relocation data (from visual observations of individually marked animals to GPS relocations to relocations obtained from networked sensor arrays).

      Movement types:

      • land
      • air
      • water

      Types of relocation data:

      • visual observations
      • GPS
      • networked sensor arrays
    1. Once a searchable atlas has been constructed there are fundamentally two approaches that can be used to analyze the data: one visual, the other mathematical.
    2. The initial inputs for deriving quantitative information of gene expression and embryonic morphology are raw image data, either of fluorescent proteins expressed in live embryos or of stained fluorescent markers in fixed material. These raw images are then analyzed by computational algorithms that extract features, such as cell location, cell shape, and gene product concentration. Ideally, the extracted features are then recorded in a searchable database, an atlas, that researchers from many groups can access. Building a database with quantitative graphical and visualization tools has the advantage of allowing developmental biologists who lack specialized skills in imaging and image analysis to use their knowledge to interrogate and explore the information it contains.

      1) Initial input is raw image data

      2) Feature extraction on raw image data

      3) Extracted features stored in a shared, searchable database

      4) Database available to researchers from many groups

      5) Quantitative graphical and visualization tools allow access to those without specialized skill in imaging and image analysis
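
      The pipeline summarized above could be sketched roughly as follows. This is only an illustration: the `extract_features` placeholder, the `features` table schema, and the toy "images" are all hypothetical, and a real atlas would use proper image segmentation and a shared database server rather than an in-memory SQLite file.

```python
import sqlite3


def extract_features(raw_image):
    """Placeholder for feature extraction (hypothetical).

    Treats every nonzero pixel as one "cell" and its value as the gene
    product concentration. Real pipelines segment cells and measure shape.
    """
    features = []
    for y, row in enumerate(raw_image):
        for x, value in enumerate(row):
            if value > 0:
                features.append((x, y, float(value)))
    return features


def build_atlas(images):
    """Store extracted features in a searchable table (in-memory here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE features ("
        "embryo TEXT, gene TEXT, x INTEGER, y INTEGER, concentration REAL)"
    )
    for (embryo, gene), image in images.items():
        rows = [(embryo, gene, x, y, c) for x, y, c in extract_features(image)]
        conn.executemany("INSERT INTO features VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


# Toy raw "images" keyed by (embryo, gene); the values are made up.
images = {
    ("embryo_1", "gene_a"): [[0, 3], [5, 0]],
    ("embryo_1", "gene_b"): [[0, 0], [2, 0]],
}
atlas = build_atlas(images)

# A researcher without imaging expertise can now query the atlas directly.
strong = atlas.execute(
    "SELECT gene, x, y FROM features "
    "WHERE concentration >= 3 ORDER BY gene, x, y"
).fetchall()
total_cells = atlas.execute("SELECT COUNT(*) FROM features").fetchone()[0]
```

      The point of the design is the separation of concerns in the highlight: image analysis happens once, up front, and everyone else works against the queryable table.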

    1. We regularly provide scholars with access to content for this purpose. Our Data for Research site (http://dfr.jstor.org)

      Access to this site is exceedingly slow. Note that it is still in beta.

  2. Nov 2013
    1. Not even gephi is very good at visualising temporal networks.

      Hmm, I disagree. In the version of Gephi everything is cool.