Farewell CDL!

[Cross Posted from the UC3 Blog]

A little over two years ago, after an exhausting day of packing up our apartment in Brooklyn, I turned to my partner and said, “Hey, remember when I said I wasn’t going to do a postdoc?”

This was a joke, intended to offset the anxiety we were both feeling about our impending move across the country. But, after deciding to not pursue the “traditional” academic path (graduate school → postdoctoral fellowships → faculty position) and shifting from working in cognitive neuroscience labs to working in academic libraries, I had long assumed that my window into the liminal space occupied by postdocs had closed. That is, until I learned about the CLIR Postdoctoral Fellowship Program and saw an opportunity to dive headfirst into the wider world of scholarly communications and open science with the UC3 team at California Digital Library.

Today is my last day in the office at CDL and so much has happened in the world and for me personally and professionally over the course of my fellowship that I’m not sure anything I could write here would ever do it all justice. I suppose I could assess my time at CDL in terms of the number of posters, papers, and presentations I helped put together. I could mention my involvement with groups like BITSS and RDA. I could add up all the hours I’ve spent talking on Skype and Zoom or all the words I’ve written (and rewritten) in Slack and Google Docs. But really the most meaningful metric of my time at CDL would be the number of new colleagues, collaborators, and friends I’ve gained as a postdoc. I came to CDL because I wanted to become part of the broad community of folks working on research data in academic libraries. And now, as I’m about to move into a new position as the Data Services Librarian at Lane Medical Library, I can say that has happened more than I would have thought possible.

Looking back on the last two years, there are about a million people to whom I owe a heartfelt thanks. If you’re out there and you don’t get an email from me, it’s almost definitely because I wrote something, decided it was completely insufficient, wrote something else, decided that was completely insufficient, and then got completely overwhelmed by the number of drafts in my mailbox. But seriously, thanks to everyone on the UC3 team, at CDL and the UC libraries, and beyond for everything you’ve done for me and for everything you’ve helped me do.

Looking forward to what comes next, I have about a million ideas for new projects. Some are extensions of work I started during my fellowship, while others are the product of the connections, insights, or interests I developed while at CDL. But, since this is my last blog post as a postdoc, I also want to devote some space to one last UC3 project update.

Support Your Data

If there is a common thread that ties together all of the work I’ve done at CDL it is that I really want to bridge the communication gap that exists between researchers and data librarians. The most explicit manifestation of this has been the Support Your Data project.

If you’ve missed all my blog posts, posters, and presentations on the topic, the goal of the Support Your Data project is to create tools for researchers to assess and advance their own data management practices. With an immense amount of help from the UC3 team, I drafted a rubric that describes activities related to data management and sharing in a framework that we hope is familiar and useful to researchers. Complementing this rubric, we also created a series of short guides that give actionable advice on topics such as data management planning, data organization and storage, documentation, and data publishing. Because we assumed that different research communities (e.g., researchers in different disciplines, researchers at different institutions) have different data-related needs and access to different data-related resources, all of these materials were designed with an eye towards easy customization.

A full rundown of the Support Your Data project will be given in a forthcoming project report. The short version is that, now that the majority of the content has been drafted, the next step is to work on design and adoption. We want researchers and librarians to use these tools so we want to make sure the final products don’t look like something I’ve been working on in a series of Google spreadsheets. Though I will no longer be leading the project, this work will continue at CDL. That said, I have a lot of ideas about using the Support Your Data materials as they currently exist as a jumping off point for future projects.

Data Management Practices in Neuroscience

I’m still surprised I convinced a library to let me do a neuroimaging project. I mean, I’m not that surprised, I can be pretty convincing when I start arguing that neuroimaging is a perfect case study for studying how researchers actually manage their data. But I think it says a lot about the UC3 team that they fully supported me as I dove deep into the literature describing fMRI data analysis workflows, charted the history of data sharing in cognitive neuroscience, and wrangled all manner of acronyms (ahem, BIDS).

As I outlined in a previous blog post, the idea to survey neuroimaging researchers literally started with a tweet. But, before too long, it became a full-fledged collaborative research project. As a former imaging researcher, I am still marveling over the fact that my collaborator Ana Van Gulick (another neuroscientist turned research-data-in-libraries person) and I managed to collect data from over 140 participants so quickly. Our principal aim was to provide valuable insights to both the neuroimaging and data curation communities, but this project also gave us the opportunity to practice what we preach and apply open science practices to our own work. A paper describing the results of our survey of the data management practices of MRI researchers is currently making its way through the peer review process, but we’ve already published a preprint and made our materials and data openly available.

We definitely hope to continue working with the neuroimaging community, but we also plan to do follow-up surveys of other research communities. Given the growing emphasis on transparency and open science practices in the field, what do data management practices look like in psychology? We hope to find out soon!

Exploring Researcher Needs and Values Related to Software

One of the principal aims of my fellowship was to explore issues around software curation. Spoiler alert: Though the majority of my projects touched on the subject of research software in some way, I’m still not sure I’ve come up with a comprehensive definition of what “software curation” actually means in practice. Shoutout to my fellow software curation fellows who continue to bring their array of perspectives and high levels of expertise to this issue (and thanks for not rolling your eyes at the cognitive neuroscientist trying to understand how computers work).

Before I started at CDL I knew that I would be working with Yasmin AlNoamany, my counterpart at the UC Berkeley library, on a project involving research software. To extend previous work done by the UC3 team around issues related to data publishing, we eventually decided to survey researchers on how they use, share, and value their software tools. Our results, which we hope will help libraries and other research support groups shape their service offerings, are described in this preprint. We’ve also made our materials and data openly available.

There is still a lot of work to be done defining the problems and solutions of software curation. Though we currently don’t have plans to do any follow-up studies, we have another paper in the works describing the rest of our results and our survey will definitely inform how I plan to organize software-related training and outreach in the future. The UC3 team will also be continuing to work in this area, through their involvement with The Carpentries.

But wait, there’s more

Earlier this week, after another exhausting day of packing up our apartment outside of Berkeley, I kept remarking to my partner, “Hey, remember when I thought I’d never get a job at Stanford?”

This is a joke too. We’re not moving across the country this time, but the move feels just as significant. Two years ago I was sad to leave New York, but ultimately decided I needed to take a step forward in my career. Now, as I’m about to take another step, I’m very sad to leave CDL. I’m very excited about what comes next, of course. But I will always be grateful to CLIR and the UC3 team for giving me the opportunity to learn so much and connect with so many amazing friends, collaborators, and colleagues.

Thanks everyone!


Neuroimaging as a case study in research data management (Part 2)

[Cross Posted from Medium]

Part 2: On practicing what we preach

A few weeks ago I described the results of a project investigating the data management practices of neuroimaging researchers. The main goal of this work is to help inform efforts to address rigor and reproducibility in both the brain imaging (neuroimaging) and academic library communities. But, as we were developing our materials, a second goal emerged: practice what we preach and actually apply the open science methods and tools we in the library community have been recommending to researchers.

Wait, what? Open science methods and tools

Before jumping into my experience of using open science tools as part of a project that involves investigating open science practices, it’s probably worth taking a step back and defining what the term actually means. It turns out this isn’t exactly easy. Even researchers working in the same field understand and apply open science in different ways. To make things simpler for ourselves when developing our materials, we used “open science” broadly to refer to the application of methods and tools that make the processes and products of the research enterprise available for examination, evaluation, use, and re-purposing by others. This definition doesn’t address the (admittedly fuzzy) distinctions between related movements such as open access, open data, open peer review, and open source, but we couldn’t exactly tackle all of that in a 75-question survey.

From programming languages used for data analysis like Python and R, to collaboration platforms like GitHub and the Open Science Framework (OSF), to writing tools like LaTeX and Zotero, to data sharing tools like Dash, figshare, and Zenodo, there are A LOT of different methods and tools that fall under the category of open science. Some of them worked for our project, some of them didn’t.

Data Analysis Tools

As both an undergraduate and graduate student, all of my research methods and statistics courses involved analyzing data with SPSS. Even putting aside the considerable (and recurrent) cost of an SPSS license, I wanted to go a different direction in order to get some first-hand experience with the breadth of analysis tools that have been developed and popularized over the last few years.

I thought about trying my hand at a Jupyter notebook, which would have allowed us to share all of our data and analyses in one go. However, I also didn’t want to delay things as I taught myself how to work within a new analysis environment. From there, I tried a few “SPSS-like” applications like PSPP and Jamovi and would recommend both to anyone who has a background like mine and isn’t quite ready to start writing code. I ultimately settled on JASP because, after taking a cursory look through our data using Excel (I know, I know), it was actually being used by the participants in our sample. It turns out that’s probably because it’s really intuitive and easy to use. Now that I’m not in the middle of analyzing data, I’m going to spend some time learning other tools. But, while I do that, I’m going to keep using and recommending JASP.

From the very beginning, we planned on making our data open. Though I wasn’t necessarily thinking about it at the time, this turned out to be another good reason to try something other than SPSS. Though there are workarounds, .sav is not exactly an open file format. But our plan to make the data open not only affected the choice of analysis tools, it also affected how I felt while running the various statistical tests. On one hand, knowing that other researchers would be able to dive deep into our data amplified my normal anxiety about checking and re-checking (and statcheck-ing) the analyses. On the other hand, it also greatly reduced my anxiety about inadvertently relegating an interesting finding to the proverbial file-drawer.
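
To give a sense of those workarounds: below is a minimal sketch of exporting SPSS data to an open format in Python, assuming the pyreadstat package and a hypothetical survey.sav file (this is an illustration, not what we actually did for this project).

```python
# Hypothetical example: convert an SPSS .sav file to open formats.
# Assumes `pip install pyreadstat` and a file named survey.sav.
import pyreadstat

# Read the SPSS file; df is a pandas DataFrame, meta holds SPSS metadata.
df, meta = pyreadstat.read_sav("survey.sav")

# Write the responses to a plain-text, openly readable format.
df.to_csv("survey.csv", index=False)

# Save the variable labels as a simple codebook, since they
# would otherwise be lost in the conversion to CSV.
with open("survey_codebook.txt", "w") as codebook:
    for name, label in zip(meta.column_names, meta.column_labels):
        codebook.write(f"{name}\t{label}\n")
```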

Collaboration Tools

When we first started, it seemed sensible to create a repository on the Open Science Framework in order to keep our various files and tools organized. However, since our collaboration is between just two people and there really aren’t that many files and tools involved, it became easier to just use services that were already incorporated in our day-to-day work: namely e-mail, Skype, Google Drive, and Box. Though I see how it could be potentially useful for a project with more moving parts, for our purposes it mostly just added an unnecessary extra step.

Writing Tools

This is where I restrain myself from complaining too much about LaTeX. Personally, I find it a less than awesome platform for doing any kind of collaborative writing. Since we weren’t writing chunks of code, I also couldn’t find an excuse to write the paper in R Markdown. Almost all of the collaborative writing I’ve done since graduate school has been in Google Docs and this project was no exception. It’s not exactly the best when it comes to formatting text or integrating tables and figures, but I haven’t found a better tool for working on a text with other people.

We used a Mendeley folder to share papers and keep our citations organized. Zotero has the same functionality, but I personally find Mendeley slightly easier to use. In retrospect, we could also have used something like F1000 Workspace, which has more direct integration with Google Docs.

This project is actually the first time I’ve published a preprint. Like making our data open, this was the plan all along. The formatting was done in Overleaf, mostly because it was a (relatively) user-friendly way to use LaTeX and I was worried our tables and figures would break the various MS Word bioRxiv templates that are floating around. Similar to making our data open, planning to publish a preprint had an impact on the writing process. I’ve since noticed a typo or two, but knowing that people would be reading our preprint only days after its submission made me especially anxious to check the spelling, grammar, and general flow of our paper. On the other hand, it was a relief to know that the community would be able to read the results of a project that started at the very beginning of my postdoc before its conclusion.

Data Sharing Tools

Our survey and data are both available via figshare. More specifically, we submitted our materials to Kilthub, Carnegie Mellon’s instance of figshare for institutions. For those of you out there currently raising an eyebrow, we didn’t submit to Dash, UC3’s data publication platform, because of an agreement outlined when we were going through the IRB process. Overall, the submission was relatively straightforward, though the curation process definitely made me consider how difficult it is to balance adding proper metadata and documentation to a project with the desire (or need) to just get material out there quickly.
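
As an aside, deposits like this can also be scripted. Here is a rough sketch of creating a draft item via figshare’s v2 REST API; the endpoint and field names reflect my reading of the public API documentation (not anything we did for this project), so double-check the current docs before relying on it.

```python
# Hypothetical sketch: create a draft figshare item via the v2 REST API.
# Assumes a personal access token from your figshare account settings
# and the `requests` package; see docs.figshare.com for the real API.
import requests

TOKEN = "YOUR_PERSONAL_TOKEN"  # placeholder, not a real token
headers = {"Authorization": f"token {TOKEN}"}

# The metadata is where the curation effort goes: a descriptive title,
# a real description, and keywords that make the item findable.
metadata = {
    "title": "Survey of RDM practices (illustrative example)",
    "description": "Survey instrument and anonymized responses.",
    "keywords": ["research data management", "survey", "neuroimaging"],
}

response = requests.post(
    "https://api.figshare.com/v2/account/articles",
    headers=headers,
    json=metadata,
)
response.raise_for_status()
print("Draft item created at:", response.json()["location"])
# Attaching the actual files is a separate, multi-step upload process.
```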

A few more thoughts on working openly

More than once over the course of this project I joked to myself, my collaborator, or really to anyone who would listen that “This would probably be easier or quicker if we could just do it the old way.” However, now that we’re at a point where we’ve submitted our paper (to an open access journal, of course), it’s been useful to look back on what it has been like to use these different open science methods and tools. My main takeaways are that there are a lot of ways to work openly and that what works for one researcher may not necessarily work for another. Most of the work I’ve done as a postdoc has been about meeting researchers where they are and this process has reinforced my desire to do so when talking about open science, even when the researcher in question is myself.

Like our study participants, who largely reported that their data management practices are motivated and limited by immediate practical concerns, a lot of our decisions about which open science methods and tools to apply were heavily influenced by the need to keep our project moving forward. As much as I may have wanted to, I couldn’t pause everything to completely change how I analyze data or write papers. We committed ourselves to working openly, but we also wanted to make sure we had something to show for ourselves.

Additional Reading

Borghi, J. A., & Van Gulick, A. E. (2018). Data management and sharing in neuroimaging: Practices and perceptions of MRI researchers. bioRxiv.

Neuroimaging as a case study in research data management

[Cross Posted from Medium]

Part 1: What we did and what we found

How do brain imaging researchers manage and share their data? This question, posed rather flippantly on Twitter a year and a half ago, prompted a collaborative research project. To celebrate the recent publication of a bioRxiv preprint, here is an overview of what we did, what we found, and what we’re looking to do next.

What we did and why

Magnetic resonance imaging (MRI) is a widely-used and powerful tool for studying the structure and function of the brain. Because of the complexity of the underlying signal, the iterative and flexible nature of analytical pipelines, and the cost (measured in terms of both grant funding and person hours) of collecting, saving, organizing, and analyzing such large and diverse datasets, effective research data management (RDM) is essential in research projects involving MRI. However, while the field of neuroimaging has recently grappled with a number of issues related to the rigor and reproducibility of its methods, information about how researchers manage their data within the laboratory remains mostly anecdotal.

Within and beyond the field of neuroimaging, efforts to address rigor and reproducibility often focus on problems such as publication bias and sub-optimal methodological practices and solutions such as the open sharing of research data. While it doesn’t make for particularly splashy headlines (unlike, say, this), RDM is also an important component of establishing rigor and reproducibility. If experimental results are to be verified and repurposed, the underlying data must be properly saved and organized. Said another way, even openly shared data isn’t particularly useful if you can’t make sense of it. Therefore, in an effort to inform the ongoing conversation about reproducibility in neuroimaging, Ana Van Gulick and I set out to survey the RDM practices and perceptions of the active MRI research community.

With input from several active neuroimaging researchers, we designed and distributed a survey that described RDM-related topics using language and terminology familiar to researchers who use MRI. Questions inquired about the type(s) of data collected, the use of analytical tools, procedures for transferring and saving data, and the degree to which RDM practices and procedures were standardized within laboratories or research groups. Building on my work to develop an RDM guide for researchers, we also asked participants to rate the maturity of both their own RDM practices and those of the field as a whole. Throughout the survey, we were careful to note that our intention was not to judge researchers with different styles of data management and that RDM maturity is largely orthogonal to the sophistication of data collection and analysis techniques.

Wait, what? A brief introduction to MRI and RDM.

Magnetic resonance imaging (MRI) is a medical imaging technique that uses magnetic fields and radio waves to create detailed images of organs and tissues. Widely used in medical settings, MRI has also become an important tool for neuroscience researchers, especially since the development of functional MRI (fMRI) in the early 1990s. By detecting changes in blood flow that are associated with changes in brain activity, fMRI allows researchers to non-invasively study the structure and function of the living brain.

Because there are so many perspectives involved, it is difficult to give a single comprehensive definition of research data management (RDM). But, basically, the term covers activities related to how data is handled over the course of a research project. These activities include, but are certainly not limited to, those related to how data is organized and saved, how procedures and decisions are documented, and how research outputs are stored and shared. Many academic libraries have begun to offer services related to RDM.
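
To make the “organized and saved” piece concrete, here is a minimal (and entirely hypothetical) sketch of how a lab might script a standardized project layout in Python; the folder names are illustrative, not a published standard.

```python
# Hypothetical sketch: scaffold a standardized project directory,
# one small piece of day-to-day RDM. Folder names are illustrative only.
from pathlib import Path

LAYOUT = [
    "data/raw",        # data exactly as collected; never edited in place
    "data/processed",  # cleaned or derived data, regenerated by scripts
    "code",            # analysis scripts and pipelines
    "docs",            # README, codebook, protocol notes
    "outputs",         # figures and tables destined for papers
]

def scaffold(project_root: str) -> None:
    """Create the standard folder structure for a new project."""
    root = Path(project_root)
    for folder in LAYOUT:
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Seed a README so documentation starts on day one.
    readme = root / "docs" / "README.md"
    if not readme.exists():
        readme.write_text("# Project overview\n\nDescribe the data here.\n")

scaffold("example_mri_study")  # hypothetical project name
```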

Neuroimaging research involving MRI presented an ideal case for us to study RDM among active researchers. The last few years have seen a rapid proliferation of standards, tools, and best practice recommendations related to the management and sharing of MRI data. Neuroimaging research also crosses many topics relevant to RDM support providers, such as data sharing and publication, the handling of sensitive data, and the use and curation of research software. Finally, as neuroimaging researchers who now work in academic libraries, we are uniquely positioned to work across the two communities.

What we found

After developing our survey and receiving the appropriate IRB approvals, we solicited responses during Summer 2017. A total of 144 neuroimaging researchers participated, and their responses revealed several trends that we hope will be informative for both neuroimaging researchers and data support providers in academic libraries.

Our participants indicated that their RDM practices throughout the course of a research project were largely motivated by immediate practical concerns, such as preventing the loss of data and ensuring access for everyone within a lab or research group, and limited by a lack of time and discipline-specific best practices.

We were relatively unsurprised to see that neuroimaging researchers use a wide array of software tools to analyze their often heterogeneous sets of data. What did surprise us somewhat was the difference in responses from trainees (graduate students and postdocs) and faculty on questions related to the consistency of RDM practices within their labs. Trainees were significantly less likely than faculty to say that practices related to backing up, organizing, and documenting data were standardized within their lab, which we think highlights the need for better communication about how RDM is an essential component of ensuring that research is rigorous and reproducible.

Analysis of RDM maturity ratings revealed that our sample generally rated their own RDM practices as more mature than those of the field as a whole, and practices during the data collection and analysis phases of a project as significantly more mature than those during the data sharing phase. There are several interpretations of the former result, but the latter is consistent with the low level of data sharing in the field. Though these ratings provide an interesting insight into the perceptions of the active research community, we believe there is substantial room for improvement in establishing proper RDM across every phase of a project, not just after the data has already been analyzed.

For a complete overview of our results, including an analysis of how the field of neuroimaging is at a major point of transition when it comes to the adoption of practices such as open access publishing, preregistration, and replication, check out our preprint, now on bioRxiv. While you’re at it, feel free to peruse, reuse, or remix our survey and data, both of which are available on figshare.

Is this unique to MRI research?

Definitely not. Just as the consequences of sub-optimal methodological practices and publication biases have been discussed throughout the biomedical and behavioral sciences for decades, we suspect that the RDM-related practices and perceptions observed in our survey are not limited to neuroimaging research involving MRI.

To paraphrase and reiterate a point made in the preprint, this work was intended to be descriptive, not prescriptive. We also very consciously have not provided best practice recommendations because we believe that such recommendations would be most valuable (and actionable) if developed in collaboration with active researchers. Moving forward, we hope to continue to engage with the neuroimaging community on issues related to RDM and also expand the scope of our survey to other research communities such as psychology and biomedical science.

Additional Reading

Our preprint, one more time:

Borghi, J. A., & Van Gulick, A. E. (2018). Data management and sharing in neuroimaging: Practices and perceptions of MRI researchers. bioRxiv.

For a primer on functional magnetic resonance imaging:

Soares, J. M., Magalhães, R., Moreira, P. S., Sousa, A., Ganz, E., Sampaio, A., … Sousa, N. (2016). A hitchhiker’s guide to functional magnetic resonance imaging. Frontiers in Neuroscience, 10, 1–35.

For more on rigor, reproducibility, and neuroimaging:

Nichols, T. E., Das, S., Eickhoff, S. B., Evans, A. C., Glatard, T., Hanke, M., … Yeo, B. T. T. (2017). Best practices in data analysis and sharing in neuroimaging using MRI. Nature Neuroscience, 20(3), 299–303. (Preprint)

Poldrack, R. A., Baker, C. I., Durnez, J., Gorgolewski, K. J., Matthews, P. M., Munafò, M. R., … Yarkoni, T. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115–126. (Preprint)