Catalogs and Indices for Finding (Scientific) Software

The problem

I’ve been thinking about software catalogs for a while, mostly since the program I lead at NSF (Software Infrastructure for Sustained Innovation, SI2, see the second half of http://nsf.gov/si2) has funded a fairly large number of projects, around 100.  Many of these projects are producing and supporting a piece of software, but some contribute to multiple pieces of software, and in some cases, multiple projects contribute to the same piece of software.  In addition, some of the projects were funded long enough ago that they have ended or will end soon.

The issue of catalogs or indices is important in multiple ways:

  • As a funder, I would like to have a list of the products that we are supporting or have supported, tied to the projects that we have funded.
  • As a user, I would like to know what software is likely to be around for a while, and NSF funding (particularly with the period of funding shown) would be a useful thing to know.
  • As a developer, I want others to be able to easily find (and then both use and cite) my software.

What happens in other areas?

In the publishing and web content worlds, this is somewhat similar to the roles of a publisher/aggregator/distributor, a content consumer, and a content creator.

Important artifacts are:

  • The identifiers of each item of content that enable the items to be uniquely and consistently identified: e.g., DOI, ISBN, ISSN, URL
  • The catalogs and indices that store content identifiers and related metadata (e.g., creator, publication date, format), and allow consumers to find them: e.g., publisher catalog, journal table of contents, curated pages of links, best-seller lists, Google, Google Scholar
  • The services and tools that generate such catalogs and indices, whether public or private: e.g., Google Scholar, Mendeley, university profiles and knowledge systems

What some government programs are doing

We (the SI2 program) have talked about this a lot, and what we decided to do, partially as a compromise between effort and reward, is a Google Sites page: http://bit.ly/sw-ci

NIH, as part of its Big Data to Knowledge (BD2K) initiative, held a “Software Discovery Index” workshop in May 2014, with a report available.

The Department of Energy’s Office of Science’s X-Stack program maintains a wiki to track its current projects.

DARPA has an open catalog, and other DARPA programs may join this catalog. NASA’s code.nasa.gov is based on the same idea.

All of these have value, but none is fully satisfying.  Both the SI2 page and the X-Stack wiki have problems with old projects, which either need to be manually removed or are forcibly removed (listing is tied to funding, not to the ongoing value of the project).  Centralized maintenance of these pages seems to be part of the problem.  The DARPA and NASA catalogs link to software repos, but not to DOIs, and they require metadata JSON files to be placed in a particular GitHub repository, from which the catalogs are built.
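To make the metadata-file approach concrete, here is a minimal sketch (in Python) of what a per-project catalog entry and its validation might look like. The field names and the DOI value below are hypothetical illustrations, not the actual DARPA or NASA schema.

```python
import json

# Hypothetical minimal catalog entry, loosely modeled on the idea of
# per-project JSON metadata files; these field names are illustrative,
# not the actual DARPA or NASA schema.
REQUIRED_FIELDS = {"name", "description", "license", "repository_url"}

def validate_entry(entry):
    """Return the set of required fields missing from a catalog entry."""
    return REQUIRED_FIELDS - entry.keys()

entry = {
    "name": "example-solver",                                # hypothetical project
    "description": "A hypothetical sparse linear solver.",
    "license": "Apache-2.0",
    "repository_url": "https://example.org/example-solver",
    "doi": "10.5281/zenodo.0000000",                         # optional; placeholder DOI
}

assert validate_entry(entry) == set()
print(json.dumps(entry, indent=2))
```

A crawler building such a catalog could reject or flag entries with missing fields; making a DOI an (optional or required) field is one way a catalog could link to DOIs and not just repos.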

While not a government program, freshmeat/freecode is an interesting example of a catalog failing due to low utilization and, perhaps, impractical maintenance. [pointer from James Howison]

Proposed solution

Perhaps we can build an infrastructure that allows both content creators and users to contribute?  Here’s a set of ideas that might make this work:

  • Repositories provide a simple option to publish a release, like one can do from GitHub and Zenodo, but in fewer steps and with fewer changes of servers. (Note: I assume that publication of software is a conscious decision that is made at particular times — releases.  An alternate model that could also be explored is that publication happens by default as software is checked in to repositories.)
  • The person who publishes a release fills in metadata about the release (as is done on the Zenodo page now).   These releases would then be assigned DOIs, and they would become catalog entries through the work of a crawler or a future system like Google Scholar (let’s call it OSC – open software catalog – for now).  They could also be crawled or curated for discipline-specific purposes.
    • Generating metadata for repositories is a larger challenge than just for software. The idea of user-generated metadata, as suggested here and by others, is a potential answer.
    • In order for user-generated metadata to work, we as a community need to agree on standards for such metadata (equivalent to the work going on for data in the Research Data Alliance).  Ideally, there would be a minimal set of required metadata, with optional extensions, perhaps even discipline-specific ones.
    • A related question is how much of the needed metadata can be autogenerated.  For example, commits could be used to generate author lists, or at least a starting point for a person to use in defining the authors. (Note that this kind of autogeneration is useful for credit in general; see Implementing Transitive Credit with JSON-LD and Project CRediT as examples of where it could be used.)
    • A DOAP RDF file could be used to store this information.
  • The person who publishes a release can optionally indicate the funding source (with the funding end date), which is added to the metadata and made visible, perhaps as a badge/flag, once the funder affirms that it is correct.  When the end date passes, the badge/flag would change to indicate prior funding.  If the project gets new funding, a new badge/flag can be added.
  • Quality metrics could also be generated, either by peers or by automated testing.
    • Users could give ratings and discuss successes and failures with the software (as is done in nanoHUB and app stores).  Citations to the software from papers could also be automatically generated.
    • Quality metrics could be automatically generated from developer-provided test suites.  Note that this gets back to the question of standard practices, as in the DARPA open catalog (it would be better if this were directly supported within the repositories rather than by the catalogs).  This would also potentially address the issue of bit rot (somewhat related to an idea from James Howison).
  • People looking for software to use or to further develop can search the catalog.  It’s unclear how much this would happen vs. just talking to colleagues to see what they use, but perhaps these badges/flags/scores would make this more common.
  • A related idea is that the catalog could be partially automatically generated and partially curated.  This builds on the idea of a software Wikipedia that would enable developers to know who is doing what and to collaborate rather than compete (even unintentionally), as was suggested by one of the attendees of the 2015 SI2 PI workshop. (Sorry, I don’t remember who suggested this.)
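The author-autogeneration idea above can be sketched as a small script: collapse a repository’s commit history into a ranked list of candidate authors for a person to curate. This is a hypothetical illustration; the commit data is hard-coded here, where in practice it would be parsed from something like `git log --format='%an|%ae'`.

```python
from collections import Counter

def candidate_authors(commits, min_commits=1):
    """Given (name, email) pairs from a commit history, return candidate
    author names ordered by commit count. Intended as a starting point
    for a human to curate, not a definitive author list."""
    counts = Counter((name, email.lower()) for name, email in commits)
    return [name for (name, email), n in counts.most_common() if n >= min_commits]

# Hypothetical commit history; in practice, parse the output of `git log`.
commits = [
    ("Alice", "alice@example.org"),
    ("Bob", "bob@example.org"),
    ("Alice", "Alice@example.org"),  # same person, different email case
    ("Carol", "carol@example.org"),
    ("Alice", "alice@example.org"),
]

print(candidate_authors(commits))  # Alice ranks first with three commits
```

Even this toy version shows why a human pass is still needed: the same person may commit under several names or email addresses, and commit count is only a crude proxy for authorship.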

Others discussing (or even working on) this

(I’m happy to add more links here – please email me)

Acknowledgements

Thanks to Amy Friedlander, Martin Fenner, David Proctor, James Howison, Chris Mattmann, and Arfon Smith for useful comments on an earlier draft of this blog.  The suggestions have been very helpful, and any errors are solely mine.

Disclaimer

Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.

Comments are welcome.


Published by:

danielskatz

Assistant Director for Scientific Software and Applications at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering


10 thoughts on “Catalogs and Indices for Finding (Scientific) Software”

  1. Some other angles:

    – think about publications as a storehouse to be parsed to say something about software. I chatted about this with http://www.cse.unt.edu/~ccaragea/ last week, she’s been summarizing papers based on how they are mentioned in the text of articles, something similar could be done with software (once we can recognize software mentions, of course). We have a paper with a small, but hopefully useful, dataset of actual mentions of software in the literature which could be mined (and we’re working on that, but the data is available to others): https://github.com/jameshowison/softcite

    – Doesn’t the NSF require mention of tools in annual reports? Why not require something like the Apache DOAP file generation form for reporting these tools? Then build the catalogue from that. https://projects.apache.org/create.html

    – Collections of scientific software automatically analyzed. Clearly this is putting the cart before the horse, but it does point out another advantage of open scientific workflows: they can be analyzed later. There is a storehouse of data out there in execution environments like Galaxy, but the predominant approach hasn’t seen access for analysis as a quid pro quo. The XALT project has potential here: https://github.com/Fahey-McLay/xalt and http://dx.doi.org/10.1109/HUST.2014.6


  2. Great blogpost.

    Are the more common academic repositories, sites like ResearchGate, Academia.edu, Mendeley—are these set up to handle cyberinfrastructure?

    I imagine that I could upload a one-page notice to all of these linking to my software project (and ResearchGate would generate a DOI for it) and then the software would be readily findable on Google and Google Scholar. The site figshare also generates DOIs, but I’m not sure how popular figshare is, so it may not be around for too long.

    My initial thought is that someone will create a website for housing academic software, and that ResearchGate will copy it, with greater effect.

    What thoughts?


  3. Great overview of the issues, Dan — we really need a better system for software metadata. Hopefully community convergence for software metadata won’t be as painful as it has been for metadata for datasets.

    You might be interested in our work on minimal metadata for software through the codemeta github repository (https://github.com/mbjones/codemeta), which is part of the Code as a Research Object effort at Mozilla Science Lab. In particular, we are working on a crosswalk of various software metadata specifications (https://github.com/mbjones/codemeta/blob/master/codemeta-crosswalk.md) so that we might come up with an agreed vocabulary, hopefully one that is aligned with related metadata-for-data efforts. As usual, different groups have varying ideas of what ‘minimal’ should be, and different preferences for serialization languages for the metadata (JSON, RDF, etc.).

    You might want to mention the Earth Cube GeoSoft project as another that is working on a software catalog (http://www.geosoft-earthcube.org/).


  4. Very timely topic. There are some probably relevant prior activities on the math/numerical side of things. The most systematic might be

    http://gams.nist.gov/

    and more narrowly

    http://toms.acm.org/

    with similar organization of content around other journal families such as CPC.

    And there are of course things closer to archives rather than a catalog such as

    http://www.netlib.org/

    And finally there are integrated libraries such as the Gnu Scientific Library

    http://www.gnu.org/software/gsl/

    Reusing software, esp. detailed numerical algorithms such as for special function evaluation, is extending a lot of trust that is usually not warranted by the design, curation and overall quality of most software sources.

    I very much agree that there is potentially huge value in finding new mechanisms for us to turn academic software development into an even more communal activity than it is now. And increased reuse need not be the only or even the main objective — sharing skills and knowledge will have a greater return than just sharing code.

    Robert


  5. Good post! You mention nanoHUB.org as an example of a site doing reviews, but it is also an example of a catalog. nanoHUB.org has 360+ nanotechnology tools that are citable via Digital Object Identifiers. All of these are tools you can run right on the site via a browser. If the tools are published with an open source license, then source code bundles are available for download too. I think that’s the best kind of catalog–not just a listing of tools, but an environment built to actually run the tools with the push of a button.

    Of course, nanoHUB.org is just one of more than 60+ sites built on the HUBzero platform. All of those sites have the capability of hosting tools for their community, and there are many hubs, including NEES.org and pharmaHUB.org, with good collections of tools. I think HUBzero is the platform that you’re asking for, which supports publication of software with Digital Object Identifiers, with ORCID IDs for authors, with built-in project areas and source code control repositories, and with a live system that lets you not just find the tools, but actually use them. Check it out: https://www.youtube.com/watch?v=p_hpwoiXvEE or http://hubzero.org


    1. Hi Michael,
      Thanks for this comment. I implicitly considered nanoHUB as a catalog, and in particular, one that is built by the submitters, but wanted to emphasize the review part to separate it from other catalogs.
      Dan


  6. Great post. Another project that also goes in this direction is NumFOCUS. The approach is a bit different, though.
    http://numfocus.org/
    They are more focused on Python code, but this project also looks very interesting, for example in having a working group for applying for grants and getting financial support.

    Regarding the catalogue, I think it would be very helpful if each software in the catalogue were required to have registered example data sets and expected results. When I use open-source software, I always find it helpful to be able to test it with the provided examples and check the results. Some simulations, of course, won’t always deliver exactly the same data, but the range should be similar.

    Sandra


  7. Very nice post. I wanted to share two efforts from the biomedical informatics program initiatives at NCI:

    Based on the catalogs of DARPA and NASA, as part of an open-development initiative we have begun depositing open-source code for biomedical informatics software applications in GitHub, where it can be viewed at https://github.com/ncip. We are encouraging grantees to deposit NCI-funded software applications in this repository to make them more discoverable.

    We also have a project in the NCIP Hub (https://www.nciphub.org). This community development project leverages the HUBzero infrastructure just like nanoHUB and hopes to empower the cancer informatics community, both content creators and users, to contribute towards a catalog, be it software tools, data, standards or other relevant digital assets and mutually benefit by sharing, collaborating and learning from each other to accelerate their own research. Contributors can not only share open source code and get DOIs assigned but also collaborate on application development, run applications as well as get community input, all under one umbrella.

