
Tuesday, December 22, 2015

xISBN: RIP


When I joined OCLC in 2006 (via acquisition), one thing I was excited about was the opportunity to make innovative uses of OCLC's vast bibliographic database. And there was an existence proof that this could be done: a neat little API that had been prototyped in OCLC's Office of Research, xISBN.

xISBN was an example of a microservice- it offered a small piece of functionality and it did it very fast. Throw it an ISBN, and it would give you back a set of related ISBNs. Ten years ago, microservices and mashups were all the rage. So I was delighted when my team was given the job of "productizing" the xISBN service- moving it out of research and into the marketplace.

Last week,  I was sorry to hear about the imminent shutdown of xISBN. But it got me thinking about the limitations of services like xISBN and why no tears need be shed on its passing.

The main function of xISBN was to say "Here's a group of books that are sort of the same as the book you're asking about." That summary instantly tells you why xISBN had to die: any time a computer tells you something "sort of", it's a latent bug, because where you draw the line between something that's the same and something that's different is a matter of opinion and depends on the use you want to make of the distinction. For example, if you ask for A Study in Scarlet, you might be interested in a version in Chinese, or you might want a paperback version, or you might want Sherlock Holmes compilations that include A Study in Scarlet. For each question you want a slightly different answer. If you are a developer needing answers to these questions, you would combine xISBN with other information services to get what you need.

Today we have better ways to approach this sort of problem. Serious developers don't want a microservice; they want rich "Linked Data". In 2015, most of us can afford our own data-crunching big-data-stores-in-the-cloud, and we don't need to trust algorithms we can't control. OCLC has been publishing rather nice Linked Data for this purpose. So, if you want all the editions of Cory Doctorow's Homeland, you can "follow your nose" and get all the data you need.

  1. First you look up the ISBN at http://www.worldcat.org/isbn/9780765333698
  2. which leads you to http://www.worldcat.org/oclc/795174333.jsonld (containing a few more ISBNs),
  3. then you can follow the associated "work" record: http://experiment.worldcat.org/entity/work/data/1172568223,
  4. which yields a bunch more ISBNs.
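If you want to script that walk, here's a rough sketch in Python. It assumes the requests library and that related ISBNs appear somewhere in the two JSON-LD documents; since the exact key names aren't guaranteed, it just scans for keys ending in "isbn" (the .jsonld suffix on the work URL is likewise an assumption, by analogy with step 2):

import requests

def isbns_in(node, found=None):
    # Recursively collect the values of any key that looks like an ISBN property.
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            values = value if isinstance(value, list) else [value]
            if key.lower().endswith("isbn"):
                found.update(str(v) for v in values)
            else:
                for v in values:
                    isbns_in(v, found)
    elif isinstance(node, list):
        for item in node:
            isbns_in(item, found)
    return found

edition = requests.get("http://www.worldcat.org/oclc/795174333.jsonld").json()
work = requests.get("http://experiment.worldcat.org/entity/work/data/1172568223.jsonld").json()
print(sorted(isbns_in(edition) | isbns_in(work)))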

It's a lot messier than xISBN, but that's mostly because the real world is messy. Every application requires a different sort of cleaning up, and it's not all that hard.

If cleaning up the mess seems too intimidating, and you just want light-weight ISBN hints from a convenient microservice, there's always "thingISBN". ThingISBN is a data exhaust stream from the LibraryThing catalog. To be sustainable, microservices like xISBN need to be exhaust streams. The big cost to any data service is maintaining the data, so unless maintaining that data is in the engine block of your website, the added cost won't be worth it. But if you're doing it anyway, dressing the data up as a useful service costs you almost nothing and benefits the environment for everyone. Let's hope that OCLC's Linked Data services are of this sort.

In thinking about how I could make the data exhaust from Unglue.it more ecological, I realized that a microservice connecting ISBNs to free ebook files might be useful. So with a day of work, I added the "Free eBooks by ISBN" endpoint to the Unglue.it api.

xISBN, you lived a good micro-life. Thanks.

Sunday, July 31, 2011

Library Data Beyond the Like Button

"Aren't you supposed to be working on your new business? That ungluing ebooks thing? Instead you keep writing about library data, whatever that is. What's going on?"

No, really, it all fits together in the end. But to explain, I need to take you beyond the "Like Button".

Earlier this month, I attended a lecture at the New York Public Library. The topic was Linked Open Data, and the speaker was Jon Voss, who's been applying this technology to historical maps. It was striking to see how many people from many institutions turned out, and how enthusiastically Jon's talk was received. The interest in Linked Data was similarly high at the American Library Association Meeting in New Orleans, where my session (presented with Ross Singer of Talis) was only one of several Linked Data sessions that packed meeting rooms and forced attendees to listen from hallways.

I think it's important to convert this level of interest into action. The question is, what can be done now to get closer to the vision of ubiquitous interoperable data? My last three posts have explored what libraries might do to better position their presence in search engines and in social networks using schema.org vocabulary and the Open Graph Protocol. In these applications, library data enables users to do very specific things on the web- find a library page in a search engine or "Like" a library page on Facebook. But there's so much more that could be done with the data.

I think that library data should be handled as if it was made of gold, not of diamond.

Perhaps the most amazing property of gold is its malleability. Gold can be pounded into a sheet so thin that it's transparent to light. An ounce of gold can be made into leaf that will cover 25 square meters.

There is a natural tendency to treat library data as a gem that needs skillful cutting and polishing. The resulting jewel will be so valuable that users will beat down library websites to get at the gems. Yeah.

The reality is that library data is much more valuable as a thin layer that covers huge swaths of material. When data is spread thinly, it has a better chance of connecting with data from other libraries and with other sorts of institutions: museums, archives, businesses, and communities. By contrast, deep data, the sort that focuses on a specific problem space, is unlikely to cross domains or applications without a lot of custom programming and data tweaking.

Here's the example that's driven my interest in opening up library linked data: At Gluejar, we're building a website that will ask people to go beyond "liking" books. We believe that books are so important to people that they will want to give them to the world; to do that we'll need to raise money. If lots of people join together around a book, it will be easy to raise the money we need, just as public radio stations find enough supporters to make the radio free to everyone.

We don't want our website to be a book discovery website, or a social network of readers, or a library catalog; other sites do that just fine. What we need is for users to click "support this book" buttons on all sorts of websites, including library catalogs. And our software needs to pull just a bit of data off of a webpage to allow us to figure out which book the user wants to support. It doesn't sound so difficult. But we can only support two or three different interfaces to that data. If library websites all put a little more structured data in their HTML, we could do some amazing things. But they don't, and we have to settle for "sort of works most of the time".

Real books get used in all sorts of ways. People annotate them, they suggest them to friends, they give them away, they quote them, and they cite them. People make "TBR" piles next to their beds. Sometimes, they even read and remember them as long as they live. The ability to do these same things on the web would be pure gold.

Tuesday, July 12, 2011

Spoonfeeding Library Data to Search Engines

CC-NC-BY rocketship
When you talk to a search engine, you need to realize that it's just a humongous baby. You can't expect it to understand complicated things. You would never try to teach language to a human baby by reading it Nietzsche, and you shouldn't expect a baby google to learn bibliographic data by feeding it MARC (or RDA or METS or MODS, or even ONIX).

When a baby says "goo-goo" to you, you don't criticize its misuse of the subjunctive. You say "goo-goo" back. When Google tells you that it wants to hear "schema.org" microdata, you don't try to tell it about the first indicator of the 856 ‡u subfield. You give it schema.org microdata, no matter how babyish that seems.

It's important to build up a baby's self-confidence. When baby google expresses interest in the number of pages of a book, you don't really want to be specifying that there are ix pages numbered with roman numerals and 153 pages with arabic numerals in shorthand code. When baby google wants to know whether a book is "family friendly" you don't want to tell it about 521 special audience characteristics, you just want to tell it whether or not it's porn.

If you haven't looked at the schema.org model for books, now's a good time. Don't expect to find a brilliant model for book metadata, expect to find out what a bibliographic neophyte machine thinks it can use a billion times a day. Schema.org was designed by engineers from Google, Yahoo, and Bing. Remember, their goal in designing it was not to describe things well, it was to make their search results better and easier to use.

The thing is, it's not such a big deal to include this sort of data in a page that comes from a library OPAC (online catalog). An OPAC that publishes unstructured data produces HTML that looks something like this:
<div> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

The first step is to mark something as the root object. You do that with the itemscope attribute:
<div itemscope> 
<h1>Avatar</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

A microdata-aware search engine looking at this will start building a model. So far, the model has one object, which I'll denote with a red box.


The second step, using microdata and Schema.org, is to give the object a type. You do that with the itemtype attribute:
<div itemscope itemtype="http://schema.org/Book"> 
<h1>Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: Paul Bryers (born 1945)</span> 
 <span>Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Now the object in the model has acquired the type "Book" (or more precisely, the type "http://schema.org/Book").

Next, we give the Book object some properties:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg">
</div>

Note that while the library record for this book attempts to convey the title complexity: "245 10 $aAvatar /$cPaul Bryers.$", the search engine doesn't care yet. The book is part of a series: 490 1 $aThe mysteries of the Septagram$, and the search engines don't want to know about that either. Eventually, they'll learn.
The model built by the search engine looks like this:

So far, all the property values have been simple text strings. We can also add properties that are links:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <span>Author: 
<span itemprop="author">Paul Bryers (born 1945)</span></span> 
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>
The model grows.

Finally, we want to say that the author, Paul Bryers, is an object in his own right. In fact, we have to, because the value of an author property has to be a Person or an Organization in Schema.org. So we add another itemscope attribute, and give him some properties:
<div itemscope itemtype="http://schema.org/Book"> 
<h1 itemprop="name">Avatar (Mysteries of Septagram, #2)</h1>
 <div itemprop="author" itemscope itemtype="http://schema.org/Person">
Author:  <span itemprop="name">Paul Bryers</span> 
(born <span itemprop="birthDate">1945</span>)
 </div>
 <span itemprop="genre">Science fiction</span>
 <img src="http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg" 
itemprop="image">
</div>

That wasn't so hard. Baby has this picture in his tyrannical little head:
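Spelled out as data rather than a diagram, the model amounts to the nested structure below. This is just a sketch in Python built from the values in the marked-up snippet above; a real microdata parser will have its own output format:

import json

book = {
    "type": "http://schema.org/Book",
    "properties": {
        "name": ["Avatar (Mysteries of Septagram, #2)"],
        "genre": ["Science fiction"],
        "image": ["http://coverart.oclc.org/ImageWebSvc/oclc/+-+703315758_140.jpg"],
        "author": [{
            "type": "http://schema.org/Person",
            "properties": {
                "name": ["Paul Bryers"],
                "birthDate": ["1945"]
            }
        }]
    }
}
print(json.dumps(book, indent=2))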

Which it can easily turn into a "rich snippet" that looks like this:

Though you know all it really cares about is milk.

Here's a quick overview of the properties a Schema.org/Book can have (the values in parentheses indicate a type for the property value):

Properties from http://schema.org/Thing
  • description
  • image(URL)
  • name
  • url(URL)
Properties from http://schema.org/CreativeWork
Properties from http://schema.org/Book
This post is the second derived from my talk at ALA in New Orleans. The first post discussed the changing role of digital surrogates in a fully digital world. The next will discuss "Like" buttons.

Friday, July 8, 2011

Library Data: Why Bother?

When face recognition came out in iPhoto, I was amused when it found faces in shrubbery and asked me whether they were friends of mine. iPhoto, you have such a sense of humor!

But then iPhoto looked at this picture of a wall of stone faces in Baoding, China. It highlighted one of the faces and asked me "Is this Jane?" I was taken aback, because the stone depicted Jane's father. iPhoto was not as stupid as I thought it was- it could even see family resemblances.

Facial recognition software is getting better and better, which is one reason people are so worried about the privacy implications of Facebook's autotagging of pictures. Imagine what computers will be able to do with photos in 10 years! They'll be able to recognize pictures of bananas, boats, beetles and books. I'm thinking it's probably not worth it to fill in a lot of iPhoto metadata.

I wish I had thought about facial recognition when I was preparing my talk for the American Library Association Conference in New Orleans. I wanted my talk to motivate applications for Linked Open Data in libraries, and in thinking about why libraries should be charting a path towards Linked Data, I realized that I needed to examine first of all the motivation for libraries to be in the bibliographic data business in the first place.

Originally, libraries invested in bibliographic data to help people find things. Libraries are big and have a lot of books. It's impractical for library users to find books solely by walking the stacks, unless the object of the search has been anticipated by the ordering of books on the shelves. The paper cards in the card catalog could be easily duplicated to enable many types of search in one compact location. The cards served as surrogates for the physical books.

When library catalogs became digital, much more powerful searches could be done. The books acquired digital surrogates that could be searched with incredible speed. These surrogates could be used for a lot of things, including various library management tasks, but finding things was still the biggest motivation for the catalog data.

We're now in the midst of a transition where books are turning into digital things, but cataloging data hasn't changed a whole lot. Libraries still need their digital surrogates because most publishers don't trust them with the full text of books. But without full text, libraries are unable to provide the full-featured discovery that a search engine with access to both the full text and metadata (Google, Overdrive, etc.) can provide.

At the same time, digital content files are being packed with more and more metadata from the source. Photographs now contain metadata about where, when and how they were taken; for a dramatic example of how this data might be used, take a look at this study from the online dating site OKCupid. Book publishers are paying increased attention to title-level metadata, and metadata is being built into new standards such as EPUB3. To some extent, this metadata is competing for the world's attention with library-sourced metadata.

Libraries have two paths to deal with this situation. One alternative is to insist on getting the full text for everything they offer. (Unglued ebooks offer that, that's what we're working on at Gluejar.)

The other alternative for libraries is to feed their bibliographic data to search engines so that library users can discover books in libraries. Outside libraries, this process is known as "Search Engine Optimization". When I said during my talk that this should be the number one purpose of library data looking forward, one tweeter said it was "bumming her out". If the term "Search Engine Optimization" doesn't work for you, just think of it as "helping people find things".

Library produced data is still important, but it's not essential in the way that it used to be. The most incisive question during my talk pointed out that the sort of cataloging that libraries do is still absolutely essential for things like photographs and other digital archival material. That's very true, but only because automated analysis of photographs and other materials is computationally hard. In ten years, that might not be true. iPhoto might even be enough.

In the big picture, very little will change: libraries will need to be in the data business to help people find things. In the close-up view, everything is changing- the materials and players are different, the machines are different, and the technologies can do things that were hard to imagine even 20 years ago.

In a following post, I'll describe ways that libraries can start publishing linked data, feeding search engines, and keep on helping people find stuff. The slides from my talk (minus some copyrighted photos) are available as PDF (4.8MB) and PPTX (3.5MB).

Sunday, June 27, 2010

Global Warming of Linked Data in Libraries

Libraries are unusual social institutions in many respects; perhaps the most bizarre is their reverence for metadata and its evangelism. What other institution considers the production, protection and promulgation of metadata to be part of its public purpose?

The W3C's Linked Data activity shares this unusual mission. For the past decade, W3C has been developing a technology stack and methodology designed to support the publication and reuse of metadata; adoption of these technologies has been slow and steady, but the impact of this work has fallen short of its stated ambitions.

I've been at the American Library Association's Annual Meeting this weekend. Given the common purpose of libraries and Linked Data, you would think that Linked Data would be a hot topic of discussion. The weather here has been much hotter than Linked Data, which I would describe as "globally warming". I've attended two sessions covering Linked Data, each attended by between 50 and 100 delegates. These followed a day-long, sold-out preconference. John Phipps, one of the leaders in the effort to make library metadata compatible with the semantic web, remarked to me that these meetings would not have been possible even a year ago. Still, this attendance reflects only a tiny fraction of the metadata workers at the conference; Linked Data has quite a ways to go. It was only a few months ago that the W3C formed a Library Linked Data Incubator Group.

On Friday morning, there was an "un-conference" organized by Corey Harper from NYU and Karen Coyle, a well-known consultant. I participated in a subgroup looking at use cases for library Linked Data. It took a while for us to get around to use cases though, as participants reported that usage was occurring, but they weren't sure what for. Reports from OCLC (VIAF) and Library of Congress (id.loc.gov) both indicated significant usage but little feedback. The VIVO project was described as one with a solid use case (giving faculty members a public web presence), but no one from VIVO was in attendance.

On Sunday morning, at a meeting of the Association for Library Collections and Technical Services (ALCTS), Rebecca Guenther of the Library of Congress discussed id.loc.gov, a service that enables both humans and machines to access authority data at the Library of Congress. Perhaps the most significant thing about id.loc.gov is not what it does but who is doing it. The Library of Congress provides leadership for the world of library cataloguing; what LC does is often slavishly imitated in libraries throughout the US and the rest of the world. id.loc.gov started out as a research project but is now officially supported.

Sara Russell-Gonzalez of the University of Florida then presented the VIVO project, which has won a big chunk of funding from the National Center for Research Resources, a branch of NIH. The goal of VIVO is to build an "interdisciplinary national network enabling collaboration and discovery between scientists across all disciplines." VIVO started at Cornell and has garnered strong institutional support there, as evidenced by an impressive web site. If VIVO is able to gain similar support nationally and internationally, it could become an important component of an international research infrastructure. This is a big "if". I asked if VIVO had figured out how to handle cases where researchers change institutional affiliations; the answer was "No". My question was intentionally difficult; Ian Davis has written cogently about the difficulties RDF has in treating time-dependent relationships. It turns out that there are political issues as well. Cornell has had to deal with a case where an academic department wanted to expunge affiliation data for a researcher who left under cloudy circumstances.

At the un-conference, I urged my breakout group to consider linked data as a way to expose library resources outside of the library world as well as a model for use inside libraries. It's striking to me that libraries seem so focused on efforts such as RDA, which aim to move library data models into Semantic Web-compatible formats. What they aren't doing is making library data easily available in models understandable outside the library.

The two most significant applications of Linked Data technologies so far are Google's Rich Snippets and Facebook's Open Graph Protocol (whose user interface, the "Like" button, is perhaps the semantic web's most elegant and intuitive). Why aren't libraries paying more attention to making their OPAC results compatible with these applications by embedding RDFa annotations in their web-facing systems? It seems to me that the entire point of metadata in libraries is to make collections accessible. How better to do this than to weave this metadata into people's lives via Facebook and Google? Doing this will require the dumbing-down of library metadata and some hard swallowing, but it's access, not metadata quality, that's core to the reason that libraries exist.




Tuesday, May 4, 2010

Authors are Not People: ORCID and the Challenges of Name Disambiguation

In 1976, Robert E. Casey, the Recorder of Deeds of Cambria County, Pennsylvania, let his bartender talk him into running for State Treasurer. He didn't take the campaign very seriously, in fact, he went on vacation instead. Nonetheless, he easily defeated the party-endorsed candidate in the Democratic Primary and went on to win the general election. It seems that voters thought they were voting for Robert P. Casey, a popular former State Auditor General and future Governor.

Robert P. Casey almost won the Pennsylvania Lieutenant Governor's race in 1978. No, not that Robert P. Casey, this Robert P. Casey was a former teacher and ice cream salesman. Robert P. Casey, Jr., the son of the "real" Robert P. Casey, was elected to the United States Senate in 2006. Name disambiguation turns out to be optional in politics.

That's not to say ambiguous names don't cause real problems. My name is not very common, but still I occasionally get messages meant for another Eric Hellman. A web search on a more common name like "Jim Clark" will return results covering at least eight different Jim Clarks. You can often disambiguate the Jim Clarks based on their jobs or place of residence, but this doesn't always work. Co-authors of scholarly articles with very similar or even identical names are not so uncommon- think of father-son or husband-wife research teams.

The silliest mistake I made in developing an e-journal production system back when I didn't know it was hard was to incorrectly assume that authors were people. My system generated webpages from a database, and each author corresponded to a record in the database with the author's name, affiliations, and a unique key. Each article was linked to the author by unique key, and each article's title page was generated using the name from the author record. I also linked the author table to a database of cited references; authors could add their published papers to the database. Each author name was hyperlinked to a list of all the author's articles.

I was not the first to have this idea. In 1981, Kathryn M. Soukup and Silas E. Hammond of the Chemical Abstracts Service wrote:
If an author could be "registered" in some way, no matter how the author's name appeared in a paper, all papers by the author could automatically be collected in one place in the Author Indexes.

Here's what I did wrong: I supposed that each author should be able to specify how their name should be printed; I always wanted my name on scientific papers to be listed as "E. S. Hellman" so that I could easily look up my papers and citations in the Science Citation Index. I went a bit further, though. I reasoned that people (particularly women) sometimes changed their names, and if they did so, my ejournal publishing system would happily change all instances of their name to the new name. This was a big mistake. Once I realized that printed citations to old papers would break if I retroactively changed an author's name, I made author name immutable for each article, even when the person corresponding to the author changed her name.

Fifteen years later, my dream of a cross-publication author identifier may be coming true. In December, a group of organizations led by Thomson Reuters (owners of the Web of Knowledge service that is the descendant of the Science Citation Index) and the Nature Publishing Group announced (pdf, 15kB) the launch of an effort to create unique identifiers for scientific authors. Named ORCID, for Open Researcher & Contributor ID, the organization will try to turn Thomson Reuters' Researcher ID system into an open, self-sustaining non-profit service for the scholarly publishing, research and education communities.

This may prove to be more challenging than it sounds, both technically and organizationally. First, the technical challenges. There are basically three ways to attack the author name disambiguation problem: algorithmically, manually, and socially.

The algorithmic attack, which has a long history, has been exploited on a large scale by Elsevier's SCOPUS service, so the participation of Elsevier in the ORCID project bodes well for its chances of success. Although this approach has gone a long way, algorithms have their limits. They tend to run out of gas when faced with sparse data; it's estimated that almost half of authors have their names appear only once on publications.

The manual approach to name disambiguation turns out not to be as simple as you might think. Thomson Reuters's ISI division has perhaps the longest experience with this problem, and the fact that they're leading the effort to open name disambiguation to their competitors suggests that they've not found any magic bullets. Neil R. Smalheiser and Vetle I. Torvik have published an excellent review of the entire field (Author Name Disambiguation, pdf 179K) which includes this assessment:
... manual disambiguation is a surprisingly hard and uncertain process, even on a small scale, and is entirely infeasible for common names. For example, in a recent study we chose 100 names of MEDLINE authors at random, and then a pair of articles was randomly chosen for each name; these pairs were disambiguated manually, using additional information as necessary and available (e.g., author or institutional homepages, the full-text of the articles, Community of Science profiles (http://www.cos.com), Google searches, etc.). Two different raters did the task separately. In over 1/3 of cases, it was not possible to be sure whether or not the two papers were written by the same individual. In a few cases, one rater said that the two papers were “definitely by different people” and the other said they were “definitely by the same person”!
(Can it be a coincidence that so much research in name disambiguation is authored by researchers with completely unambiguous names?)

The remaining approach to the author name problem is to involve the authoring community, which is the thrust of the ORCID project. Surely authors themselves know best how to disambiguate their names from others! There are difficulties with this approach, not the least of which is to convince a large majority of authors to participate in the system. That's why ORCID is being structured as a non-profit entity with participation from libraries, foundations and other organizations in addition to publishers.

In addition to the challenge of how to gain acceptance, there are innumerable niggling details that will have to be addressed. What privacy expectations will authors demand? How do you address publications by dead authors? How do you deal with fictitious names and pseudonyms? What effect will an author registry have on intellectual property rights? What control will authors have over their data? How do you prevent an author from claiming another's publications to improve their own publication record? How do you prevent phishing attacks? How should you deal with non-roman scripts and transliterations?

Perhaps the greatest unsolved problem for ORCID is its business model. If it is to be self-sustaining, it must have a source of revenue. The group charged with developing ORCID's business model is currently looking at memberships and grants as the most likely sources of funds, recognizing that the necessity for broad author participation precludes author fees as a revenue source. ORCID commercial participants hope to use ORCID data to pull costs out of their own processes, to fuel social networks for authors or to drive new or existing information services. Libraries and research foundations hope to use ORCID data to improve information access, faculty rankings and grant administration processes. All of these applications will require that restrictions on the use of ORCID data be minimal, limiting ORCID's ability to offer for-fee services. The business conundrum for ORCID is very similar to that faced by information producers who are considering publication of Linked Open Data.

ORCID will need to navigate between the conflicting interests of its participants. CrossRef, which I've written about frequently, has often been cited as a possible model for the ORCID organization. (CrossRef has folded its Contributor ID project into ORCID.) The initial tensions among CrossRef's founders, which resulted from the differing interests of large and small publishers, primary and secondary publishers, and commercial and nonprofit publishers, may seem comparatively trivial when libraries, publishers, foundations and government agencies all try to find common purpose in ORCID.

It's worth imagining what an ORCID and Linked Data enabled citation might look like in ten years. In my article on linking architecture, I used this citation as an example:
D. C. Tsui, H. L. Störmer and A. C. Gossard, Phys. Rev. Lett. 48, 1559 (1982).
Ten years from now, that citation should have three embedded ORCID identifiers (and will arrive in a tweet!). My Linked Data enabled web browser will immediately link the ORCID ids to wikipedia identifiers for the three authors (as simulated by the links I've added). I'll be able find all the articles they wrote together or separately, and I'll be able to search all the articles they've written. My browser would immediately see that I'm friends with two of them on Facebook, and will give me a list of articles they've "Liked" in the last month.

You may find that vision to be utopian or nightmarish, but it will happen, ORCID or not.

Photo of the "real" Robert P Casey taken by Michael Casey, 1986, licensed under the Creative Commons Attribution 2.5 Generic license.

Wednesday, April 28, 2010

Pick this Nit: Null Path URIs and the Pedantic Web

There is no surer way to flush out software bugs and configuration errors than to do a sales demo. The process not only exposes the problem, but also sears into the psyche of the demonstrator an irrational desire to see the problem eradicated from the face of the earth, no matter the cost or consequences.
Here's a configuration problem I once found while demonstrating software to a potential customer:
Many library information services can be configured with the base URL for the institution's OpenURL server. The information service then constructs links by appending "?" and a query string onto the base URL. So for example, if the base URL is
http://example.edu/links
and the query string is
isbn=9780393072235&title=The+Big+Short ,
the constructed URL is
http://example.edu/links?isbn=9780393072235&title=The+Big+Short.
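In code, that construction is nothing more than concatenation; here's a minimal sketch (urlencode and the variable names are only for illustration):

from urllib.parse import urlencode

base_url = "http://example.edu/links"
query = urlencode({"isbn": "9780393072235", "title": "The Big Short"})
print(base_url + "?" + query)
# http://example.edu/links?isbn=9780393072235&title=The+Big+Short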
For the demo, we had configured the base URL to be very short: http://example.edu, so the constructed URL would have been http://example.edu?isbn=9780393072235&title=The+Big+Short. Everything worked fine when we tested beforehand. For the customer demo, however, we used the customer's computer, which was running some Windows version of Internet Explorer that we hadn't tested, and none of the links worked. Internet Explorer had this wonderful error page that made it seem as if our software had broken the entire web. Luckily, breaking the entire web was not uncommon at the time, and I was able to navigate to a different demo site and make it appear as if I had fixed the entire web, so we managed to make the sale anyway.
It turns out that http URLs with null paths aren't allowed to have query strings. You wouldn't know it if you looked at the W3C documentation for URIs, which is WRONG, but you will see it if you look at the IETF specs, which have jurisdiction (see RFC 1738 and RFC 2616).
Internet Explorer was just implementing the spec, ignoring the possibility that someone might ignore or misinterpret it. The fact that Netscape worked where IE failed could be considered a bug or a feature, but most users probably considered Netscape's acceptance of illegal URLs to be a feature.
I still feel a remnant of  pain every time I see a pathless URL with a query string. Most recently, I saw a whole bunch of them on the thing-described-by site and sent a nit-picky e-mail to the site's developer, and was extremely pleased when he fixed them. (Expeditious error fixing will be richly rewarded in the hereafter.) I've come to recognize, however, that a vast majority of these errors will never be fixed or even noticed, and maybe that's even a good thing.
Nit picking appears to have been a highlight of the Linked Data on the Web Meeting in Raleigh, NC yesterday, which I've followed via Twitter. If you enjoy tales of nerdy data disasters or wonky metadata mischief, you simply must peruse the slides from Andreas Harth's talk (1.8M, pdf) on "Weaving the Pedantic Web". If you're serious about understanding real-world challenges for the Semantic Web, once you've stopped laughing or crying at the slides you should also read the corresponding paper (pdf, 415K ). Harth's co-authors are Aidan Hogan, Alexandre Passant, Stefan Decker, and Axel Polleres from DERI.
The DERI team has studied the incidence of various errors made by publishers of Linked Data "in the wild". Not so surprisingly, they find a lot of problems. For example, they find that 14.3% of triples in the wild use an undeclared property and 8.1% of the triples use an undeclared class. Imagine if a quarter of all sentences published on the web used words that weren't in the dictionary, and you'd have a sense of what that means. 4.7% of typed literals were "ill-typed". If 5% of the numbers in the phone book had the wrong number of digits, you'd probably look for another phone book.
They've even found ways that seemingly innocuous statements can have serious repercussions. It turns out that it's possible to "hijack" a metadata schema, and induce a trillion bad triples with a single Web Ontology Language (OWL) assertion.
To do battle with the enemy of badly published Linked Data, the DERI team urges community involvement in a support group that has been formed to help publishers fix their data. The "Pedantic Web" has 137 members already. This is a very positive and necessary effort. But they should realize that the correct data cause is a hopeless one. The vast majority of potential data publishers really don't care about correctness, especially when some of the mistakes can be so subtle. What they care about is accomplishing specific goals. The users of my linking software only cared that the links worked. HTML authors mostly care only that the web page looks right. Users of Facebook or Google RDFa will only care that the Like buttons or Rich Snippets work, and the fact that the schemas for these things either don't exist in machine readable form or are wildly inconsistent with the documentation is a Big Whoop.
Until of course, somebody does a sales demo, and the entire web crashes.
(nit and head louse photos from Wikimedia Commons)

Thursday, April 22, 2010

Facebook vs. Twitter: To Like or To Annotate?

Facebook and Twitter each held developer conferences recently, and the conference names speak worlds about the competing worldviews. Twitter's conference was called "Chirp", while Facebook's conference was labeled "f8" (pronounced "FATE"). Interestingly, both companies used their developer conferences to announce new capability to integrate meaning into their networks.

Facebook's announcement surrounded something it's calling the "Open Graph protocol". Facebook showed its market power by rolling it out immediately with 30 large partner sites that are allowing users to "Like" them on Facebook. Facebook's vision is that web pages representing "real-world things" such as movies, sports teams, products, etc. should be integrated into Facebook's social graph. If you look at the right-hand column of this blog, you'll see an opportunity to "Like" the Blog on Facebook. That action has the effect of adding a connection between a node that represents you on Facebook with a node that represents the blog on Facebook. The Open Graph API extends that capability by allowing the inclusion of web-page nodes from outside Facebook in the Facebook "graph". A webpage just needs to add a bit of metadata into its HTML to tell Facebook what kind of thing it represents.
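For instance, a page describes itself with a few og: meta tags in its head; the sketch below shows both a snippet of such markup and one way a consumer might read it (the values are illustrative, and the little parser just stands in for whatever Facebook actually runs):

from html.parser import HTMLParser

class OpenGraphReader(HTMLParser):
    # Collect <meta property="og:..."> tags the way a graph consumer might.
    def __init__(self):
        super().__init__()
        self.properties = {}
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("property", "").startswith("og:"):
            self.properties[a["property"]] = a.get("content")

html = '''<head>
<meta property="og:title" content="Go To Hellman" />
<meta property="og:type" content="blog" />
<meta property="og:url" content="http://go-to-hellman.blogspot.com/" />
</head>'''

reader = OpenGraphReader()
reader.feed(html)
print(reader.properties)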

I've written previously about RDFa, the technology that Facebook chose to use for Open Graph. It's a well designed method for adding machine-readable metadata into HTML code. It's not the answer to all the world's problems, but it can't hurt. When Google announced it was starting to support RDFa last summer, it seemed to be hedging its bets a bit. Not Facebook.

The effect of using RDFa as an interface is to shift the arena of competition. Instead of forcing developers to choose which APIs to support in code, using RDFa asks developers to choose among metadata vocabularies to support their data model. Like Google, Facebook has created its own vocabularies rather than use someone else's. Also, like Google last summer, the documentation for the metadata schemas seems not to have been a priority. Although Facebook has put up a website for Open Graph protocol at http://opengraphprotocol.org/ and a google group at http://groups.google.com/group/open-graph-protocol, there are as yet no topics approved for discussion in the group. [Update- the group is suddenly active, though tightly moderated.]

Nonetheless, websites that support Facebook's metadata will also be making that metadata available to everyone, including Google, putting increased pressure on websites to make available machine readable metadata  as the ticket price for being included in Facebook's (or anyone's) social graph. A look at Facebook's list of object types shows their business model very clearly. Here's their "Product and Entertainment" category:
  • album
  • book
  • drink
  • food
  • game
  • movie
  • product
  • song
  • tv_show
Whether you "Like" it or not, Facebook is creating a new playing field for advertising by accepting product pages into their social graph.

Facebook clearly believes that fate follows its intelligent design. Twitter, by contrast, believes its destiny will emerge by evolution from a primordial ooze.

At Twitter's "Chirp" conference, Twitter announced that it will add "Annotations" to the core Twitter platform. The description of Twitter annotations is characteristically fuzzy and undetermined. There will be some sort of triple structure, the annotations will be fixed at a tweet's creation, and annotations will have either 512 bytes or maybe 1K. What will it be used for? Who knows?

Last week, I had a chance to talk to Twitter's Chairman and co-Founder Jack Dorsey at another great "Publishing Point" meeting. He boasted about how Twitter users invented hashtags, retweets and "@" references, and Twitter just followed along. Now, Twitter hopes to do the same thing with annotations. Presumably, the Twitter ecosystem will find a use for Tweet annotations and Twitter can then standardize them. Or not. You could conceivably load the Tweet with Open Graph metadata and produce a Facebook "Like" tweet.

Many possibilities for Tweet annotations, underspecified as they are, spring to mind. For example, the Code4Lib list was buzzing yesterday about the possibility that OpenURL references (the kind used in libraries to link to journal articles and books) could be loaded into an annotated tweet. It seems more likely to me that a standard mechanism to point to external metadata, probably expressed as Linked Data, will emerge. A Tweet could use an annotation to point to a web page loaded with RDFa metadata, or perhaps to a repository of item descriptions such as I mentioned in my post on Linked Descriptions. Clearly, it will be possible in some way or other to put real, actionable literature references into a tweet. Whether it will happen, it's hard to say, but I wouldn't hold my breath for Facebook to start adding scientific articles into its social graph.

Although there's a lot of common capability possible between Facebook's Open Graph and Twitter's Annotations, the worldviews are completely different. Twitter clearly sees itself as a communications medium and the Annotations as adjuncts to that communication. In the twitterverse, people are entities that tweet about things. Facebook sees its social graph as its core asset and thinks of the graph as being a world-wide web in and of itself. People and things are nodes on a graph.

While Facebook seems to offer a lot more to developers than Twitter, I'm not so sure that I like its worldview as much. I'm much more than a node on Facebook's graph.

Sunday, April 18, 2010

When Shall We Link?

When I was in grad school, my housemates and I would sit around the dinner table and have endless debates about obscure facts like "there's no such thing as brown light". That doesn't happen so much in my current life. Instead, my family starts making fun of me for "whipping out my iPhone" to retrieve some obscure fact from Wikipedia to end a discussion about a questionable fact. This phenomenon of having access to huge amounts of information has also changed the imperatives of education: students no longer need to learn "just in case", but they need to learn how to get information "just in time".

In thinking about how to bring semantic technologies to bear on OpenURL and reference linking, it occurred to me that "just in time" and "just in case" are useful concepts for thinking about linking technologies. Semantic technologies in general, and Linked Data in particular, seem to have focused on just-in-case, identifier-oriented linking. Library linking systems based on OpenURL, in contrast, have focused on just-in-time, description-oriented linking. Of course, this distinction is an oversimplification, but let me explain a bit what I mean.

Let's first step back and take a look at how links are made. Links are directional; they have a start and an end (a target). The start of a link always has an intention or purpose; the target is the completion of that purpose. For example, look at the link I have put on the word "grad school" above. My intention there was to let you, the reader, know something about my graduate school career, without needing to insert that digressional information in the narrative. (Actually my purpose was to illustrate the previous sentence, but let's call that a meta-purpose.) My choice of URL was "http://ee.stanford.edu/", but I might have chosen some very different URL. When I choose a specific URL, I "bind" that URL to my intention.

In the second paragraph, I have added a link for "OpenURL". In that case, I used the "Zemanta" plug-in to help me. Zemanta scans the text of my article for words and concepts that it has links for, and offers them to me as choices to apply to my article. Zemanta has done the work of finding links for a huge number of words and concepts, just in case a user comes along with a linking intention to match. In this case, the link suggested by Zemanta matches my intention (to provide background for readers unfamiliar with OpenURL). The URL becomes bound to the word during the article posting process.

At the end of this article, there's a list of related articles, along with a link that says "more fresh articles". I don't know what URLs Zemanta will supply when you click on it, but it's an example of a just in time link. A computer scientist would call this "late binding". My intention is abstract- I want you to  be able to find articles like this one.

Similar facilities are in operation in scholarly publishing, but the processes have a lot more moving parts.

Consider the citation list of a scientific publication. The links expressed by these lists are expressions of the author's intent- perhaps to support an assertion in the article, to acknowledge previous work, or to provide clarification or background. The cited item is described by metadata formatted so that humans can read and understand the description and go to a library to find the item. Here's an example:
D. C. Tsui, H. L. Störmer and A. C. Gossard, Phys. Rev. Lett. 48, 1559 (1982).
With the movement of articles on-line, the citations are typically turned into links in the publication process by parsing the citation into a computer-readable description. If the publisher is a member of CrossRef, the description could then be matched against CrossRef's huge database of article descriptions. If a match is found, the cited item description is bound to an article identifier, the DOI. For my example article, the DOI is 10.1103/PhysRevLett.48.1559. The DOI provides a layer of indirection that's not found in Zemanta linking. While CrossRef binds the citation to an identifier, the identifier link, http://dx.doi.org/10.1103/PhysRevLett.48.1559, is not bound to the target URL, http://prl.aps.org/abstract/PRL/v48/i22/p1559_1, until the user clicks the link. This scheme holds out hope that should the article move to a different URL, the connection to the citation can be maintained and the link will still work.
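You can watch that late binding happen. Here's a sketch (assuming the requests library) that asks the DOI resolver for the identifier link but doesn't follow the redirect, so the currently bound target URL shows up in the Location header:

import requests

doi_link = "http://dx.doi.org/10.1103/PhysRevLett.48.1559"
response = requests.get(doi_link, allow_redirects=False)
print(response.status_code)              # a redirect status, e.g. 302 or 303
print(response.headers.get("Location"))  # whatever URL the DOI is bound to today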

If the user is associated with a library using an OpenURL link server, another type of match can be made. OpenURL linkservers use knowledgebases which describe the set of electronic resources made available by the library. When the user clicks on an OpenURL link, the description contained in the link is matched against the knowledgebase, and the user is sent to the best-matching library resource. It's only at the very last moment that the intent of the link is bound to a target.

While the combination of OpenURL and CrossRef has made it possible to link citations to their intended target articles in libraries with good success, there has been little leveraging of this success outside the domain of scholarly articles and books. The NISO standardization process for OpenURL spent a great deal of time in making the framework extensible, but the extension mechanisms have not seen the use that was hoped for.

The level of abstraction of NISO OpenURL is often cited as a reason it has not been adopted outside its original application domain. It should also be clear that many applications that might have used OpenURL have instead turned to Semantic Web and Linked Data technologies (Zemanta is an example of a linking application built with semantic technologies.) If OpenURL and CrossRef could be made friendly to these technologies, the investments made in these systems might also find application in more general circumstances.

I began looking at the possibilities for OpenURL Linked Data last summer, when, at the Semantic Technologies 2009 conference, Google engineers expressed great interest in consuming OpenURL data exposed via RDFa in HTML, which had just been finalized as a W3C Technical Recommendation. I excitedly began to work out what was needed (Tony Hammond, another member of the NISO standardization committee, had taken a crack at the same thing.)

My interest flagged, however, as I began to understand the nagging difficulties of mapping OpenURL into an RDF model. OpenURL mapped into RDF was...ugly. I imagined trying to advocate use of OpenURL-RDF over BIBO, an ontology for bibliographic data developed by Bruce D'Arcus and Frédérick Giasson, and decided it would not be fun. There's nothing terribly wrong with BIBO.

One of the nagging difficulties was that OpenURL-RDF required the use of "blank nodes", because of its philosophy of transporting descriptions of items which might not have URIs to identify them. When I recently described this difficulty to the OpenURL Listserv, Herbert van de Sompel, the "irresistible force" behind OpenURL a decade ago, responded with very interesting notes about "thing-described-by.org", how it resembled "by-reference" OpenURL, and how this could be used in a Linked Data-friendly link resolver. Thing-Described-by is a little service that makes it easy to mint a URI, attach an RDF description to it, and make it available for harvest as Linked Data.

In the broadest picture, linking is a process of matching the intent of a link with a target. To accomplish that, we can't get around the fact that we're matching one description with another. A link resolver needs to accomplish this match in less than a second using a description squeezed into a URL, so it must rely on heuristics, pre-matched identifiers, and restricted content domains. If link descriptions were pre-published as Linked Data as in thing-described-by.org, linking providers would have time to increase accuracy by consulting more types of information and provide broader coverage. By avoiding the necessity of converting and squeezing the description into a URL, link publishers could conceivably reduce costs while providing for richer links. Let's call it "Linked Description Data".

Descriptions of targets could also be published as Linked Description Data. Target knowledgebase development and maintenance is a significant expense for link server vendors. However, target publishers have come to understand the importance (see KBART) of providing more timely, accurate and granular target descriptions. If they ever start to view the knowledgebase vendors as bottlenecks, the Linked Description Data approach may prove appealing.

Computers don't learn "just-in-time" or "just-in-case" the way humans do. But the matching at the core of making links can be an expensive process, taking time proportional to the square of the number of items (N²). Identifiers make the process vastly more efficient (N·log N). This expense can be front-loaded (just-in-case) or saved till the last moment (just-in-time), but opening the descriptions being matched for "when-there's-time" processing could result in dramatic advances in linking systems as a whole.
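As a toy illustration of that difference (reusing the DOI example from earlier in this post): brute-force matching compares every link description against every target description, while identifiers let you index the targets once and look each link up directly, which is roughly the shortcut CrossRef and link resolvers give you:

link_descriptions = [
    {"doi": "10.1103/PhysRevLett.48.1559", "intent": "citation in an article"},
]
targets_by_doi = {
    "10.1103/PhysRevLett.48.1559": "http://prl.aps.org/abstract/PRL/v48/i22/p1559_1",
}
for link in link_descriptions:
    # One lookup per link instead of comparing descriptions pairwise.
    target = targets_by_doi.get(link["doi"], "fall back to heuristic description matching")
    print(link["intent"], "->", target)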