
Thursday, May 20, 2010

Ten Evil Uses for URL Shortening Services

If you're like me, you don't work for Google. Now that Google has been given the worldwide legal monopoly on Not Being Evil, the rest of us must make our livings otherwise. And with Facebook starting to corner the market on monitoring our social interactions, it's getting harder and harder to make a splash on the Dark Side. Don't let that stop you. There are lots of nifty tools to help you run your start-up evilenture. Today, we cover URL shortening services:  Bit.ly, TinyURL, Ow.ly and friends.

Here are ten link shortening menaces for you to nibble on.
  1. No doubt you have your favorite website with a cross-site scripting vulnerability. But it can be a real pain to deliver a good attack script, and if you load it from a web site, there's a chance something might get traced back to you. No worries! A link shortener can help you load a bushel of attack code into one small friendly package. When your mark clicks on the link, he's delivered to that well-trusted but slightly buggy e-commerce website. Swipe his session cookies, forge access tokens, and harvest his personal info. He'll never even notice.
  2. Phishing attacks are starting to look so lame. By now, people know to be suspicious when the 1etters in a hostname become numer1c. With a link shortener you can easily hide the hostname or IP address; when asking for credit card info, it's SO important to be user friendly.
  3. You're into SQL injection? Link shorteners help you keep that DROP TABLE command from needlessly worrying your involuntary partners with privileges.
  4. Spam blocking getting you down? URL Shorteners can help you neutralize unsolicited email identification systems which use hostnames to identify possible spam.  "Girlz.xxx" is a great site, but "bit.ly" is a name you can show your fiancée's parents!
  5. Don't forget that once you get past the spam blocker, you still need to avoid the porn filter used by the school system or Panera Bread. Also, your corporate and government customers will appreciate the deniability offered by a shortened link.
  6. You've sent out the email blasts, but how do you know whether your eager audience receives your processed meat food or clicks on the links? The analytics provided by URL shortening services are a great solution! Shortened links are free, so you can use a new one for every recipient.
  7. Is your botnet being detected and your links being broken? Most shorteners won't help you because they won't let you change your link after you've created it, but take a look at PURL. If one of your machines gets taken out, you can edit the PURL to keep your link working, and shorten it for good measure.
  8. Ever wonder why there are so many URL shortening services? Chain a bunch of them together for fun, loopy amusement, and to confuse bit.ly! And add a Rickroll, while you're at it!
  9. Want to slander Islam, you blasphemer? Or gossip about your boss, you slacker? Avoid those annoying fatwas and performance improvement plans by using a shortener service that is blocked in Saudi Arabia or in your office.
  10. Want to hog the credit for links to other people's content? Ow.ly can help you there.
  11. BONUS! You know how the Evil guys torturing James Bond and Austin Powers are always based in a tiny island country or desert oasis? There's no better way to help those guys than to use the .LY (Libya), .CC (Cocos Islands), .GD (Grenada), .TO (Tonga) and .IM (Isle of Man) top level domains for as many links as possible.



But seriously...

Although Bit.ly and other URL shortening services tout their automated spam and malware detection and protection, they don't really explain why a URL shortening service needs spam and malware protection, or why this is a good reason for people to use their service. It's a bit like Domino's Pizza's big advertising campaign that explained how their pizza didn't taste awful anymore. You may have known that Domino's was not the tastiest of pizzas, but perhaps you didn't realize that shortened links might also be greasy and indigestible. Now you do.

In my post on shortDOI, I made a passing comment about Bit.ly's spam countermeasures that seemed to imply that the Digital Object Identifier (DOI) redirection service was somehow lacking in spam prevention. That was a mistake and a result of absent context.

As I've described here, there are lots of ways to abuse a link redirection service. If a service is frequently abused, its links may get blocked, its ISP may start to get complaints and threaten to shut it off, and its reputation will suffer. So link redirection services of all kinds need to have robust and scalable ways to prevent abuse.

DOI uses a very different mechanism to avoid malware and spam. They are selective about who may create and administer DOI links. This is great if you're someone who clicks on DOI links, but bad if you haven't been approved by DOI's vetting procedures. You probably can't even figure out if DOI would approve you or not. PURL, which has a similar objective of improving link persistence, takes a similar approach but has a lower entry barrier.

The contrast between Bit.ly and DOI makes clear that the biggest benefit of Bit.ly's spam and malware mechanisms is not that they make bit.ly links safer than DOI links, it's that they allow you to use their service, even when they don't trust you.

It's still pizza, even if the sauce is better.

Thursday, May 13, 2010

A Long Handle on Shortened Digital Object Identifiers

Google has launched a social networking site called Orkut. Already the site has over 100 million members worldwide! What's that? You haven't heard of it? No, it's not new; it's actually more than 6 years old. The 100 million users- they're mostly in Brazil and India.

You might be asking yourself, "What is Google doing running a social networking site for Brazil and India?", but a better question might be "What do you do in life when you take a home run swing and manage a nub single?"

The technology behind the Digital Object Identifier has proven to be one of these infield hits. It's called the "Handle" system, and it was developed fifteen years ago by computer scientists who feared that the DNS system that maps host names to IP addresses on the internet would fail to scale for the very large numbers of digital objects needed in future digital library applications. Bob Kahn, who with Vint Cerf designed the Transmission Control Protocol (TCP) underlying today's internet, was the driving force behind the Handle system. It implements a highly scalable distributed naming infrastructure for digital objects. Kahn's organization, CNRI, then developed the Digital Object Identifier to serve as a core application of the Handle System.

The Digital Object Identifier, or DOI, was designed and promoted as a solution to two problems, broken URLs and rights management. The fact that fifteen years later broken URLs are still a problem and digital rights are a quagmire suggests that the DOI has had limited impact in these areas. Did something go wrong?

Fifteen years ago, Google did not exist. Netscape was just a year old. The developers of the Handle system hoped that Handle resolution would get built into web browser software alongside DNS. That never really happened, perhaps because Netscape and Microsoft had development roadmaps for web servers and browsers that diverged from CNRI's vision. To allow Handles to work in unmodified browser software, CNRI was forced to implement a proxy service that connected the system of DNS resolution to the system of handle resolution via a web server. This proxy, at http://dx.doi.org/, serves almost all of the world's DOIs. A link pointed at the DOI proxy gets redirected to a web address designated by the owner of the digital object or perhaps to a library with DOI-enabled software. This redirection capability gives publishers the flexibility to move digital objects from one address to another without breaking links, even if the object is traded to another publisher using a completely different website.

Things have changed in 15 years. The World Wide Web turned out to be not so interested in digital objects with complex management systems and rights regimes. Instead, the World Wide Web turned out to be an unthinkably large number of web pages with negligible rights management indexed by search engines. Persistence of links turned out to be less important than the findability of content in search engines.

Since search engines never bothered to learn about the Handle system, the DOI proxy turned out to be much more important than the Handle resolution system that hides behind it. Details that were inconsequential 15 years ago have become important. One of these details is the DOI proxy's HTTP status code. This code tells a requestor the meaning of the redirect URL. As I've previously written, the DOI proxy returns a 302 status code. The Google index interprets this to mean that it should assign indexing weight to the URL beginning with "http://dx.doi.org/", and not the publisher-specified URL. 302 is the correct code for the DOI proxy, because if the publisher-specified URL changes, the publisher doesn't want to lose all the "link juice" it has earned by being linked to from other sites.
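You can see this behavior for yourself by requesting a DOI link and stopping at the proxy's answer instead of following it. Here's a minimal sketch using Python and the requests library; the DOI is the PhysRevLett example that appears later in this post, but any registered DOI would do.

    import requests

    # Ask the DOI proxy for a redirect, but don't follow it.
    response = requests.get(
        "http://dx.doi.org/10.1103/PhysRevLett.48.1559",
        allow_redirects=False,
    )

    print(response.status_code)          # expect 302 ("Found")
    print(response.headers["Location"])  # the publisher-designated URL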

The DOI has not been generally adopted by the web at large, with an important exception, CrossRef. CrossRef added some useful machinery onto the DOI and turned it into an absolutely essential service for publishers of ejournals and other types of content that need to embed persistent links. CrossRef accounts for 96% of all registered DOIs and DOI resolutions (about 60 million per month).

60 million resolutions per month might seem like a lot of traffic, but it's not so big on the scale of today's web. Google delivers that many searches in 3 hours. A more direct comparison would be bit.ly, the URL shortening service, which reported 3.4 BILLION resolutions in March, or 4.6 million resolutions per hour.

The staggering popularity of URL shortening services such as bit.ly prompted CrossRef's Geoffrey Bilder to propose last year a similar service for DOI links. Evidently, the folks at the International DOI Foundation agreed that this was a good idea, because last week, they launched the "shortDOI" service.

ShortDOI is meant to address a shortcoming of DOIs- their length and ugliness. When DOI started, no one could have imagined that URLs would appear prominently on boxes of children's cereal, as they do today. It was assumed that they would be hidden in links and be used exclusively by machines. The original spec for the DOI string even allowed DOIs to include spaces and non-printing Unicode characters! Digital object owners were free to choose ID strings that were long and full of punctuation, even punctuation that was incompatible with web pages. ShortDOI uses a short alphanumeric code to do away with all the DOI ugliness. It also does away with the publisher prefix, which hasn't been much use anyway. So instead of 10.1103/PhysRevLett.48.1559 or 10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2 shortDOI lets us use URLs like http://doi.org/aa9 and http://doi.org/aabbe.

ShortDOI can't quite decide whether it's a DOI or a shortener. Like DOI, it's implemented using the Handle system. Instead of redirecting through the DOI proxy, shortDOI acts as an alternate proxy, and uses the same 302 redirects that the long DOI proxy uses. From the perspective of search engines, a shortDOI is a new object to be ranked separately from the DOI. The link juice earned by a short DOI won't accrue to the DOI it has shortened.
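One way to convince yourself that a shortDOI really is a separate object is to follow both redirect chains to the end and compare where they land. A sketch, assuming (as the pairing in the example above suggests) that http://doi.org/aa9 is the shortened form of the PhysRevLett DOI:

    import requests

    # Both requests follow their full redirect chains.
    long_form = requests.get("http://dx.doi.org/10.1103/PhysRevLett.48.1559")
    short_form = requests.get("http://doi.org/aa9")

    # Same landing page...
    print(long_form.url == short_form.url)

    # ...but search engines see two different starting URIs, so the link
    # juice earned by one does not accrue to the other.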

Like a shortener, shortDOI assigns codes sequentially, making it easy for robots to harvest content identified by shortDOIs. ShortDOI allows anyone to create the shortened URL, but provides none of the tracking, statistics, spam protection and malware rejection offered by other shortener services. Library OpenURL servers don't yet work with shortDOI, even though the shortDOI proxy will try to send shortDOI handles to them.

The implementation choices made for shortDOI aren't mistakes- they make perfect sense in the context of the Handle naming architecture. Nonetheless, the difficulties they present for libraries and search engine optimization highlight the Handle system's misfit with the dominant link resolution architecture of the web.

The Handle system has been very well designed and managed. I've written software that uses the Handle software libraries and I found them to be elegant and easy to work with. The principles and algorithms built into the Handle system are very similar to those used years later inside Google's internal file system or by any number of other large digital object management systems.

The Handle System is relatively inexpensive, but its costs are now higher than those of the large-scale URL shorteners. According to public tax returns, the DOI Foundation pays CNRI about $500,000 per year to run the DOI resolution system. That works out to roughly 70 cents per thousand resolutions. Compare this to Bit.ly, which has attracted $3.5 million of investment and has resolved about 20 billion shortened links- a cost of roughly 20 cents per thousand. It remains to be seen whether bit.ly will find a sustainable business model; competing directly with DOI is not an impossibility.

What do you do with infrastructure that has been successful in Brazil or scholarly publishing but not elsewhere? Do you keep it alive in hopes that after twenty years, some unforeseen circumstance will result in its overnight universal adoption? Do you scale back, phase out or sell out in favor of more cost effective alternatives? Or do you just do your best to continue serving loyal users? I don't know the answer, but I do know that in baseball and cricket you've got to run the bases to score.

Update: Corrected CrossRef share of DOI resolutions.


Tuesday, November 24, 2009

Publish-Before-Print and the Flow of Citation Metadata

Managing print information resources is like managing a lake. You need to be careful about what flows into your lake and you have to keep it clean. Managing electronic information resources is more like managing a river- it flows through many channels, changing as it goes, and it dies if you try to dam it up.

I have frequently applied this analogy to libraries and the challenges they face as their services move online, but the same thing is true for journal publishing. A journal publisher's duties are no longer finished when the articles are bound into issues and put into the mail. Instead, publication initiates a complex set of information flows to intermediaries that help the information get to its ultimate consumer. Metadata is sent to indexing services, search engines, information aggregators, and identity services. Mistakes that occur in these channels will prevent customer access just as profoundly as the loss of a print issue, and are harder to detect, as well.

A large number of journals have made the transition from print distribution to dual (print+electronic) distribution; many of those journals are now considering the transition to online-only distribution. As they plan these transitions, publishers are making decisions that may impact the distribution chain. Will indexing services be able to handle the transition smoothly? Will impact factors be affected? Will customer libraries incur unforeseen management costs?

I was recently asked by the steering committee of one such journal to look into some of these issues, in particular to find out about the effects of the "publish-before-print" model on citations. I eagerly accepted the charge, as I've been involved with citation linking in one way or another for over 10 years and it gave me an opportunity to reconnect with a number of my colleagues in the academic publishing industry.

"Publish-before-print" is just one name given to the practice of publishing an article "version of record" online in advance of the compilation of an issue or a volume. This allows the journal to publish fewer, thicker issues, thus lowering print and postage costs, while at the same time improving speed-to-publication for individual articles. Publish-before-print articles don't acquire volume, issue and page metadata until the production of the print version.

Before I go on, I would like to recommend the NISO Recommended Practice document on Journal Article Versions (pdf, 221KB). It recommends "Version of Record" as the terminology to use instead of "published article", which is used loosely in a number of circumstances:
  1. Version of Record (VoR) is also known as the definitive, authorized, formal, or published version, although these terms may not be synonymous.
  2. Many publishers today have adopted the practice of posting articles online prior to printing them and/or prior to compiling them in a particular issue. Some are evolving new ways to cite such articles. These “early release” articles are usually [Accepted Manuscripts], Proofs, or VoRs. The fact that an “early release” article may be used to establish precedence does not ipso facto make it a VoR. The assignment of a DOI does not ipso facto make it a VoR. It is a VoR if its content has been fixed by all formal publishing processes save those necessary to create a compiled issue and the publisher declares it to be formally published; it is a VoR even in the absence of traditional citation data added later when it is assembled within an issue and volume of a particular journal. As long as some permanent citation identifier(s) is provided, it is a publisher decision whether to declare the article formally published without issue assignment and pagination, but once so declared, the VoR label applies. Publishers should take extra care to correctly label their “early release” articles. The use of the term “posted” rather than “published” is recommended when the “early release” article is not yet a VoR.
"Version of Record before Print" is a bit of a mouthful, so I'll continue to use "publish-before-print" here to mean the same thing.

It's worth explaining "Assignment of a DOI" further, since it's a bit complicated in the case of publish-before-print. Crossref-issued DOIs are the identifiers used for articles by a majority of scholarly journal publishers. To assign the DOI, a publisher has to submit a set of metadata for the article, along with the DOI that they want to register. The Crossref system validates the metadata and stores it in its database so that other publishers can discover the DOI for citation linking. In the case of publish-before-print, the submitted metadata will include journal name, the names of the authors, the article's title, and the article's URL, but will be missing volume, issue and page numbers. After the article has been paginated and bound into an issue, the publisher must resubmit the metadata to Crossref, with the added metadata and the same DOI.
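Schematically, the two deposits might look like the following. This is an illustrative sketch only- the field names are simplified stand-ins, not the actual Crossref deposit schema, and the DOI is made up.

    # Illustrative only: simplified stand-in fields, not the real Crossref
    # deposit schema. The point is what changes between the two deposits.
    initial_deposit = {
        "doi": "10.9999/example.2010.001",   # hypothetical DOI
        "journal": "Journal of Examples",
        "authors": ["A. Author", "B. Coauthor"],
        "title": "A Publish-Before-Print Article",
        "url": "http://publisher.example.org/articles/example.2010.001",
        # no volume, issue, or page numbers yet
    }

    # After the article is paginated into an issue, the publisher resubmits
    # the same DOI with the completed bibliographic data.
    updated_deposit = dict(
        initial_deposit,
        volume="12",
        issue="3",
        first_page="245",
    )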

What happens if the online article is cited in an article in another journal during the time between the version of record going online and the full bibliographic data being assigned? This question is of particular importance to authors whose citation rates may factor into funding or tenure decisions. The answer depends on the processes used to publish the citing article and produce the citation databases, so I had to make a few calls to get some answers.

As you might expect, journal production processes vary widely. Some journals, particularly in the field of clinical medicine, are very careful to check and double check the correctness of citations in their articles. For these journals, it's highly likely that the editorial process will capture updated metadata. Other publishers take a much more casual approach to citations, and publish whatever citation data the author provides. Most journals are somewhere in the middle.

Errors can creep into citations in many ways, including import of incorrect citations from another source, misspelling of author names, or simple miskeying. DOIs are particularly vulnerable to miskeying, due to their length and meaninglessness. One of my sources estimates that 20% of author-keyed DOIs in citations are incorrect! If you have the opportunity to decide on the form of a DOI, don't forget to consider the human factor.

It's hard to get estimates of the current error rate in citation metadata; when I was producing an electronic journal ten years ago, my experience was consonant with industry lore that said that 10% of author-supplied citations were incorrect in some way. My guess, based on a few conversations and a small number of experiments, is that a typical error rate in published citations is 1-3%. A number of processes are pushing this number down, most of them connected with citation linking in some way.

Reference management and sharing tools such as RefWorks, Zotero, and Mendeley now enable authors to acquire article metadata without keying it in and to link citations before they even submit manuscripts for publication; this can't help but improve citation accuracy. Citation linking in the copy editing process also improves the accuracy of citation metadata. By matching citations to databases such as Crossref and PubMed, unlinked citations can be highlighted for special scrutiny by the author.
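Whatever the tool, the underlying check is the same simple loop: try to match each reference against an authoritative metadata source and flag the ones that don't match. A minimal sketch of that loop; lookup_citation() is a hypothetical stand-in for whatever Crossref or PubMed query a real production system would use.

    def check_references(citations, lookup_citation):
        # Return the citations that couldn't be matched and so need scrutiny.
        flagged = []
        for citation in citations:          # each citation is a dict of fields
            match = lookup_citation(citation)
            if match is None:
                flagged.append(citation)    # unlinked: highlight for the author
            else:
                citation.update(match)      # pull corrected metadata from the match
        return flagged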

Integration of citation linking into publishing workflow is becoming increasingly common. In publishing flows hosted by HighWire Press' Bench>Press manuscript submission and tracking system, Crossref and PubMed can be used at various stages to help copyeditors check and verify links. Similarly, ScholarOne Manuscripts, a manuscript management system owned by Thomson Reuters, integrates with Thomson Reuters' Web of Science and EndNote products. Inera's eXtyles, software that focuses specifically on citation parsing and is integrated with Aries Systems' Editorial Manager, has recently added an automatic reference correction feature that not only checks linking, but also pulls metadata from Crossref and PubMed to update and correct citations. I also know of several publishers that have developed similar systems internally.

In most e-journal production flows, there is still a publication "event", at which time the content of the article, including citations, becomes fixed. The article can then flow to third parties that make the article discoverable. Of particular interest are citation databases such as Thomson Reuters' Web of Science (this used to be ISI Science Citation Index). The Web of Science folks concentrate on accurate indexing of citations; they've been doing this for almost 50 years.

Web of Science will index an article and its citations once it has acquired its permanent bibliographic data. The article's citations will then be matched to source items that have already been indexed. Typically there are cited items that don't get matched - these might be unpublished articles, in-press articles, and private communications. Increasingly, the dangling items include DOIs. In the case of a cited publish-before-print article, the citation will remain in the database until the article has been included in an issue and indexed by Web of Science. At that point, if the DOI, journal name, and first author name all match, the dangling citation is joined to the indexed source item so that all citations of the article are grouped together.
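The grouping rule described above boils down to a conjunction of three field comparisons. Purely as an illustration of the logic (not Web of Science's actual implementation), with each record a simple dict of fields:

    def joins_dangling_citation(dangling, indexed_item):
        # Join the dangling citation to the indexed source item only if the
        # DOI, journal name, and first author name all match.
        fields = ("doi", "journal", "first_author")
        return all(
            dangling.get(f, "").strip().lower() == indexed_item.get(f, "").strip().lower()
            for f in fields
        )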

Google's PageRank is becoming increasingly important for electronic journals, so it's important to help Google group together all the links to your content. The method supported by Google for grouping URLs is the rel="canonical" link element. By putting a DOI-based link into this element on the article web pages, publishers can ensure that the electronic article will be ranked optimally in Google and Google Scholar.
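Concretely, that means one extra line in each article page's HTML head. A sketch of emitting it; the DOI here is just an example, in practice it would be the article's own DOI.

    def canonical_link_tag(doi):
        # The tag a publisher would put in each article page's <head>.
        return '<link rel="canonical" href="http://dx.doi.org/%s" />' % doi

    print(canonical_link_tag("10.1103/PhysRevLett.48.1559"))
    # <link rel="canonical" href="http://dx.doi.org/10.1103/PhysRevLett.48.1559" />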

An increasingly popular alternative to publish-before-print is print-oblivious article numbering. Publishers following this practice do not assign issue numbers or page numbers, and instead assign article numbers when the version-of-record is first produced. Downstream bibliographic systems have not universally adjusted to this new practice; best practices for article numbers are described in an NFAIS Report on Publishing Journal Articles (pdf, 221KB).

In summary, the flow of publish-before-print articles to end users can be facilitated by proper use of DOIs and Crossref.
  1. Prompt, accurate and complete metadata deposit at the initial online publication event and subsequent pagination is essential.
  2. DOIs should be constructed with the expectation that they will get transcribed by humans.
  3. Citation checking and correction should be built into the article copyediting and production process.
  4. Use of the DOI in rel="canonical" link elements will help in search engine rankings.

Thursday, July 9, 2009

URL Shorteners and the Semantics of Redirection

When I worked at Bell Labs in Murray Hill, NJ, it amused me that at one end of the building, the fiber communications people were worrying that no one could ever possibly make use of all the bandwidth they could provide- we would never be able to charge for telephone calls unless they figured out how to limit the bandwidth. At the other end of the building, computer scientists were figuring out how to compress information so that they could pack more and more into tiny bit-pipes. I'm still not sure who won that battle.

When I was part of a committee working on the OpenURL standard, we had a brief discussion about the maximum length URL that would work over the internet. A few years before that, there were some systems on the internet that barfed if a URL was longer than 512 characters, but most everything worked up to 2,000 characters, and we anticipated that that limit would soon go away. So here we are in 2009, and Internet Explorer is just about the only thing that still has a length limit as low as 2083 characters. Along comes Twitter, with a 140 character limit on an entire message, and all of a sudden, the URLs we've been making have become TOO LONG! Just as fast, URL shortening services sprang up to make the problem go away.

The discussion on my last post (on CrossRef and OpenURL) got me interested in the semantics of redirection, and that got me thinking about the shortening services, which have become monster redirection engines. When we say something about a URI that is resolved by a redirector, what, exactly, are we talking about?

First, some basics. A redirection occurs when you click on a link and the web server for that link tells your browser to go to another URL. Usually, the redirection occurs in the http protocol that governs how your web browser gets web pages. Sometimes, a redirect is caused by a directive in an html page, or programmed by a javascript in that page. The result may seem the same but the mechanism is rather different, and I won't get into it any further. There are actually 3 types of redirects provided for in the http protocol, known by their status codes as "301", "302" and "303" redirects. There are 5 other redirect status codes that you can safely ignore if you're not a server developer. The 301 redirect is called "Moved Permanently", the 302 is called "Found" and the 303 is called "See Other". Originally, the main reason for the different codes was to help network servers figure out whether to cache the responses to save bandwidth (the fiber guys had not deployed so much back then and the bit squeezers were top dogs). Nowadays the most important uses of the different codes are in search engines. Google will interpret a 301 as "don't index this URL, index the redirect URL". A 302 will be interpreted as "index the content at the redirect URL, but use this URL for access". According to a great article on URL shorteners by Danny Sullivan, Google will treat a 303 like a 302, but who knows?
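To make the mechanism concrete, here's a minimal sketch of the server side of a redirect: a tiny Python WSGI app that answers every request with a 301 pointing somewhere else. The target URL is a placeholder. Swap in "302 Found" or "303 See Other" and the response barely changes- only the promise made to caches and search engines does.

    from wsgiref.simple_server import make_server

    TARGET = "http://example.org/the-real-location"   # placeholder target

    def redirector(environ, start_response):
        # Send the status code and the Location header; the body can be empty.
        start_response("301 Moved Permanently", [("Location", TARGET)])
        return [b""]

    if __name__ == "__main__":
        make_server("localhost", 8000, redirector).serve_forever()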

Just as 301 and 302 semantics have been determined by their uses in search engines, the 303 has been coopted by the standards-setters of the semantic web, and they may well be successful in determining the semantics of the 303. As described in a W3C Technical Recommendation, the 303 is to be used
... to give an indication that the requested resource is not a regular Web document. Web architecture tells you that for a thing resource (URI) it is inappropriate to return a 200 because there is, in fact, no suitable representation for those resources.
In other words, the 303 is supposed to indicate that the Thing identified by the URI (URL) is something whose existence is NOT on the web. Tim Berners-Lee wrote a lengthy note about this that I found quite enjoyable, though at the end I had no idea what it was advocating. The discussion that led to the W3C Recommendation has apparently been extremely controversial, and has been given the odd designation "httpRange-14". The whole thing reminds me of reading the existentialists Sartre and Camus in high school - they sounded so much more understandable in French!

As discussed in Danny Sullivan's article, most of the URL shorteners use 301 redirects, which is what most users want to happen. An indexing agent or a semantic web agent should just look through these redirectors and use the target resource URL in its index. The DOI "gateway" redirector at dx.doi.org discussed in my previous post uses a 302 redirect. Unless DOIs are handled specially by a search engine, this means that the "link credit" (a.k.a. google juice) for a dx.doi.org link will accrue to the dx.doi.org URL rather than the target URL. This seems appropriate. Although I indicated that under Linked Data rules the dx.doi.org link identifies whatever is indicated by the returned web page, from the point of view of search engines that URI identifies an abstraction of the resource it redirects to. A redirection service similar in conception, PURL, also uses 302 redirects.

I was curious about the length limits of the popular url shorteners. Using a link to this blog, padded by characters ignored by Blogger.com, I shortened a bunch of long URLs. Here are 4 shortened 256 character links to this blog:
They all work just fine. Moving to 1,135 character links, everything still works (at least in my environment):
At 2083 characters, the limit for Internet Explorer, we start separating the redirection studs from the muffins.
When I add another character, to make 2,084 total, bit.ly and snurl.com both work, but blogger.com reports an error!
The compression ratios for these last two links are 109 to 1 for bit.ly and 95 to 1 for snurl. The bit squeezers would be happy.

Next, I wanted to see if I could make a redirection loop. Most of the shortening services decline to shorten a shortened URL, but they're quite willing to shorten a URL from the PURL service. Also, I couldn't find any way to use the shortening services to fix a link that had rotted after I shortened it. It could be useful to add the PURL service as link-rot insurance behind a shortened URL if the 302 redirect is not an issue. So here's a PURL: http://purl.oclc.org/NET/backatcha that redirects to http://bit.ly/aE0od which redirects to http://purl.oclc.org/NET/backatcha etc. Don't click these expecting an endless loop- your browser should detect the loop pretty fast.
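A browser detects a loop like this by remembering where it has been, and the same few lines work for any client that follows redirects by hand. A sketch, assuming the requests library:

    from urllib.parse import urljoin
    import requests

    def follow(url, max_hops=20):
        seen = set()
        while url not in seen and len(seen) < max_hops:
            seen.add(url)
            response = requests.get(url, allow_redirects=False)
            if response.status_code not in (301, 302, 303, 307):
                return url                            # landed on a real page
            url = urljoin(url, response.headers["Location"])
        raise RuntimeError("redirect loop (or too many hops) at " + url)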

A recent article about how bit.ly is using its data stream to develop new services got me thinking again about how a shortening redirector might be useful in Linked Data. I've written several times that Linked Data lacks the strong attribution and provenance infrastructure needed for many potential applications. Could shortened URIs be used as Linked Data predicates to store and retrieve attribution and provenance information, along with the actual predicate? And will I need another http status code to do it?

Monday, July 6, 2009

Crossref, OpenURL and more Linked Data Heresy

After CrossRef was started nine years ago, I quipped that it was nothing short of miraculous, since it was the first time in recorded history that so many publishers had gotten together and agreed on something that they would have to pay for. I'm sure that was an exaggeration, but my point was that CrossRef was not really about linking technology, rather, it was about the establishment of a business process around linking technology. The choice of technology itself was to some extent irrelevant.

Last week, in a comment on my post about AdaptiveBlue and OpenURL, Owen Stephens raised some interesting questions surrounding OpenURL, DOI (Digital Object Identifier), and Linked Data. It's useful to think of each of these as a social practice surrounding a linking technology; I'll describe each of them in turn.

DOI is often thought of as synonymous with CrossRef, which is incorrect. DOI is a link indirection technology used by the CrossRef organization. There are some DOIs that are not CrossRef DOIs, but most of the DOIs you are likely to come across will be CrossRef DOIs. CrossRef provides registration, matching and lookup services in addition to the DOI redirection service, and from here on, I'll be talking about CrossRef DOIs only. The core mission of Crossref is the transformation of journal article citations into clickable URLs. CrossRef has registered about 35 million DOIs, most of them for journal articles. In the registration process, CrossRef collects identifying metadata for the journal articles, which it then uses to power its matching and lookup services. The matching service is currently making about 15 million matches per month.

CrossRef is far from being perfect, but its achievements have been considerable. Most scholarly journal publishers have integrated the CrossRef registration and matching process into their production workflows. The result is that many thousands of electronic journals today are being linked to from many thousands of other electronic journals, databases, search engines, even blogs.

In contrast to CrossRef, which focuses on publishers and publisher workflow integration, OpenURL is a linking technology and practice that has focused on helping libraries manage links to and from the electronic resources available to their patrons. OpenURL is complementary to Crossref- OpenURL linking agents usually make use of CrossRef services to accomplish their mission of helping users select the appropriate resources for a given link. Libraries frequently need to deal with problems associated with multiple resolution- a given article might be available at ten or even a hundred different URLs, only one of which might work for a given library patron.

Finally, Linked Data is an emerging practice which enables diverse data sets to be published, consumed and then linked with other data sets and relinked into a global web of connections. It would be interesting to find out how many matches are being made in the Linked Data web to compare with CrossRef, but because of the decentralized matching, it's not really possible to know. While CrossRef and OpenURL focus on connecting citing articles and abstracts with the cited articles, Linked Data attempts to support any type of logical link.

Obviously there is overlap between Linked Data and the more established linking practices. Can (and should) Linked Data applications reuse the CrossRef and/or OpenURL URIs? Let's first consider OpenURL. OpenURL is really a mechanism for packaging metadata for a citation (jargon: ContextObject) into a URI. So the "thing" that an OpenURL URI identifies is the set of services about the citation available from a particular resolver agent. That's not usually the thing that you want to talk about in a Linked Data application.
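For concreteness, here is roughly what that packaging looks like: citation metadata serialized as key/encoded-value (KEV) pairs onto a resolver's base URL. The resolver address and the citation details below are made up for the example.

    from urllib.parse import urlencode

    # A made-up citation packaged as an OpenURL ContextObject (KEV format).
    citation = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.jtitle": "Journal of Examples",
        "rft.atitle": "An Example Article",
        "rft.aulast": "Author",
        "rft.volume": "12",
        "rft.spage": "245",
        "rft.date": "2009",
        "rft_id": "info:doi/10.9999/example.2009.001",   # hypothetical DOI
    }

    # The "thing" this URI identifies is the set of services the resolver
    # at resolver.example.edu can offer for this citation.
    openurl = "http://resolver.example.edu/openurl?" + urlencode(citation)
    print(openurl)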

What about CrossRef DOIs? There are two different URIs that you can make with a DOI. There's the http URL that gets redirected to full text (you hope) by the DOI gateway (http://dx.doi.org/10.1144/0016-76492006-123). There's also the "info-uri" form of the DOI, info:doi/10.1144/0016-76492006-123, which you can't click on. It's clear what the latter URI identifies- it's a 2007 article in the Journal of the Geological Society. Many libraries run resolver agents that can turn that URI into clickable service links. I'm not sure what the former URI identifies. What the URI gets you to is a web page with links to two different instantiations of the article identified by the info-uri. Apparently it doesn't identify the same article in its other instantiations on the internet. So if you want to make Linked Data assertions about the article, the most correct URI to use is (in my humble but correct opinion) the info-uri.

There's one little problem.

The second of Tim Berners-Lee's "Four Rules" for Linked Data is "Use HTTP URIs so that people can look up those names." But CrossRef, a stable, self-sustaining organization which has made huge strides moving the world of journal publishing to a more open, more usable, more linked environment, provides look-up APIs that return high quality XML metadata so that you can look up the names that it defines. It has a solid record of accomplishing exactly the things that Linked Data is trying to do on a broader scale, and undeniably with significant impact. The identifier that CrossRef is using is the DOI, and the URI form of the DOI is NOT an HTTP URI.

Maybe Tim BL's second rule is wrong, too!