An editor critiques the publishing industry's Automated Content Access Protocol

The Automated Content Access Protocol (ACAP) is a new technical venture
by an international consortium of
publishers, and a proposed technical solution to the tug of war
between publishers and intermediaries such as search engines and news
aggregation sites. This article goes into some detail about ACAP and
offers both a technical and a philosophical context for judging its
impact and chances of success.

I exchanged email with Mark Bide of Rightscom Limited, Project
Coordinator of ACAP, who provided explanations of the project and a
defense in response to my critique. I’ll incorporate parts of his
insights here–of course with his permission!

What the Automated Content Access Protocol does

Clashes between publishers and intermediary Internet sites have
entered the news routinely over the past decade. Publishers recognize
the waste and risk involved in their current legal activities,
notably:

Suing sites that offer large chunks of books or news articles without
licensing them, a concern related to the Association of American
Publishers’ lawsuit against Google for providing a new channel to
long-forgotten information and potential sales. (So far as I know,
O’Reilly Media was the only publisher who publicly backed Google
in this controversy.) A major part of ACAP is devoted to listing a
“snippet,” “extract,” or “thumbnail” that can be displayed when a site
is found, and nailing down the requirement that search engines display
only what the publisher wants displayed.

Suing news crawlers that display articles in frames next to
advertisements that generate revenue for the news crawlers–in place
of the original advertisements that the publisher put up to generate
revenue for itself. One field in the protocol is provided to
explicitly forbid this practice.

The designers of ACAP posit a cooperation between publishers and
search sites, whereby publishers put up their specifications for
display and search sites honor these specifications. The ACAP
specifications (listed mostly in Part 1 of a technical document on the
site I pointed to) are loaded with a range of features that publishers think
would improve their business model, such as requirements that search
engines accompany links with credits or licensing conditions for
articles.

One’s judgment of ACAP could be influenced by whether one sees the
current problems in publishing as social or technical. Bide says the
problems are technical ones, because publishers can’t convey their
intentions along with their content to search engines and aggregators. I
see difficult social issues here, such as author integrity versus
innovation in derivative uses, and copyright infringement versus fair
use. I see ACAP as a classic stratagem of applying technical fixes to
social problems.

The whole bag is presented as an extension of the traditional
robots.txt file, which is like describing the Pacific Ocean
as an extension of the Bering Strait. Unlike the simple yes/no
decisions and directory listings of robots.txt, ACAP
overflows with conditions that, as we will see, multiply rapidly into
a world of calculations.

Officially backed by the World Association of Newspapers, the
International Publishers Association, and the European Publishers
Council, ACAP seems tailored mostly to news sites, but appeals to
other publishers as well.

What is the goal?

In my opinion, ACAP is a platform for a new business partnership
between publishers and search engines. Considering how much work the
search engines will have to do to re-instrument many parts of their
code for gathering and displaying information, I don’t believe they’ll
do it without a cut of the take. Thus, an important feature of
robots.txt that ACAP employs is the User-agent line
that can provide rules for a particular crawler.
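
For readers who have not looked at robots.txt lately, here is a
minimal sketch of per-crawler rules, using the urllib.robotparser
module from Python's standard library. The crawler names LicensedBot
and OtherBot and the paths are hypothetical, and the file shows only
the existing robots.txt mechanism, not any of ACAP's own fields.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler gets broad access,
# every other crawler is shut out entirely.
ROBOTS_TXT = """\
User-agent: LicensedBot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The same URL gets a different answer depending on who is asking.
licensed_ok = parser.can_fetch("LicensedBot", "/news/story.html")
other_ok = parser.can_fetch("OtherBot", "/news/story.html")
licensed_private_ok = parser.can_fetch("LicensedBot", "/private/draft.html")

print(licensed_ok, other_ok, licensed_private_ok)
```

A publisher that wanted to favor paying partners would need nothing
more than this mechanism to tell them apart from everyone else.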

If successful, this initiative could turn into nothing less than a new
information source that’s separate from and parallel to the existing
Internet, although using its underlying lines and protocols. I would
not be surprised if publishers started encrypting content so it
wouldn’t even be found by search engines that fail to obtain a license.

I summarize the attitude of the publishing consortium that thought up
ACAP as “We don’t like the Internet, but we’d better compete with it.”
I acknowledge that the publishers will feel this characterization is
insulting, but I intend it to highlight the difference between the
free flow for which the Internet is known and the hedges built by
ACAP.

On the other hand, the success of ACAP depends on search engines
integrating ACAP-controlled searches with general search results. The
best hope publishers have is to see their content promoted in
general-purpose searches, cheek by jowl with free content from all
around the Internet. Segregating content, even if just in searches,
would put publishers on the same road as CompuServe.

Currently, many web users link to publisher content or republish it
without using crawlers, but publishers presumably don’t need a special
protocol to deal with any economic impact of occasional deep links or
copying. Still, ACAP may also have applications outside of search
engines, according to Bide. It could be the basis for agreements with
many trading partners.

Technical demands of ACAP

Lauren Weinstein presciently demonstrates that publishers are likely to
turn ACAP from a voluntary cooperation
into a legal weapon, and suggests that it shifts the regulatory burden
for copyright infringement (as well as any other policy defined at the
whim of the publishers) from the publishers to the search engines. I
would add that a non-trivial technical burden is laid on search
engines too.

Bide assured me that the designers of ACAP consulted with several
search engine companies (most of which do not want to be listed
publicly) and have run tests establishing that the technology is
feasible. I’m sure search engines can handle the calculations required
to observe ACAP rules for publishers who use it, given that the search
engines already routinely generate indexes from billions of
documents, each containing thousands of words.

Bide writes, “There is some overhead on deciding at display time
whether or not and how a particular item can be displayed (as snippet,
thumbnail, etc), but the proportion of web pages about which search
engines will have to make this kind of decision is tiny, since they
will only be associated with high-value content from commercial
publishers, who represent only a small proportion of the content of a
search engine’s index.”

So here I simply ask how much coding and validation is required to
conform to ACAP, and what the incentive is for publishers and search
engines to do so.

First, the search engine must compile a policy that could be a
Cartesian product of a huge number of coordinates, such as:

Whether to index the actual page found, or another source specified by
the publisher as a proxy for that page, or just to display some fixed
text or thumbnail provided by the publisher

When to take down the content or recrawl the site

Whether conversions are permitted, such as from PDF to HTML

Whether translations to another language are permitted
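
To give a feel for how quickly such coordinates multiply, here is a
rough sketch that enumerates every combination of a few of them. The
dimension names and their values are labels I made up from the list
above; they are not field names from the ACAP specification, and the
real protocol has more dimensions than these four.

```python
from itertools import product

# Hypothetical policy dimensions, loosely named after the list above.
# These are illustrative labels, not ACAP field names.
DIMENSIONS = {
    "index_source": ["actual_page", "proxy_source", "fixed_text_or_thumbnail"],
    "lifetime": ["no_limit", "take_down", "recrawl"],
    "format_conversion": ["permitted", "forbidden"],
    "translation": ["permitted", "forbidden"],
}

def all_policies(dimensions):
    """Return every combination of values as a list of dicts."""
    names = list(dimensions)
    return [dict(zip(names, values)) for values in product(*dimensions.values())]

policies = all_policies(DIMENSIONS)
print(len(policies))  # 3 * 3 * 2 * 2 = 36
```

Four small dimensions already yield 36 distinct policies, and each new
dimension multiplies the total rather than adding to it.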

Although publishers will probably define only one or two policies for
their whole site, the protocol allows policies to be set for each
resource (news page, picture, etc.) and therefore the search engine
must store the compiled policies and re-evaluate them when it
encounters each resource. To display content in conformance with
publishers’ wishes, the search engine must store some attributes for
the entire time it maintains information on the page.
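
The bookkeeping this implies can be sketched as a site-wide default
plus per-resource overrides that are kept for as long as the resource
stays in the index. The Policy fields and the PolicyStore class are
hypothetical names of my own, chosen for illustration; a real search
engine would presumably persist this alongside its index rather than
in a dictionary.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Policy:
    # Hypothetical attributes a search engine might need to remember.
    may_index: bool = True
    may_convert_format: bool = True
    may_translate: bool = True

class PolicyStore:
    """Keeps one default policy per site plus overrides per resource."""

    def __init__(self, site_default):
        self.site_default = site_default
        self.overrides = {}  # resource path -> Policy

    def set_override(self, path, **changes):
        # An override starts from the site default and changes some fields.
        self.overrides[path] = replace(self.site_default, **changes)

    def policy_for(self, path):
        # Consulted every time the resource is indexed or displayed.
        return self.overrides.get(path, self.site_default)

store = PolicyStore(Policy())
store.set_override("/news/photo1.jpg", may_convert_format=False)

print(store.policy_for("/news/photo1.jpg"))
print(store.policy_for("/news/story.html"))
```

Even when most publishers set only the default, the lookup has to be
there for every resource, because any one of them may carry an
override.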

Seasoned computer programmers and designers by now can recognize the
hoary old computing problem of exponential complexity–the trap of
trying to apply a new tool to every problem that is currently begging
for a solution. Compounding the complexity of policies is some
complexity in identifying the files to which policies apply. ACAP uses
the same format for filenames as robots.txt does, but some
rarely-used extensions of that format interact with ACAP to increase
complexity. Search engines decide which resources to apply a policy
to by checking a filename such as:


/news/*/image*/

The asterisks here can refer to any number of characters, including
the slashes that separate directory names. So at whatever level in the
hierarchy the image*/ subdirectories appear, the search
engine has to double back and figure out whether it’s part of
/news/. The calculation involved here shouldn’t be as bad as
the notorious wildcard checks that can make a badly designed regular
expression or SQL query take practically forever. For a directory
pathname, there are ways to optimize the check–but it still must be
performed on every resource. And if there are potentially competing
directory specifications (such as /news/*.jpg) the search
engine must use built-in rules to decide which specification applies,
a check that I believe must be done at run-time.
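
Here is a minimal sketch of the kind of check described above, in
which each asterisk is turned into a regular-expression wildcard that
also matches slashes. The sample path is hypothetical. The tie-breaking
rule for competing patterns (the longest matching pattern wins) is
purely my assumption for illustration; I have not taken it from the
ACAP specification.

```python
import re

def pattern_to_regex(pattern):
    # Each '*' becomes '.*', which also matches '/' characters.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))

def matches(pattern, path):
    # re.match anchors only at the start, so the pattern acts as a prefix.
    return pattern_to_regex(pattern).match(path) is not None

def pick_pattern(patterns, path):
    # ASSUMPTION: the longest matching pattern wins. This rule is
    # illustrative only and is not taken from the ACAP specification.
    candidates = [p for p in patterns if matches(p, path)]
    return max(candidates, key=len) if candidates else None

PATH = "/news/2007/sports/images/goal.jpg"
PATTERNS = ["/news/*/image*/", "/news/*.jpg"]

print(matches("/news/*/image*/", PATH))
print(matches("/news/*/image*/", "/sports/images/goal.jpg"))
print(pick_pattern(PATTERNS, PATH))
```

Both patterns match the sample path, so some rule has to choose between
them, and that choice is made again for every resource the crawler
meets.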

The ACAP committee continues to pile on new demands. Part 2 of their
specification adds the full power of their protocol to META tags in
HTML files. This means each HTML file could potentially have its own
set of policies, and the content of the file must be read to determine
whether it does. Once again, publishers are not likely to
use ACAP that way, but the provision of that feature
would require the search engine to be prepared to compile a policy and
store it for each HTML file. As Bide points out, Yahoo! already
provides a “nocontent” class that can be placed on any element in an
HTML file to keep crawlers
from indexing that element. I maintain that such extensions don’t
require search engines to juggle a large set of policies for each
document, as ACAP does.
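
To show what reading a per-file policy involves, here is a small sketch
that scans an HTML page for META tags using html.parser from Python's
standard library. The tag names beginning with "example-policy-" are
invented for this illustration and are not the names ACAP defines; the
point is only that the file's content must be parsed before the search
engine knows whether the page carries a policy of its own.

```python
from html.parser import HTMLParser

class MetaPolicyParser(HTMLParser):
    """Collects META tags whose name starts with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.policy = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name.startswith(self.prefix):
            key = name[len(self.prefix):]
            self.policy[key] = attributes.get("content") or ""

# Hypothetical page; the "example-policy-" names are made up.
PAGE = """<html><head>
<title>Story</title>
<meta name="description" content="A news story">
<meta name="example-policy-translate" content="forbidden">
<meta name="example-policy-snippet" content="permitted">
</head><body><p>Text</p></body></html>"""

meta_parser = MetaPolicyParser("example-policy-")
meta_parser.feed(PAGE)
page_policy = meta_parser.policy

print(page_policy)
```

The resulting dictionary would then have to be compiled and stored for
that one file, in addition to whatever the site-wide rules already say.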

Search engines may be spared some of the complexity (and the
consequent risk of error) of ACAP implementation if the project
provides a library to parse the specifications, but each search engine
still must hook the operations into its particular algorithms and
functions–and these hooks extend to nearly everything it does with
content.

ACAP as a collaboration

The concept of collaboration between search engines and targeted sites
is not new. I spoke enthusiastically about such an effort back in
December 2003, going so far as to call it a harbinger of search’s next
generation.
My interest at that time lay in dynamic content: the huge amount of
facts stored in databases and currently unavailable to search engines
because they normally look only at static content.

For example, you can search for your Federal Express package number on
the Federal Express web site, but only because Federal Express submits
your search to its own database. The content doesn’t exist in any
static form available to Google. But Google can use the Federal
Express site to do the search and act as middleman. As described in
a BusinessWeek article:

…Google is providing this new shipment tracking service even though
it doesn’t have a partnership with FedEx. Rather, Google engineers
have reprogrammed it to query FedEx directly with the information a
user enters and provide the hyperlink direct to the customer’s
information.

This is Web 2.0 long before Tim O’Reilly coined the term: two sites
mashed-up for the convenience of the user. Such combined searches are
flexible, open the door to new applications, and provide a wealth of
data that was previously hidden. Search engines have also implemented
dynamic information retrieval in other areas, such as airline flights
and patents. Searches for addresses turn up names of institutions
located at those addresses, while searches for institutions turn up
addresses, maps, etc.

Contrast such innovation with ACAP. It imposes rules in the place of
flexibility, closes off possibilities rather than opens them up, and
makes data less valuable. When a publisher can require that a specific
blurb be presented instead of the relevant content found by a search
engine, not only does that prevent innovation in searches; it tempts
the publisher to broadcast self-serving promotions that ill serve the
users trying to make informed judgments about the content they’re
searching for.

ACAP is also the opposite of productive collaboration in a technical
sense: far from contributing its own resources (not to mention its own
intelligence regarding the structure and details of its content) to a
search, the publisher is putting an additional burden of calculation
on the search engine.

Finally come the problems of standardization that I described three
years ago in the article From P2P to Web Services: Addressing and
Coordination.
Standards reflect current needs and go only so far as the committee’s
imagination allows. They must be renegotiated by all parties whenever
a new need arises. New uses may be held up until the specification is
amended. And the addition of new use cases exacerbates the complexity
from which ACAP already suffers.

Bide offers a general perspective on our conversation:

“I guess at its heart we have a disagreement here about whether the
owners of content (authors and publishers) have any right to decide
whether and how their content should be used — and then have a
mechanism to make those decisions transparent so that others may know
their intentions. At the moment, in so far as they wish to exercise
that right with respect to content that is openly available on the
Internet, they can express their policies only in multi-thousand word
sets of terms and conditions on their websites. Alternatively, they
can simply keep their content as far away from the Internet as
possible. You may believe that publishers are commercially unwise to
seek to exercise any control over the use made of their content — a
point of view but one with which honest men may disagree. The work
undertaken in the ACAP pilot project is a first step towards solving
that challenge.”

This looks to me like a narrow, negative defense of the project: one
based on publishers’ perceptions of problems in a new medium. I take
the blame here for pushing so hard with my criticisms that I made Bide
focus on the issue of content rights, rather than a more inspiring
justification promoting the hope of creating new business
opportunities. Let’s see how many search engines actually implement
ACAP, and whether the financial rewards make it worthwhile for both
search engines and publishers. I still place my bet on external
sources of innovation and collaboration with user communities. Are
those fostered by ACAP?

December 9:
James Grimmelmann of New York Law School’s Institute for Information
Law and Policy (who is a programmer as well as law professor) wrote a
critique of ACAP as a specification that will interest people who have
to deal with interpreting requirements, and a follow-up.
