Google DataWiki and RDF/XML

December 13th, 2010

While browsing Google Labs today just to see what’s going on over there, I ran across something rather interesting for the ReadWriteLinkedData folks:

I then coded up a nice little service which converts the Atom feed offered by DataWiki to RDF/XML, although it’s a little bit of a kludge. But I did have a few notes on the current state of things over with the DataWiki (many of these things are bugs and issues simply due to being in the early stages of DataWiki, so it’s not like they can’t be fixed!):

  1. Dereferenceability is good. Right now, although many of the URIs provided in the Atom feed can be dereferenced, most can’t even be dereferenced to XML, but only to HTML. The XML namespace of a particular wiki does not even link to a particularly useful description of the Wiki’s contents! Although this might be alright for a simple XML feed, it’s not so nice for RDF, which likes links which can be dereferenced to something meaningful.

    Pablo and the others working on DataWiki shouldn’t take this as a harsh criticism: this seems to be a common problem with people first introduced to RDF (and here, it wasn’t even in RDF to start!). Indeed we’ve had the same discussion with the guys over at Facebook when they put out their Open Graph Protocol, and successfully convinced them to put out a dereferenceable namespace URI.

  2. Describe your schema. Granted, this dovetails nicely with the previous point, in that, once the schema is dereferenceable, it should actually have meaning. Right now, the namespace of a particular wiki is just a base URI, without a fragment identifier. This causes problems if I want to refer to the particular schema properties (e.g. name, comment, etc.). Thus, I’ve taken the liberty of modifying the base namespace to add a fragment ID so as to be able to make the properties dereferenceable (even if they aren’t in practice). Following the reference should (ideally) give some description, however.

  3. Pushback should be easy. It should be possible to push back using any number of methods. I should be able to find out what the schema looks like and push (such as through WebDAV, or RESTful HTTP, or SPARQL) back to the server. Right now, it’s not trivial to actually push changes back with my converter, so I don’t do this. But then, this ties into the bug where, currently, I can’t get XML data for individual data items, and as such, can’t dereference them.

  4. Datatyping is nice, but only if it’s done right. Right now, there’s no way to determine how I should interpret a field. Part of this is because datatypes aren’t supported yet. Well, that’s fine, but I hope that in the future, I will be able to map whatever layman datatypes that are supported into corresponding XML Schema Datatypes and RDF Resources. The latter is especially important if there are ever links between data-items, or we lose the whole point behind a Wiki (or Linked Data!).

  5. Remember web applications! One thing I’ve tried to do with the converter is to keep it simple: it just uses an XSL transformation (so anyone can reuse it with their own DataWiki). I’ve also tried to play nicely with AJAX by enabling cross-origin requests (CORS). Hopefully DataWiki can do the same.
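To make point 2 concrete, here is a minimal Python sketch of the fragment-ID idea: given a wiki's base namespace, build a distinct URI for each schema property by adding a fragment identifier. This is not the converter's actual XSL. The function name `property_uri` and the namespace `http://example.org/wiki/12345` are hypothetical, chosen only for illustration.

```python
def property_uri(namespace: str, local_name: str) -> str:
    """Build a URI for a schema property by adding a fragment identifier.

    If the namespace already ends in '#', it is used as-is.
    """
    base = namespace if namespace.endswith("#") else namespace + "#"
    return base + local_name


# Hypothetical wiki namespace, for illustration only.
WIKI_NS = "http://example.org/wiki/12345"

name_uri = property_uri(WIKI_NS, "name")
comment_uri = property_uri(WIKI_NS, "comment")
print(name_uri)
print(comment_uri)
```

With a fragment in place, every property shares one document URI (the namespace), so a single dereferenceable schema description could cover all of them.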

Anyway, this is rather exciting stuff, so I hope that this will keep rolling. As Semantic MediaWiki and the ReadWriteLinkedData efforts have shown (not to mention DataPress and RDF support in Drupal 7), there’s definitely a need to actually reveal meaningful data in a user-editable format. With people in Google starting to take to the idea as well, I don’t think it will be too long before we start seeing this idea really take off for the public at large.

TAAC, Cleaned Up and Ready

October 18th, 2010

So I’ve finally been able to get around to moving the TAAC demos to their final location (on DIG’s demo server, dice) and also porting the SVN history to TAAC’s new home in a Mercurial repository.

Now, you should be able to run the demos on a permanent home, and have a much easier time hacking on TAAC if you so desire. (As before, it still needs python-openid and it requires that you check out the air-reasoner repository in the same place to a directory named “tmswap”).

Work Continues!

September 13th, 2010

My apologies for being relatively silent on this blog, but I tend to be rather close-lipped when it comes to social media (blogs, Twitter, Facebook, etc.). In any case, those of you visiting this site may be interested to know about certain projects I’ve worked on in the past (especially TAAC and/or AIR), so I’ll try to do my best to keep you up to date.

  • TAAC: TAAC has unfortunately been suffering from code-rot for a while, so I am in the process of updating it and cleaning up the code to work again and get it into the Mercurial repository that now hosts the AIR reasoner, Tabulator, and other projects at DIG. I hope to get around to that by the end of the month. Also, as a result of a server move, I’ve switched the location of the demos (and will be switching them one more time once I set TAAC up on our demo server).
  • AIR Reasoner: The reasoner for the semantic-web rules language AIR is another project I’ve been working on (it is also used in TAAC). We’re in the middle of cleaning up the code and trying to set the features down to actually do a proper release of the reasoner with some explanatory documentation, but that goal is still a little elusive. You can check out our progress on the “refactor” branch of the DIG Mercurial repository at http://dig.csail.mit.edu/hg/air-reasoner/. We’re hoping to get this done by December.
  • “Dprop” and Distributed Data Propagation: This was my master’s thesis, which should be on DSpace at MIT sometime in the near future. The code for that is all uploaded to the Mercurial repository at http://dig.csail.mit.edu/hg/dprop/, although it is not clearly documented. Some examples are present in the examples/ directory which may serve as guidance as to how to use dprop.

Well, that’s all I can offer for now. I may revise this post as I note other things I forgot, however, so stay tuned.

TAAC Update

January 30th, 2009

TAAC should now, with the latest update, support Apache 2 much more nicely (apparently, mod_python under Apache 2 no longer forwards SSL variables as environment variables, so you have to access them in a different manner). There are still some issues with Apache 2’s handling of SSL renegotiation and use of the SSLVerifyClient directive that need to be resolved, but the bottom line is that the demos should now be working again on this server (updated to Apache 2 in the past month).

In addition, public SVN access to check out the TAAC code should now be available using the username and password ‘dig’.

Open Letter to Whomever Stole My Bike

January 30th, 2009

For the sake of saving it somewhere, here is my brother’s “Open Letter to Whomever Stole My Bike”:

Open Letter to Whomever Stole My Bike

The Guy Whose Bike You Stole

Enjoy It.
Seriously, I mean it.
Some things I think you should be aware of though.

1. That thing was made by JC Penney (They made bicycles?) around the late 1970s; and has untold numbers of rust spots on it. Notice how the seat is up somewhat high on it? It isn’t because I’m tall, oh no; it’s because the nut is actually rusted on. You know, I once tried some WD-40, and it didn’t work on it either. Maybe you’ll have better luck.

2. The back brakes don’t work at all, while the front brakes squeak worse than anything else I’ve ever heard. Come to think of it, the wheels squeak just by virtue of spinning too.

3. The front reflector has a nasty habit of actually not being existent anymore, and with the strange mount that it uses? Good luck finding a replacement.

4. Notice how that front wheel wobbles a bit? Yeah, that won’t be going away.

5. The chain is far too long causing extreme pain whilst going up any incline; speaking of which…

6. Note how it displays proudly on the front that it’s a “five speed” bike?
It hasn’t been five speed since the early days of the Reagan administration.

7. The seat was not only placed far too high, but is also ripped in a rather awkward way. I hope you enjoy that especially.

8. The kickstand? Rusted over.

9. The bike “lock” on it? Yeah. The code wasn’t even resettable / was easily unlockable with a biro/pen.

You know, come to think of it, I never much cared for the color either. The shade of brown never quite suited me. In fact, I’m fairly sure that had the thing actually somehow gained sentience, that its first conscious act would be to commit suicide in the Mill Stream simply due to this fact alone.

It must be mentioned that I was planning on actually replacing it within short order, and was wondering how to properly dispose of said bicycle; luckily, you made it that much easier – no more will I actually need to consider abandoning it outside of a thrift store or behind someone’s truck in an ingenious scheme to gain insurance money.

Wow, this is actually the best theft ever, no joke; now off to Craigslist to find a vintage Schwinn bike.

With Much Love,

The Guy Whose Bike You Stole

Quantifying cwm Variables…

December 18th, 2008

Mostly for my benefit, but here are a few examples of how cwm’s N3Rules translate into formal logic:

  • Global universal quantification:
    @prefix : <#> .
    @forAll :x  .
    
    { :x  :a :b . } => { :x  :c :d . } .
    
    :someValue :a :b .

    ∀x (a(x, b) → c(x, d))

    Therefore the above entails the additional statement :someValue :c :d . as :x is bound to :someValue on the RHS.

  • Global existential quantification:
    @prefix : <#> .
    @forSome :x  .
    
    { :x  :a :b . } => { :x  :c :d . } .
    
    :someValue :a :b .

    ∃x (a(x, b) → c(x, d))

    Therefore the above entails no additional statements.

  • LHS universal quantification:
    @prefix : <#> .
    
    { @forAll :x  . :x  :a :b . } => { :x  :c :d . } .
    
    :someValue :a :b .

    (∀x a(x, b)) → c(x, d)

    Therefore the above entails no additional statements.

  • LHS existential quantification:
    @prefix : <#> .
    
    { @forSome :x  . :x  :a :b . } => { :x  :c :d . } .
    
    :someValue :a :b .

    (∃x a(x, b)) → c(x, d)

    Therefore the above entails the additional statement :x :c :d . as :x is unbound on the RHS.

  • RHS universal quantification:
    @prefix : <#> .
    
    { :someValue :a :b . } => { @forAll :x  . :x  :c :d . } .
    
    :someValue :a :b .

    a(someValue, b) → (∀x c(x, d))

    Therefore the above entails (generally) @forAll :z . :z :c :d .

  • RHS existential quantification:
    @prefix : <#> .
    
    { :someValue :a :b . } => { @forSome :x  . :x  :c :d . } .
    
    :someValue :a :b .

    a(someValue, b) → (∃x c(x, d))

    Therefore the above entails the additional statement [ :c :d ] .
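The first case (global universal quantification) can be mimicked with a few lines of Python. This is only a toy sketch, not how cwm works internally: triples are plain tuples, `"?x"` stands in for the universally quantified `:x`, and `apply_universal_rule` is a made-up helper that handles only premises of the form `(?x, predicate, object)`.

```python
# Triples are (subject, predicate, object) tuples; "?x" marks the variable.
facts = {(":someValue", ":a", ":b")}


def apply_universal_rule(facts, premise, conclusion):
    """Apply "for all ?x: premise => conclusion" once over the given facts.

    Only premises of the form ("?x", predicate, object) are supported.
    """
    derived = set()
    for subject, predicate, obj in facts:
        # Bind ?x to any subject whose predicate and object match the premise.
        if (predicate, obj) == premise[1:]:
            derived.add((subject,) + conclusion[1:])
    # Report only statements that were not already known.
    return derived - facts


new_facts = apply_universal_rule(
    facts, ("?x", ":a", ":b"), ("?x", ":c", ":d")
)
print(new_facts)
```

Here `:x` is bound to `:someValue` when the premise matches, so the one derived triple corresponds to `:someValue :c :d .` from the first example.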

Finally, two trickier specific examples: “If there exists a foaf:Person that all (known) foaf:Persons foaf:know, then there exists a :PopularPerson” and “any foaf:Person that is foaf:known by all (known) foaf:Persons is a :PopularPerson” can’t be done properly without completely closing the world. cwm cannot do this without artificially closing the world through built-ins.
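To see why the world has to be closed, consider a rough Python sketch that treats a fixed dictionary as the complete list of known people. The names and the `knows` data are invented for illustration, and the sketch assumes (my reading, not stated above) that a person need not know themselves to count.

```python
# Hypothetical closed world: these are *all* the people we know about.
knows = {
    ":alice": {":carol"},
    ":bob": {":carol"},
    ":carol": {":alice"},
}


def popular_people(knows):
    """Return everyone who is known by all *other* listed people."""
    people = set(knows)
    return {
        candidate
        for candidate in people
        if all(candidate in knows[p] for p in people if p != candidate)
    }


popular = popular_people(knows)
print(popular)

# Learning about one more person can retract the earlier conclusion,
# which is exactly what open-world reasoning cannot allow.
knows_more = dict(knows)
knows_more[":dave"] = set()
print(popular_people(knows_more))
```

The second call yields an empty set: the "for all known persons" check is only meaningful once we commit to the list being complete.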

TAAC in Action

December 12th, 2008

TAAC Examples

I’ve posted three examples that utilize TAAC in some manner.

You can test any of these yourself if you present the proper client certificate linked to your FOAF file (otherwise, without a client certificate, you won’t be able to authenticate with FOAF+SSL.) If you don’t have a properly configured certificate or FOAF file, Henry Story has a short description of how you can set this up in Firefox 3 with some utilities in the sommer repository. In addition, this server requires you to explicitly provide a certificate (as client certificates are optional).

So How Does TAAC Work?

As mentioned previously, Henry Story has some excellent descriptions of how the FOAF+SSL protocol works in general. TAAC is merely an implementation of this, but goes further to implement an authorization framework. How does this work though? The following diagram goes a ways toward explaining TAAC’s design (especially with regard to authorization) in general.

(A diagram of TAAC)

TAAC acts as a proxy for any URI access within the directory it’s set up in (thanks to mod_python). On every access, it will check the requested URI against the list of URIs having an rein:access-policy (as populated from the file specified in the POLICY_FILE variable). If no access policy exists, TAAC gladly permits normal access without any needed authentication.

If an access policy exists, however, TAAC will immediately attempt to properly reach a successful completion of the FOAF+SSL authentication protocol. I won’t go into significant details here, as Henry Story gives an excellent overview of the protocol (in a somewhat earlier state, though the same principles still apply) on his blog.

Following this, TAAC takes the successfully authenticated URI-token and logs the attempted access to a log file (specified by LOG_FILE). Taking this generated resource describing the access, and the AIR policy attached with the rein:access-policy triple, TAAC then proceeds to run an AIR reasoner over the policy with the given log resource. If the resource describing the access is concluded to be air:compliant-with the associated access-policy, the fact that access was granted according to the policy is logged, and access is granted. Otherwise, the fact that access was denied is logged, and access is denied with a 403 response.
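The flow described above can be sketched roughly as follows. This is not TAAC's code: `handle_request`, `Decision`, and the callback parameters are hypothetical names, the log is a plain list rather than an N3 file, and denying with a 403 when authentication fails is an assumption on my part (the description only specifies the 403 for non-compliance).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Decision:
    status: int  # HTTP status code returned to the client
    reason: str


def handle_request(
    uri: str,
    policies: Dict[str, str],
    authenticate: Callable[[], Optional[str]],
    is_compliant: Callable[[str, str, str], bool],
    log: List[str],
) -> Decision:
    # 1. No access policy for this URI: permit normal access.
    policy = policies.get(uri)
    if policy is None:
        return Decision(200, "no access policy")

    # 2. A policy exists: authenticate first (FOAF+SSL in TAAC's case).
    identity = authenticate()
    if identity is None:
        # Assumption: a failed authentication is treated as a denial.
        return Decision(403, "authentication failed")

    # 3. Log the attempted access, then reason over the policy.
    log.append(f"attempt {identity} {uri}")
    if is_compliant(identity, uri, policy):
        log.append(f"granted {identity} {uri}")
        return Decision(200, "compliant with policy")

    log.append(f"denied {identity} {uri}")
    return Decision(403, "not compliant with policy")


# Stand-ins for the real authentication and AIR reasoning steps.
policies = {"/my_file.html": "/my_file.policy.n3"}
access_log: List[str] = []
decision = handle_request(
    "/my_file.html",
    policies,
    authenticate=lambda: "http://www.example.com/bob.rdf#bob",
    is_compliant=lambda who, uri, policy: True,
    log=access_log,
)
print(decision, access_log)
```

Passing the authentication and reasoning steps in as callables keeps the sketch independent of any particular mechanism, which mirrors how authentication and authorization are separate concerns here.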

Authentication and Authorization on the Open Social Web with TAAC

December 11th, 2008

Update 3: The subversion repository for tmswap has been superseded by the Mercurial repository for air-reasoner.

Update 2: The subversion repository should now have public checkouts enabled with the username and password ‘dig’.

Update: The subversion repository is currently not set up for external access. I probably won’t be able to get this resolved until Monday at the earliest. For the time being, you can extract this tarball into the directory you wish to protect, and skip the first two steps.

Recent discussions on the foaf-protocols mailing list have been pushing the FOAF+SSL protocol (discussed earlier on both Henry Story’s and my blogs) towards a more finalized state, pending some clarification of issues with generating the self-signed certificates that serve as the key to the protocol. As has been mentioned on the list and the blog several times, I have been maintaining an independent Python implementation of the FOAF+SSL protocol, and I now feel that the implementation is at a stable enough state to officially offer up instructions for installing TAAC.

Before I give instructions on how to do so, however, let me digress onto an important subtopic, that being the subtle difference between authentication and authorization, as the dichotomy is critical to understanding how TAAC works. FOAF+SSL is fundamentally an authentication mechanism. It provides a method to confirm that the individual presenting the SSL certificate is, in fact, the person who is also in control of the FOAF resource specified in the certificate. It does not, however, specify any criteria for how access should actually be granted. It only establishes an identity.

TAAC implements FOAF+SSL as one of several authentication mechanisms tested, including a sample implementation of the RDFAuth mechanism, as well as an OpenID-based mechanism. TAAC, however, only implements these authentication mechanisms as a means to the goal of achieving a flexible Semantic-Web-friendly authorization framework. While the language and reasoning is still very much in flux, the idea of TAAC is to permit the creation of distributed access control lists and complex access control policies on top of semantic web data. Indeed, the current implementation (slowly) permits such authorization rules as “Only friends of people I specify as friends or the friends I specify can access this page” or “MIT students who are sophomores or juniors currently taking 6.805 can access this page” without having to maintain cumbersome access control lists, instead deferring to collections of data compiled by others. In effect, we can rely on MIT to maintain the list of current students, and accurately state their class year and the classes they are taking, such that we can merely reason over that data without having to compile an access control list from it.
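As a toy illustration of the “friends or friends of friends” style of rule, the check below walks a small who-knows-whom mapping. TAAC expresses such rules as AIR policies over RDF, not as Python; the `knows` dictionary, the names in it, and the `may_access` helper are all hypothetical, standing in for data that each person would publish in their own FOAF file.

```python
from typing import Dict, Set


def may_access(requester: str, owner: str, knows: Dict[str, Set[str]]) -> bool:
    """Allow the owner's friends and the friends of those friends."""
    friends = knows.get(owner, set())
    if requester in friends:
        return True
    # Defer to each friend's own list rather than keeping one central ACL.
    return any(requester in knows.get(friend, set()) for friend in friends)


# Hypothetical data; in practice each entry comes from a different FOAF file.
knows = {
    ":juliette": {":romeo"},
    ":romeo": {":mercutio"},
    ":mercutio": {":tybalt"},
}

print(may_access(":romeo", ":juliette", knows))
print(may_access(":mercutio", ":juliette", knows))
print(may_access(":tybalt", ":juliette", knows))
```

The owner only maintains her own list of friends; whether `:mercutio` gets in depends on what `:romeo` publishes, which is the point of deferring to data compiled by others.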

Installing TAAC

Before You Install: Make sure you have installed the python-openid and pyCrypto >= 2.0.1 frameworks and are running mod_python on your server. While python-openid is not absolutely necessary for FOAF+SSL, TAAC is implemented with an additional vestigial OpenID mechanism that may or may not be integrated as an alternative mechanism to FOAF+SSL for FOAF-based authentication schemes, and hence requires the library.

  1. Get the TAAC source code and copy the files and directories enclosed in the directory in which you want to protect some files. The source code is available in an SVN repository at https://svn.csail.mit.edu/dig/TAMI/2008/taac/proxy.
  2. Get the tmswap directory needed for TAAC to properly operate and copy it into the directory containing the TAAC code. The source code was originally available in an SVN repository at https://svn.csail.mit.edu/dig/TAMI/2007/cwmrete/tmswap, but the tmswap SVN repository has been superseded by the air-reasoner Mercurial repository at http://dig.csail.mit.edu/hg/air-reasoner/. Clone the repository and either:
    • Copy the tmswap directory (if it exists in the current revision)
    • Switch to the refactor branch and copy the contents of the root directory of the repository into a new tmswap directory
  3. Configure TAAC. The primary configuration for TAAC is in taac/config.py. You most probably don’t need to change any of the settings, but you should be aware of their values, as they affect the remainder of this installation process. POLICY_FILE is the relative path from proxy.py to the file that links your protected files to the corresponding policy files governing access. POLICY_TYPE is the MIME type of POLICY_FILE (‘text/rdf+n3’ or ‘application/rdf+xml’ most likely). LOG_FILE is the relative path from proxy.py to the file to log access information to. The other settings are not terribly relevant to FOAF+SSL and can be left alone.
  4. Set up your policy file. Your policy file (at the path specified by POLICY_FILE, defaulting to ‘./policies.n3’) is the key to protecting your URIs with FOAF+SSL. The policy file is an RDF file that links resources representing the protected URIs to their corresponding policy files. This is most easily done with the rein:access-policy (http://dig.csail.mit.edu/2005/09/rein/network#access-policy) property (subject to change in future TAAC releases). Here’s a very simple policies.n3 that protects my_file.html:
    @prefix rein: <http://dig.csail.mit.edu/2005/09/rein/network#> .
    
    <./my_file.html> rein:access-policy <./my_file.policy.n3> .
    
  5. Create a policy. The policy is the access-policy attached by policies.n3. This policy is written in the AIR language, which may be somewhat daunting for someone trying to write their first policy. A couple of sample policies include http://www.pipian.com/rdf/tami/juliette.policy.n3#JulietteLocationDissemPolicy, which permits any valid authentication via FOAF+SSL, and http://www.pipian.com/rdf/tami/juliette.policy.n3#JulietteFOAFDissemPolicy, which allows only friends and friends of friends of Juliette access.
  6. Create your log file with mode 0666. This is usually ‘log.n3’.
  7. Edit your .htaccess file. In order to enable the protection, you need to create a .htaccess file that adds proxy.py as a mod_python proxy and explicitly enables SSL client certificates to be passed to proxy.py. http://www.pipian.com/rdf/tami/htaccess is a good example for Apache 1.3 SSL servers. Apache 2.0’s mod_ssl requires somewhat different flags to enable passing SSL client certificates (Melvin Carvalho says that SSLOptions should be set to +StdEnvVars and +ExportCertData).
  8. TAAC should now be set up and running.
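For steps 3 and 6, a small Python sketch may help. The three setting names come from the description above, as do the ‘./policies.n3’ default and the usual ‘log.n3’ log name; the POLICY_TYPE value shown is just one of the two likely options, and the actual contents of taac/config.py may differ. The helpers `create_log_file` and `file_mode` are hypothetical, and the demo writes to a temporary directory rather than a real TAAC install.

```python
import os
import stat
import tempfile

# Settings as described in step 3 (values are illustrative).
POLICY_FILE = "./policies.n3"
POLICY_TYPE = "text/rdf+n3"  # assumed; 'application/rdf+xml' is the other likely value
LOG_FILE = "./log.n3"


def create_log_file(path: str) -> None:
    """Create the log file if it is missing and set its mode to 0666."""
    # Opening in append mode creates the file without truncating an existing log.
    with open(path, "a"):
        pass
    # chmod sets the mode explicitly, regardless of the process umask.
    os.chmod(path, 0o666)


def file_mode(path: str) -> int:
    """Return just the permission bits of a file."""
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate in a throwaway directory instead of the protected directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "log.n3")
    create_log_file(demo_path)
    demo_mode = file_mode(demo_path)

print(oct(demo_mode))
```

Mode 0666 makes the log writable by whichever user the web server runs as, which is presumably why the step asks for it; on a shared host that also means other local users can write to it.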

The above instructions should work, but I have not officially tested them on a clean server.

It is worth noting that TAAC is still very much in flux and is alpha-quality software, and tends to follow the discussions on the foaf-protocols list rather closely, so the above instructions and configuration options may change without warning. Furthermore, there are some caveats with TAAC. In particular, it only currently allows for static policies and static protected URIs. It’s my hope to extend TAAC such that it will have hooks to allow for custom policies dependent on script arguments in the URL, no longer requiring static lists of all possible URIs (so protecting scripts is currently not likely to work well, especially if they take free-form arguments like session variables).

So that hopefully wraps it up a bit, and will get you started on getting a FOAF+SSL implementation set up of your own. TAAC may be clunky now, but the hope is to streamline it such that it’s easily integrated into any Python web application.

Issues with a FOAF-based Authentication System

September 5th, 2008

As I’ve been working on TAAC, I’ve started to become concerned about potential weaknesses with any FOAF-based identity authentication system (be it RDFAuth, OpenID, or FOAF+SSL) and that’s that ALL systems, with the possible exception of RDFAuth (due to its reliance on PKI), have their weakest link as the integrity of the server hosting the FOAF file. All three systems rely on data in the FOAF file to ‘authenticate’ against, but this poses problems. Take, for example, the following scenario:

Alice runs a website that accepts an OpenID+FOAF system (it works equally well with FOAF+SSL). Bob is a client of Alice, and regularly uses the authentication scheme Alice has implemented. When authenticating, he traditionally authenticates against his FOAF URI, http://www.example.com/bob.rdf#bob. The file bob.rdf has information that links to Bob’s OpenID, http://www.example.com/bob, permitting him to authenticate with his (self-run) OpenID provider.

Eve wants to see the information that Bob gets to see on Alice’s website, and thanks to some shoddy system administration, finds a security hole that allows her to get access to the filesystem. Ignoring the other private information acquired in this way, Eve silently replaces bob.rdf with her own FOAF file that has one simple change: the OpenID associated with http://www.example.com/bob.rdf#bob is now http://www.example.com/eve, which is Eve’s OpenID provider. Eve authenticates against her own OpenID provider and gets access as Bob to Alice’s website, does her dirty work, and then quietly returns the original FOAF file so that Bob is none the wiser. There’s precious little evidence that Eve intruded, and only an alert sysadmin might note the erroneous login. Meanwhile, Alice is barely aware of any difference other than that the OpenID changed for one particular login.

In summary, as Henry Story admitted (Point 5 in the FOAF+SSL description), these methods only assert that the person accessing any protected resource has ‘write access’ to their FOAF file… But that doesn’t assert that they’re the same person.

With the common weakness of many self-hosted domains having poor security protocols, a FOAF-based Authentication System could be disastrous. The only plausible ‘stopgap measure’ might be requiring the system as a whole to cache the authentication credentials (e.g. OpenID, public key URL, or X.509 hash) and refuse access to people who present credentials that have changed. This adds a layer of complication to the mix as well, as it would require out-of-band communication to ensure that the ‘cached’ credentials are removed or replaced with new credentials manually… And even so, there is still the risk of incorrect authentication credentials being presented absent any evidence they are incorrect (e.g. Eve logs in before Bob ever does, or does so in the period where Bob’s cached credentials have been deleted, establishing her credentials in place of his own). There are ways around this, but they seem a bit kludgy to me (e.g. using the old OpenID/X.509 cert, which may not exist due to security risks, to authenticate the new one; checking against a public key server to see if there’s any indication that a public key has been revoked/replaced).
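The stopgap of caching credentials amounts to pinning the first credential seen for each FOAF URI. Below is a minimal Python sketch of that idea, using the Bob and Eve scenario from above; the `CredentialCache` class is hypothetical, and an in-memory dictionary stands in for whatever persistent store a real system would need.

```python
from typing import Dict


class CredentialCache:
    """Remember the first credential seen for each FOAF URI."""

    def __init__(self) -> None:
        self._pinned: Dict[str, str] = {}

    def check(self, foaf_uri: str, credential: str) -> bool:
        """Accept a credential only if it matches the pinned one.

        The first credential presented for a URI is pinned and accepted.
        """
        known = self._pinned.get(foaf_uri)
        if known is None:
            self._pinned[foaf_uri] = credential
            return True
        return known == credential

    def reset(self, foaf_uri: str) -> None:
        """Drop a pinned credential (the out-of-band step)."""
        self._pinned.pop(foaf_uri, None)


cache = CredentialCache()
bob = "http://www.example.com/bob.rdf#bob"

first_login = cache.check(bob, "http://www.example.com/bob")    # Bob's OpenID
swapped_login = cache.check(bob, "http://www.example.com/eve")  # Eve's swap
print(first_login, swapped_login)

# The remaining weakness: after a reset, whoever logs in first gets pinned.
cache.reset(bob)
eve_after_reset = cache.check(bob, "http://www.example.com/eve")
print(eve_after_reset)
```

The last call shows the caveat from the paragraph above: if Eve logs in before Bob does, or during the window after a reset, her credential is the one that gets cached.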

Are we sure that a FOAF-based Authentication System is secure enough? At the very least, it seems like we need proactive sysadmins maintaining the system to ensure it remains secure… And can we afford that?

Back to TAAC

September 3rd, 2008

So I’ve finally got a chance to return to working on TAAC, an access control mechanism for the web that integrates FOAF-based identification with access control rules. I’ve been doing some more thorough testing on the slow-down issues explained two posts back, and found that the slowdown, while significant, appears to be about 13 seconds or so, on average, on this server, a Linode virtual private server which I expect typifies an average web host (if not better than average).

Several attempts at profiling (aside from creating significantly increased processing times, up to 10x longer) led to the conclusion that, in fact, most of that time is spent in the second phase (post-authentication, during reasoning), which is where I’d EXPECT the slowdown to be. Granted, this now becomes a problem that can be solved in part by Moore’s Law, but even so, some speedups would be nice to allow it to be implemented today. I plan on running the same code on a relatively modern test server that’s dedicated more or less to supporting these tests, so it will likely run faster there.

It’s worth considering that this is running on a variant of the cwm reasoner on top of a re-implemented Rete reasoner, and, seeing how it’s all in interpreted Python, rewriting it in compiled C code (or even Java) would probably see a significant speed-boost, but that’s not a terribly productive line of work (except where trying to actually push out a commercial product). It might also be worth exploring other reasoning approaches to improve the speed.

Even so, I’m going to try looking at the other authentication approaches to see what the benefits and costs of them are… I think the more RESTful approach without OpenID may have some arguments in favor of it, but I doubt they’re going to be based solely on speed.
