thinkingmachine
computers, people, and interaction  
Registration and Unintended Consequences

Usually the law of unintended consequences decrees that things will be much worse than expected, but every now and then there's a pleasant surprise. The story comes from Topix, via Greg Linden. I won't rehash it in too much detail, but the upshot is that Topix got rid of the registration requirement on their user forums, and not only did participation increase dramatically (as expected), the spam rate actually decreased (pleasant surprise).

The thing is, requiring registration has some unintended consequences of its own. In particular, lots of people won't do it. And, on the other hand, spammers, trolls, and immature jerks are happy to do it. So (proportionately at least), you end up with more crappy posts. These observations are from the "2ch principles"; 2ch is an anything-goes web forum in Japan, and people have modeled their forums on 2ch's, having observed the benefits of open posting.

Another one of the mentioned principles is "anonymity counters vanity". The idea is that registered users will become cliquish, protecting their turf and jockeying over pride and identity. An open, more anonymous system essentially gives less reward to pride of identity, and posts are more topic-focused. This seems to counter another bit of web conventional wisdom, though, which is that reputation is an incentive for good behavior. The 2ch observation is that, at least, reputation is a mixed blessing; people will behave better when they want to protect their reputation, but they will also spend more time maintaining, promoting, and jockeying over that reputation. The cost/benefit tradeoff is probably different for different uses. On eBay, reputation is needed to provide for some trust when making costly exchanges; on a message forum, perhaps it's just a distraction.

Blog Spamming Gets Worse

CNET reports on a sudden flurry of blog spam activity, apparently due to one clever and obnoxious spammer. It's not email spam, comment spam, or trackback spam, but thousands and thousands of blogs created on Blogspot with links to the spammer's sites embedded in text snagged from popular blogs.

It's pretty annoying, not to mention depressing, what with the apparently eternal arms race we're locked into here. But the most surprising thing in the article to me was that Google doesn't seem to do any serious account verification on its Blogspot service (or didn't until last week). It's not like captchas are top secret advanced technology -- pretty much everyone uses them now. Where was Google?

The other thing I don't quite get is: why did all these crap blogs create so much trouble? I mean, it's the nature of the web that there's all kinds of crap out there -- some spammer adding more crap sites shouldn't make it appreciably worse. This isn't like spam mail or comment spam, where someone is shoving their message into your inbox. The problem here seems to have come from the fact that the spammer cleverly made all his fake blogs highly appealing to search engines. They started to appear in people's search results and RSS feeds (why? do people have RSS feeds of open searches?), and that caused the problem.

So, why? You know, if people had been doing that search on Google, I don't think they would have gotten all those crap results, because Google takes into account reputation (in terms of incoming links) in its results rankings -- crap spam sites that are only linked to by other crap spam sites shouldn't get a reputation boost. So to Technorati and PubSub and so on: do what Google does. Not that Google is infallible (see above about captcha), but these blog search engines are talking about blocking Blogspot from their results. Which will work precisely as long as spammers don't crack other blog hosting sites. Anyway, this is probably going to get worse before it gets better.
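To make the "reputation in terms of incoming links" idea concrete, here is a minimal sketch of a simplified PageRank-style score. It is not Google's actual ranking, which is far more involved. The page names, the tiny link graph, and the `link_rank` function are all made up for illustration. The spam ring links only to itself and gets no links from outside.

```python
# Hypothetical link graph: page -> pages it links to.
# "hub" is linked to by several ordinary pages; the spam pages
# only link to each other and nobody else links to them.
LINKS = {
    "a": ["hub", "b"],
    "b": ["hub", "c"],
    "c": ["hub"],
    "hub": ["a"],
    "spam1": ["spam2"],
    "spam2": ["spam3"],
    "spam3": ["spam1"],
}


def link_rank(links, damping=0.85, iterations=50):
    """Score pages by incoming links with a simplified PageRank iteration."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for source in pages:
            targets = links.get(source, [])
            if targets:
                # A page passes its score on, split evenly among its links.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for target in pages:
                    new_rank[target] += damping * rank[source] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    for page, score in sorted(link_rank(LINKS).items(), key=lambda kv: -kv[1]):
        print(f"{page:6s} {score:.3f}")
```

In this toy graph the spam ring never rises above the average score, while the page that ordinary pages link to does. It also shows the limits of the idea: a closed ring still holds on to an average score, so a real ranking needs more than this to push link farms down.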

KDD2004: Data Mining and Spam

Pedro Domingos (from UW) presented a paper (co-authored with a number of UWers) about data mining in the presence of an adversary who is deliberately trying to deceive the data miner. This was the big hit of the conference so far. He made the point that this happens in many cases -- spam detection, intrusion detection, counterterrorism, etc. -- where there is an adversary who can alter the data to prevent the data miner from detecting what he seeks to detect. He argued that this problem has not been addressed before in the data mining field but is interesting and important.

Learning Models of Human Behavior

Wouldn't it be great if someone could develop a way to mine the web to figure out which objects are typically used in a given activity? Maybe one way that they could do it would be to look at how often an object term shows up in a Google query when paired with the related activity. If you compared that to how often the object showed up in general, then maybe you could get a probability of the object's use when performing that activity.
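As a rough sketch of that arithmetic (not any real project's method): divide the number of pages that mention the object together with the activity by the number of pages that mention the activity, then compare that to how often the object shows up overall. The function name and all the counts below are invented for illustration; they are not real query results.

```python
def object_use_estimate(pair_hits, activity_hits, object_hits, total_pages):
    """Estimate how strongly an object is tied to an activity from hit counts.

    Returns (p_object_given_activity, lift), where lift compares that
    probability to the object's overall frequency on the web.
    """
    if activity_hits <= 0 or object_hits <= 0 or total_pages <= 0:
        raise ValueError("hit counts must be positive")
    # Fraction of pages about the activity that also mention the object.
    p_object_given_activity = pair_hits / activity_hits
    # Fraction of all pages that mention the object at all.
    p_object = object_hits / total_pages
    # Lift > 1 means the object appears with the activity more than chance.
    lift = p_object_given_activity / p_object
    return p_object_given_activity, lift


if __name__ == "__main__":
    # Invented counts for "tea bag" and "making tea" (illustration only).
    probability, lift = object_use_estimate(
        pair_hits=800,
        activity_hits=2_000,
        object_hits=5_000,
        total_pages=1_000_000,
    )
    print(f"P(object | activity) ~ {probability:.2f}, lift ~ {lift:.0f}x")
```

The estimate is only as good as the pages behind the counts.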

Of course this would completely fail if there were web-pages that mentioned activities in the same breath as objects which were completely unrelated. For example, if there were a web-page somewhere that suggested "I like to eat tea-bags when I use the toilet", or "I frequently find that my television viewing is enhanced by sleeping with a vacuum", or "last night I changed a baby's diaper with a wooden spoon and a jar of peanut butter strapped to my dog". Fortunately for such a hypothetical research project though, there aren't any web-pages that say things like that....

What is Programming by Demonstration?

In the AI field people generally know what you mean by the term "Programming by Demonstration." When you pin them down on the definition it seems to settle on: A computer learning a macro from what you are doing repetitively in a text editor.

But at a higher, more general level, what is it? All a computer can do is execute a program, so any machine learning is "programming" a computer. And all learning is learning from demonstrations of things. So from at least a linguistic standpoint, the phrase Programming by Demonstration seems pretty meaningless. Judging from the body of work that calls itself PBD, though, I would say that it has the following qualities:

1) "It" learns from a very small number of examples (like maybe 1).
2) "It" learns a procedural language from the examples.
3) "It" operates in a discrete environment without non-determinism.

Discuss amongst yourselves...

Similar words, categories, and ontologies

A post on grammar in Agoraphilia raises some interesting questions about categories and ontologies. Perhaps it caught my eye just because we've been thinking a lot about ontologies at Intel Research lately, with a view to classifying everyday objects that people use in their daily activities. In his post, he also references Eleanor Rosch, a pioneering cognitive scientist who has thought a lot about categories. All of which reminds me that beyond the practical questions of using ontologies in applications, there are all these interesting issues to philosophize about.

WWW2004 thoughts

WWW is always an interesting conference. The range of relevant topics is quite wide, from cache-and-network type stuff for optimizing performance to speculative artificial-intelligence-type ideas, to sociological analysis and theory of what people actually do on the web. And so, going from one poster to another, or slipping out of your usual track into some other talk, can be surprising, with the occasional benefit of jolting you into a new idea.

The other interesting thing about WWW is that it does represent, in some ways, much of the brainpower at the center of web developments. Many of the people involved in standards and so on are there, and many of today's papers will be tomorrow's hot new ideas. On the other hand, so much of what happens on the web and affects it for regular users makes no appearance among the pointy-headed types at all. Some of it is just secretive (e.g., Google is well-represented at these things, but they never talk about what they're doing) and some of it just pays no attention to research papers (e.g., most ecommerce, publishing, and daily stuff that people use).

The tension here appears all the time, for example in the contrast between all the cool research ideas people have for search and data extraction, and what people actually do every day. Or the contrast between what WWWers hope to do with the semantic web and the reality of how much attention span most people have for such complexity. Or in the way that search engine response time and ease-of-use have basically eclipsed many clever ideas that would be too costly to add.

If pressed, I'd say this kind of contrast appears in many areas of computer science: the tension between what researchers can think of and what people actually will use (or do use). But it's all much more obvious at WWW, perhaps just because the web is so widely used and develops so quickly and seems more like a force of nature (or, at least, an organic entity like a city or a nation) than a human-designed artifact. A lot of the time, we are just trying to keep up with its relentless development.

memex

"Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, "memex" will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory."

-- Vannevar Bush, 1945

(quoted by Udi Manber at www 2004, and by many others)

www2004: themes

themes at www2004: semantic web, learning/information extraction, search.

Smart glasses detect eye contact

New Scientist: Smart glasses detect eye contact

Now this is an interesting idea. Though even aside from the laughably ugly glasses, it only works within a meter -- and if you can't tell if someone is making eye contact with you from a meter away, you need more help than smart glasses can provide.