Wednesday, October 26, 2005

Meeting Media Monitoring Users

I've had a chance to meet with some clients over the past few weeks to talk about how they are using media monitoring tools (both ours and those from other companies) and how they're working the metrics they gather into their information-employees' workflows. It's always interesting to get the perspective of those who are buying the tools Factiva and others are touting. It helps focus us on delivering value, not just building tools.

I really feel strongly that market validation is a vital piece of product development. Typically when you talk to users you'll hear feedback that's not unexpected. But what keeps it interesting is that you always get some comment, some new perspective, some advice you weren't expecting.

It seems that for every client, there's a unique use case.

Friday, October 21, 2005

Who Are These Sploggers, Anyway?

An interesting post over at the Intelliseek blog about the growing problem of blog spam, written by Intelliseek employee Robert Stockton, who describes common behaviors and methods of this growing menace.

I'm not sure we need another word to describe them though. "Sploggers"? Ugh. But from a linguistic perspective, it's fascinating how words form so quickly in cyberspace.
web log > weblog > blog > blogging > blogger > spam + blog = splog > sploggers

Tuesday, October 18, 2005

BlogOn: The Oft-Mentioned Long Tail

"The long tail" was mentioned at BlogOn 2005 several times by presenters and overheard in the hallway as well. One of the first to talk about this concept as applied to the Blogosphere was Clay Shirky.

Basically, this is the idea that most of the traffic in the blogosphere comes from a very small number of authors, while a very large number of authors (the long tail) each create, on average, a small number of posts.

This idea is tied to Zipf's law, named after George Kingsley Zipf, a Harvard linguistics professor. Jakob Nielsen also recently wrote about Zipf curves.

However, it was pointed out by presenter David Weinberger that the area under the long tail is larger than the area under the large head, as it were. Which means... what exactly?
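One way to get a feel for that claim is a back-of-the-envelope calculation. The sketch below assumes an idealized Zipf distribution in which the author at rank r gets a share proportional to 1/r. The author count, the size of the "head" and the exponent are all my own illustrative assumptions, not figures from the conference, and the result flips if you draw the head/tail line somewhere else.

```python
def zipf_head_tail(num_authors, head_size, exponent=1.0):
    """Return (head_total, tail_total) for an idealized Zipf distribution.

    The author at rank r is given weight 1 / r**exponent. The "head" is the
    top `head_size` ranks; the "tail" is everyone else.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, num_authors + 1)]
    head_total = sum(weights[:head_size])
    tail_total = sum(weights[head_size:])
    return head_total, tail_total


# Hypothetical numbers: one million authors, the top 100 counted as the head.
head, tail = zipf_head_tail(num_authors=1_000_000, head_size=100)
total = head + tail
print(f"head share: {head / total:.0%}, tail share: {tail / total:.0%}")
```

Under these assumptions the combined tail outweighs the head, even though any single author in the tail is negligible next to any single author in the head.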

BlogOn: Podcasting to Text

The question of whether there is value in having podcasts transcribed came up during a panel discussion at the BlogOn 2005 conference. The panelist, Michael Geoghegan of Willnick Productions, when asked about the value of speech-to-text for podcasts, said he saw no value in it because his podcasts are meant to be heard and the emotion that comes across in his voice would be lost in transcript form.

I think he's missing the point. Once podcasts are transcribed, they can be searched and text mined. That gives podcasts uses beyond their otherwise limited distribution. Without the ability to search or mine podcasts, most of their usage is going to come from browsing and category searching. For example, if I search for "Pinot Gris" in a podcast search engine, I will likely miss a podcast that mentions "pinot gris" because the podcast description might not mention specific grapes and wines.
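To make the Pinot Gris example concrete, here is a minimal sketch of the difference between searching descriptions only and searching descriptions plus transcripts. The catalog, field names and `search` function are all hypothetical, made up for illustration; a real podcast search engine would use a proper index rather than substring matching.

```python
# Hypothetical catalog: each podcast has a short description and a transcript.
podcasts = [
    {
        "title": "Wine Talk, Episode 12",
        "description": "Our picks for summer whites.",
        "transcript": "Tonight we open a pinot gris from Oregon and compare notes.",
    },
    {
        "title": "Grill Night",
        "description": "Tips for cooking over charcoal.",
        "transcript": "Start with a clean grate and a two-zone fire.",
    },
]


def search(catalog, query, fields):
    """Return titles of podcasts whose chosen fields contain the query.

    Matching is a simple case-insensitive substring test.
    """
    needle = query.lower()
    return [
        podcast["title"]
        for podcast in catalog
        if any(needle in podcast[field].lower() for field in fields)
    ]


print(search(podcasts, "Pinot Gris", ["description"]))
print(search(podcasts, "Pinot Gris", ["description", "transcript"]))
```

The description-only search comes back empty, while the search that includes transcripts finds the wine episode.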

BlogOn 2005: A Diverse Attendee List

I'm at day two of BlogOn 2005 in New York City. Presentations and panels have leaned toward discussing how the Blogosphere and the business world are coming together.

The attendees are quite diverse. There seems to be a mix of Blog geeks and newcomers to the space ("what's podcasting?" one attendee asked a panel). One woman I met has never posted to a blog before but was here because her boss instructed her to find out more about the industry.

Many people here seem to be vendors, industry analysts and journalists. And there are a surprising (to me) number of PR, marketing and advertising professionals here. It seems those industries are trying to get up to speed quickly on what this growing internet-based conversation is all about.

I think the conference will have to evolve next year to be more useful. The topic of "blogs" is too vague to support a well-focused show.

Thursday, October 13, 2005

Google and Information Extraction

Google "information extraction" and what's the No. 1 sponsored link?

Work at Google

Google is hiring expert computer scientists and software developers!
www.google.com/jobs


I've never really thought about Google being a player in the information extraction sector (aka entity extraction, text analytics, text mining). Sure, there's lots of talk about what's the next big thing for them -- free wireless access, video search, indexing the world's libraries. It's fun to think about that stuff.

But when it comes to improving their bread and butter -- search -- mostly I picture their focus being on refining their ranking algorithms and optimizing their crawling strategies. But on their way to being the one-stop shop for all information, I guess it should have been obvious to me that text mining would be a station on that route.

One place we see TM showing up clearly is in Google News, with the "In the News" list of oft-used phrases of the day. I'm sure there are many more examples.

Monday, October 10, 2005

Lies, damn lies and text mining statistics

LA Times writer Brendan Buhler took TV newswriters to task Sunday for their overuse of the phrase "get a handle". OK. I hadn't noticed such a growing phenomenon, but no matter.

He proved his claim by running a LexisNexis search on the phrase over five years and said its use is rising every year. ("It was in 3,504 stories in 2004, nearly 700 more than 2000.")

To me, this is manufacturing truth where none exists through a quick and careless use of text mining. I have two concerns:

1) What was the context of these references? I searched "get a handle" in Factiva and found several mentions of that phrase in an oft-cited direct quote ("Firefighters were able to get a handle on this early on," said Capt. Jason Neuman of the California Department of Forestry and Fire Protection.) Does that make the phrase more common, or is it just a function of the phrase being replicated by the distribution of AP wire copy?

2) Did Mr. Buhler account for any changes in the universe of publications and/or documents over that time period? The number of mentions in one year versus another needs to be compared to the total documents in each year. When I ran the "get a handle" search in Factiva's top 50 U.S. Newspapers (a more controlled group) and then compared it to all documents each year in that group, I found the rate of mentions of the phrase rather flat year on year.
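The second point is easy to sketch: divide each year's mentions by the total number of documents for that year before comparing. Every number below is made up for illustration (these are not the LexisNexis or Factiva figures), and `mentions_per_thousand` is just a name I picked.

```python
def mentions_per_thousand(mentions, total_documents):
    """Return how many documents per 1,000 contain the phrase."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return mentions * 1000 / total_documents


# Hypothetical counts: year -> (documents mentioning the phrase, all documents).
yearly_counts = {
    2000: (2800, 1_000_000),
    2004: (3500, 1_250_000),
}

for year, (mentions, total) in sorted(yearly_counts.items()):
    rate = mentions_per_thousand(mentions, total)
    print(f"{year}: {mentions} mentions, {rate:.2f} per 1,000 documents")
```

In this invented example the raw count climbs by 700 while the rate per 1,000 documents does not move at all, because the pool of documents grew by the same proportion.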

Ah. Lies, damn lies and statistics.

Saturday, October 08, 2005

Google News: A study in text mining

You've got to wonder how much large media companies like CNN and the BBC, which each have their own firm presence on the Web, hate Google News. Sure, GN is pointing readers back to their sites, so they get the eyeballs. But a user is just as likely to click through to kentucky.com as to CNN.com (though there's evidence of late that GN is weighting the top sources higher and kentucky.com is less likely to be the lead link), when previously that same reader would have just typed cnn.com into their browser and gotten their news from one source.

You get the feeling the relationship might be one of smiling through gritted teeth.

Google News is using the power of text mining to leverage the editorial might of many editors and newsrooms. CNN, AP, NY Times and FT are all making decisions about which news item is most important and arranging their landing pages accordingly. They're paying human editors to make these value judgments. GN comes along and in the aggregate scoops up all this knowledge (text mining!) and creates a viable competitor for the best news sites out there out of whole cloth.

The irony is that GN needs its news providers for the knowledge of what's most important. So it needs the CNNs of the world to remain successful so it can keep feeding off them. CNN's ad-supported model needs the clicks. Symbiotic? Perhaps, but I think Google is benefiting more.

Tuesday, October 04, 2005

More Forum Follow-Up

Here's some perspective from Christopher Kenton, SVP of Strategic Planning at GlobalFluency, on his participation in a panel discussion at Factiva Forum last month. The session was called Blogs and RSS: Friend or Foe, Fad or Future. Also on the panel were Sandy Hamilton, EVP of Sales and Marketing at NewsGator; James Brancheau, Managing VP at Gartner; and David Scott, author of "Cashing in With Content". Chris's summary: the panel presented a "broad and optimistic view of the power and utility of unstructured and unfiltered content, even in the face of significant challenges."