Monthly Archive for October, 2005

Spreading Activation

This idea follows some research into memory that says that the probability of an event to happen is inversely proportional to the time that passed since it last happened.

The idea started from the assumption that memory contains concepts. These are associated. Those that are more ‘activated’ are easier to recall. The activation of a given concept depends on two things: the base activation of that concept and the sum of the activation of associated concepts multiplied by the strength of association.

Memory-Model

Copyright notice: the present content was taken from the following URL, the copyrights are reserved by the respective author/s.

Tags: , ,

A learner-indipendent evaluation of the usefulness of statistical phrases for automated text categorization

M. F. Caropreso, S. Matwin, and F. Sebastiani. Text Databases and Document Management: Theory and Practice, chapter A learner-indipendent evaluation of the usefulness of statistical phrases for automated text categorization, pages 78–102. Idea Group Publishing, Hershey, US, 2001. [url]

——————————

This paper present a technique and proof to use a n-grams indexing technique for Text Classification (TC). A phrase is a textual unit usually larger than a word but smaller than a fuill sentence. Phrases have a smaller degree of ambiguity than their constituents words, thanks to the mutual disambiguation effect of words.

An n-gram is an alphabetically ordered sequence of unigrams. The authors’ learner-indipendent study has shown that feature evaluation functions routinely used in the text categorization experiments tend to score many bigrams higher than unigrams that they would themeselves select in unigram-only feature selection task, sometimes giving rise to a too high bigram “penetration level”.

Tags: ,

KidPad: a collaborative authoring tool for children

KidPad is a collaborative story authoring tool for children. It provides basic drawing functionality on a zooming canvas enabled by Jazz. The narrative structure of a story is defined by creating spatial hyperlinks between objects on the canvas. Instead of using a standard WIMP (Windows, Icons, Menus, Pointer) user interface, KidPad uses local tools that can be picked up, used and dropped anywhere on the drawing surface. The local tools interface and MID, a Java library developed at the University of Maryland, allows KidPad to support shoulder-to-shoulder collaboration. If multiple USB mice are connected to the computer each mouse will control a tool in KidPad, making it possible to let several children simultaneously create a story together!

Kidpad1S

Tags: ,

Nearest Neighbor Decision Rules with Proximity Graphs

I found an interesting introductory course on this theme (via), which summarize different techniques for decision rule class membership attribution. These can be divided into parametric and non parametric depending on the fact that the choice for the assignment is computed using the probability or an euclidean metrics criteria.

Interestingly enough, besides being extremely simple to implement, the probability of error of those non-parametric variants, is sufficiently close to the Bayes (optimal) probability of error. Sergei Savchenko, presents in the document cited above, a compilation of those methods, in an order based on the efficiency and the accuracy for the information retrieval.

The first presented is the k-nearest neighbor decision rule, which classify an unknown object to the class most heavily presented in its nearest neighbors.

 ~Godfried Teaching Projects.Pr.98 Sergei Figure Figure2

The Voronoi diagram is a partition of space into regions, within which all points are closer to some particular node than to any other node.

 ~Godfried Teaching Projects.Pr.98 Sergei Figure Figure3

The Dalaunay triangulation is a dual of the Voronoi diagram. It two Voronoi regions share a boundary the nodes of those regions are connected with an edge. These nodes are called the Delaunay neighbors.

 ~Godfried Teaching Projects.Pr.98 Sergei Figure Figure4

The Gabriel graph of a series of points is a subgraph of Delaunay triangulation for that set. Gabriel editing reduces the size of the training set even more than Voronoi editing.

 ~Godfried Teaching Projects.Pr.98 Sergei Figure Figure10

Tags: ,

Indexing by Latent Semantic Analysis

S. Deerwester, S. T. Doumas, T. K. Landauer, G. W. Furnas, and R. Harshman. Indexing by latent semantic analysis. Journal of the Society for Information Science, 41(6):391–407, 1990. [url]

—————————-

This is the first technical LSI paper. It offers a good background on the previous theory and on the technical choices behind the algorithm.

Essentially, the authors developed the technique to overcome the deficiency of current information retrieval methods, that is the words searchers often use are not the same as those by which the information they seek has been indexed. The two side of the issue they describe are the synonymy and the polysemy. The former is the fact that there are many ways to refer to the same concept. The latter is the fact that most words have more than one distinct meaning.

The criteria they used to distinguish between different models are: adjustable representational richness (hierarchical clustering is too restrictive; they looked for models whose power could be varied); explicit representation of both terms and documents (as in two-mode factor analysis or tree unfolding); computational tractability for large datasets.

The latent semantic structure analysis starts with a matrix of terms by documents. Then this is analyzed by Singular Value Decomposition (SVD), to derive the particular latent semantic structure model. The process divide the original matrix into three representative matrix, which contains “eigenvectors” and “eigenvalues”, and which show a breakdown of the original data into linearly indipendent components or factors.

Three particular comparisons were of interest for the authors: the comparisons of two terms; two comparisons of two documents; and a comparison of a term and a document.

Tags: , ,

dogpile: a comparison tool for search engines

I tried an egocentric search on this web service that promise to take the best of the search engines. One of their claims is that only 1,1 % of the top links are shared between the four search engines (MSN, Google, Yahoo, AskJeeves).

Dogpile

Tags: , , ,

Using implicit feedback on STAMPS project

Today I had a meeting with my advisor on possible direction to move the project forward. We discussed the option to use a semantic distance between the messages as another dimension for the triggering of significant connections among the messages. One of the options would be to parametrize the influence of the semantic or the geographical distance on the final clustering of the information. The distortion over one of these two dimensions should be finally decided by a combination of automatically computed factors and the user feedback.

Two ways are possible to incorporate the user feedback. The first is asking for explicit feedback (i.e., Is it what you were looking for – YES —- NO); the second way is to use implicit feedback measurements to asses whether the behavior of the user is affected by the visualization and how this respond to the goodness of the visualization itself.

Tags:

low-costs revenue margins

I found this interesting information on Beppe Grillo’s blog. It is an official document by Ryanair summarizing the revenue margins for different companies where Alitalia is taken as the lowest point in the scale :-(

Here is clearly visible how the company is loosing money every single day. In the mean time the direction of the company keep having the highest salary in europe. Povera Italia!

Ryanair Leading-Margins

Tags: ,

Emergence of keywords polygons

I have been quite this period trying to figure out a way to group visually messages that refers to the same keyword. After a couple of bad attempts I tried to visualize a polygon connecting the messages referring to the same keyword. Subsequently, I tried to find the centroids of the polygon using this nice formula. Then I decided to attach the keyword to this “center of gravity”. This seems to be simple but it was damn hard, as I had to find the good sequence to connect the points of the polygon to avoid self interceptions of the lines.

Anyway, there are still some flaws in the code but here is the first rendering to comment and destroy. All comments are welcome.

Map 20051026162040

Tags: , , , , , , , ,

StickyShadows: a pervasive virtual messaging system

StickyShadows is another Location Based System that allows the user to leave a virtual message to a certain physical place. Their base concept is to give more relevance to alerts to prompt the retrieval of the information.

Tags:

3D Display Cube

James Clar has realised a three-dimensional display cube that permits to visualize three dimensional interactions or objects moving in space. I have always been fascinated by simple forms and this one it is. Maybe this simple idea can help solve some of the issues of three dimensional projectors and holography.

B2

Tags: ,

A Contextual Network Graph

Contextual Network Graph

The contextual network graph does not encode any information about grammatical or hierarchical relationships between terms.  Its structure is determined purely by term co-occurrence across the collection.

Each edge in the graph has a strength assigned to it whose magnitude depends on our choice of the local and global term weighting scheme used in generating the TDM.  The only constraint on weighting schemes is that all edge weights must fall in the interval (0,1).

From:

M. Ceglowski, A. Coburn, and J. Cuadrado. Semantic search of unstructured data using contextual network graphs. Preliminary white paper, National Institute for Technology and Liberal Education, Middlebury College, Middlebury, Vermont, 05753 USA, 2003.

Tags:

Semantic Search of Unstructured Data using Contextual Network Graphs

M. Ceglowski, A. Coburn, and J. Cuadrado. Semantic search of unstructured data using contextual network graphs. Preliminary white paper, National Institute for Technology and Liberal Education, Middlebury College, Middlebury, Vermont, 05753 USA, 2003. [url]

————————————

This paper describe the Contextual Network Graph, a technique that should solve some of the pitfalls of the latent semantic indexing, like the poor scalability of the singular value decomposition algorithm.

The authors offer an alternative interpretation of the term-document matrix (TDM), which is essentially a lookup table of term frequency data for the entire document collection. In LSI this is interpreted as a high-dimensional vector space. Alternatively, this can be seen as a bipartite graph of term and document nodes where each non-zero value in the TDM corresponds to an edge connecting a term node to a document node. In this way, every term is connected to all of the documents in which the term appears, and every document has a link to each term contained in the document. The authors call this a contextual network graph.

This construct correspond to the intuition that documents that share many rare tems are likely to be semantically related.

Their idea is to use this representation to search a document collection energizing a query node and allowing the energy to propagate to other nodes along the edges of the graph based on a set of simple rules.

Their results show comparable results to an LSI search engine. In a 1981 dissertation at the University of Illinois, Scott Preece describes an almost identical technique under the name spreading activation search.

Tags: , , ,

A couple of directions to move forward my project

At this stage I feel pushed more by exploration and curiosity than a precise strategy. I know where I have to go and which point I have to reach, but on the exact path to take things are still uncertain. Now I can see a couple of options:

step zero -> Computing the semantic distance matrix using Latent Semantic Indexing.

1. Using this matrix to sparsify the gabriel graph of the geographic layer;

2. Merging the semantic and the geometrical matrix to be used in an ad-hoc clustering algorithm;

3. Computing the semantic or geometric clustering and apply the other matrix to mining the results.

Tags:

My first beta-skeleton

Finally I managed to compute the b-skeleton of the messages left of the campus of EPFL. The idea is to use this visualization to find the messages that are more close to the others. Additionally this graph can be sparsified, meaning that some long / irrelevant connections can be removed, clustering the zones in sub-graphs. This technique is also useful to find outliers.

Beta skeletons: A continuous family of proximity graphs proposed by Kirkpatrick and Radke [*]. These graphs are defined by a positive real parameter Beta. Lower values of Beta give denser graphs (with more edges). There are two types of Beta-skeletons: lune-based (the neighborhood is an intersection of discs), and circle-based (the neighborhood is a union of discs). For Beta=1 the skeleton is in fact the Gabriel graph, and for Beta=2 the (lune-based) skeleton is the Relative Neighborhood graph.

Map 20051019151507

[*] DB Kirkpatrick and JD Radke.

A framework for computational morphology.

In GT Touissant, editor, Computational Geometry, pages 217-248. Elsevier, 1985.

Tags: , ,