Beyond recognition

The idea of identifying the author of a text based on linguistic analysis of other documents they have written is not a new one. Known as forensic linguistics, these techniques have been used in many UK court cases. In the US, an analysis of two texts was used to prove that Theodore Kaczynski was the Unabomber, responsible for killing three and wounding 29 in a protracted campaign of mail bombings.

Authorship analysis has drawn on features such as how often someone makes the same grammatical or spelling mistakes, the lengths of the words they use and so on. These techniques lend themselves readily to computerisation, using software that generates data structures called n-grams: sequences of n items, such as words, that follow each other in a text. By comparing the composition and frequency of n-grams, scientists have been able to spot everything from plagiarism in college papers to the authorship of books of the Bible.
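
To make the n-gram idea concrete, here is a minimal Python sketch that counts word n-grams in two texts and scores how similar the resulting frequency profiles are. The function names and the choice of cosine similarity are assumptions made for illustration; the article does not say which comparison measure any particular system uses.

```python
from collections import Counter
from math import sqrt


def word_ngrams(text, n=2):
    """Count overlapping word n-grams in lower-cased text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile shares nothing with anything
    return dot / (norm_a * norm_b)


known = word_ngrams("I reckon that the answer is, more or less, obvious")
disputed = word_ngrams("I reckon that the result is, more or less, certain")
print(round(cosine_similarity(known, disputed), 2))
```

A score near 1 means the two texts share most of their word pairs; real systems would combine many such profiles rather than rely on a single number.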

Message parlour

Traditional forensic linguistic analysis produces statistics from any given passage of text. The Arizona team decided to visualise the patterns made by their analyses. Though there have been other attempts to create such visualisations, Abbasi and Chen felt that none was specifically geared to detecting online deception.

The basis for their new approach is principal component analysis; see below for details. A web spider collects all the messages left on a number of public web forums and stores them in a large database. Feature-extraction software then analyses the captured messages for a rich set of features. These include the use of parts of speech such as adjectives and adverbs, plus more statistical measures such as sentence complexity.
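
As a rough illustration of the feature-extraction step, the Python sketch below computes a handful of simple stylistic measures for one message. The feature names and the `extract_features` function are hypothetical, and the set is far smaller than the "rich set" the article describes; part-of-speech counts are left out because they would need a separate tagger.

```python
import re
import string


def extract_features(message):
    """Return a small dict of stylistic features for one message."""
    words = re.findall(r"[A-Za-z']+", message)
    # Treat runs of ., ! or ? as sentence boundaries (a crude assumption).
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    n_chars = max(len(message), 1)  # avoid dividing by zero on empty input
    return {
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "punctuation_rate": sum(ch in string.punctuation for ch in message) / n_chars,
        "digit_rate": sum(ch.isdigit() for ch in message) / n_chars,
    }


print(extract_features("Hi there. Go now!"))
```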

Next, a program running a sliding window algorithm generates more data from the data already extracted, like a biologist culturing bacteria from a tiny sample to see it more clearly. The result of this amplification process is the writeprint itself. The x and y coordinates for the generated data are then calculated and displayed, and an organic-looking picture emerges. Onscreen, the software splits the writeprint into six parts, each containing a unique shape. These in turn cover the distribution of word lengths, the use of punctuation and special characters, sentence structure, letter frequencies and even jargon.
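
The pipeline described above can be sketched in Python as follows. This is not the Arizona team's code: it assumes the window slides over the words of a text, uses three toy features per window, and reduces them to x and y coordinates with a plain power-iteration version of principal component analysis. All function names, the window size and the step are hypothetical.

```python
import string
from math import sqrt


def window_features(words):
    """Three toy features for one window of words."""
    text = " ".join(words)
    return [
        sum(len(w) for w in words) / len(words),                    # mean word length
        sum(ch in string.punctuation for ch in text) / len(text),   # punctuation rate
        len({w.lower() for w in words}) / len(words),               # vocabulary richness
    ]


def sliding_windows(text, size=10, step=2):
    """One feature vector per window; empty if the text is shorter than a window."""
    words = text.split()
    return [window_features(words[i:i + size])
            for i in range(0, len(words) - size + 1, step)]


def top_eigenvector(matrix, iterations=200):
    """Dominant eigenvector of a symmetric matrix, found by power iteration."""
    d = len(matrix)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project feature vectors onto their first two principal components."""
    if len(rows) < 2:
        raise ValueError("need at least two feature vectors")
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[x - m for x, m in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    axes = []
    for _ in range(2):
        v = top_eigenvector(cov)
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        axes.append(v)
        # Deflate: remove the component just found so the next pass finds the second axis.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return [tuple(sum(x * a for x, a in zip(row, axis)) for axis in axes)
            for row in centred]
```

Calling `pca_2d(sliding_windows(text))` on a long enough text yields one (x, y) point per window, which is the kind of scatter a writeprint plot would draw. Because each window overlaps the next, a single message produces many points, which is the "amplification" the article refers to.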

When presented with a sample of text that is supposed to have been written by a particular person, investigators could call up that person's writeprint, take the writeprint of the sample and compare the two, which could identify the author more easily than a fingerprint test and with a great deal of accuracy. Given an anonymous message, it's easy to generate a writeprint for it and search the database for a similar one.
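
Searching a database for a similar writeprint could look something like the Python sketch below. It assumes each stored writeprint is simply a list of (x, y) points in a shared coordinate system, and it scores similarity with a mean nearest-point distance; both the data layout and the distance measure are assumptions for illustration, not the method Abbasi and Chen describe.

```python
from math import dist


def print_distance(a, b):
    """Mean distance from each point in writeprint a to its nearest point in b."""
    return sum(min(dist(p, q) for q in b) for p in a) / len(a)


def closest_author(unknown_points, database):
    """Return (author, distance) for the stored writeprint nearest the unknown one."""
    return min(
        ((author, print_distance(unknown_points, points))
         for author, points in database.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical database mapping usernames to stored writeprint points.
database = {
    "alice": [(0.0, 0.0), (2.0, 0.0)],
    "bob": [(10.0, 10.0), (12.0, 10.0)],
}
print(closest_author([(1.0, 1.0)], database))
```

A practical system would also need a threshold on the distance, so that a message from an author who is not in the database is not attributed to whoever happens to be nearest.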

When presented with just one example of a message that's known to have come from a user, writeprints proved less effective at recognising other texts from the same author than a computer running support vector machine (SVM) pattern-recognition software. However, when given 10 or more such messages, writeprints were 100 per cent accurate, whereas the SVM system achieved around 90 per cent.

As with the other biometric profiling ideas, however, there are still problems to resolve. "[Writeprints] are constrained when dealing with shorter individual messages; that is, messages of fewer than 30 to 40 words," said Abbasi. "This is due to the minimum length needs of the sliding window algorithm." Some common written communications such as text messages are too short for sliding window algorithms to amplify accurately enough.

Expert Reviews Printed from www.expertreviews.co.uk