Beyond recognition
Posted on 19 Jan 2007 at 16:31
The idea of identifying the author of a text based on linguistic analysis of other documents they have written is not a new one. Known as forensic linguistics, these techniques have been used in many UK court cases. In the US, an analysis of two texts was used to prove that Theodore Kaczynski was the Unabomber, responsible for killing three and wounding 29 in a protracted campaign of mail bombings.
Authorship analysis has looked at features such as how often someone makes the same grammatical or spelling mistakes, the lengths of the words they use and so on. These techniques lend themselves readily to software that counts n-grams: sequences of n words that follow each other in a text. By comparing the composition and frequency of n-grams, scientists have been able to investigate everything from plagiarism in college papers to the authorship of books of the Bible.
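To make the n-gram idea concrete, here is a minimal sketch in Python. It counts word pairs (n = 2) in each text and compares the counts with cosine similarity. The similarity measure, the function names and the sample sentences are all assumptions chosen for illustration, not the method used in any of the cases mentioned above.

```python
from collections import Counter
import math
import re


def word_ngrams(text, n=2):
    """Return a Counter of word n-grams (tuples of n consecutive words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample texts, purely for illustration.
known = "I reckon the parcel will arrive on Monday, and I reckon the price is fair."
disputed = "I reckon the parcel is late, but I reckon the price is fair anyway."
unrelated = "Quarterly revenue exceeded forecasts across all regions."

same = cosine_similarity(word_ngrams(known), word_ngrams(disputed))
other = cosine_similarity(word_ngrams(known), word_ngrams(unrelated))
print(f"known vs disputed:  {same:.2f}")
print(f"known vs unrelated: {other:.2f}")
```

The two texts that share habits such as "I reckon the" score higher than the unrelated one, which shares no word pairs at all. Real systems work with far longer samples and many more features than this.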
Message parlour
Traditional forensic linguistic analysis produces statistics from any given passage of text. The Arizona team decided to visualise the patterns made by their analyses. Though there have been other attempts to create such visualisations, Abbasi and Chen felt that none was specifically geared to detecting online deception.
The basis for their new approach is principal component analysis; see below for details. A web spider accesses all the messages left on a number of public web forums and stores them in a large database. Feature-extraction software then analyses the captured messages for a rich set of features. These include parts of speech such as adjectives and adverbs, plus more statistical measures such as sentence complexity.
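The feature-extraction step can be pictured with a short Python sketch. The four features and the function name `extract_features` are assumptions made for illustration; the Arizona system's actual feature set is far richer, and part-of-speech counts are left out here because they would need a separate tagger.

```python
import re
import string


def extract_features(message):
    """Return a small dictionary of stylistic features for one message."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", message) if s.strip()]
    words = re.findall(r"[A-Za-z']+", message)
    n_chars = len(message)
    n_words = len(words)
    return {
        "word_count": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "punctuation_ratio": (
            sum(ch in string.punctuation for ch in message) / n_chars if n_chars else 0.0
        ),
    }


print(extract_features("Hi there. Go now!"))
```

Each forum message becomes a row of numbers like this, and it is those numbers, rather than the raw text, that the later stages work on.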
Next, a program running a sliding window algorithm generates more data based on the data extracted already, like a biologist culturing bacteria from a tiny sample to see it more clearly. The result of this amplification process is the writeprint itself. The x and y coordinates for the generated data are then calculated. When these coordinates are displayed, an organic-looking picture emerges. Onscreen, the software splits the writeprint into six parts, each containing a unique shape. These in turn cover the distribution of word lengths, the use of punctuation and special characters, sentence structure, letter frequencies and even jargon.
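A rough Python sketch shows how a sliding window can turn one message into many data points. The window size, the step and the two toy features (which stand in for the x and y coordinates) are assumptions for illustration; the real system's projection onto principal components is not reproduced here.

```python
import re


def window_features(words):
    """Two toy features for one window: mean word length and vocabulary richness."""
    mean_length = sum(len(w) for w in words) / len(words)
    richness = len(set(words)) / len(words)  # share of distinct words in the window
    return (mean_length, richness)


def sliding_windows(text, size=10, step=2):
    """Yield one (x, y) point per overlapping window of `size` words."""
    words = re.findall(r"[a-z']+", text.lower())
    for start in range(0, len(words) - size + 1, step):
        yield window_features(words[start:start + size])


sample = (
    "The idea of identifying the author of a text based on linguistic "
    "analysis of other documents they have written is not a new one."
)
for point in sliding_windows(sample):
    print(point)
```

Because the windows overlap, a single passage produces a whole trail of points rather than one summary figure. A message with fewer words than the window size produces no points at all.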
When presented with a sample of text that is supposed to have been written by a particular person, investigators could call up that person's writeprint and compare it with the writeprint of the sample, potentially identifying the author more easily than a fingerprint test and with a great deal of accuracy. Given an anonymous message, it's easy to generate a writeprint for it and search the database for a similar one.
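Searching a database for a similar writeprint can be sketched as a nearest-neighbour lookup. In this Python example each stored profile is reduced to a short tuple of numbers and compared by straight-line distance; the profile format, the author names and the values are all invented for illustration, and the article does not say how the real system measures similarity.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_author(profile, database):
    """Return (author, distance) for the stored profile nearest to `profile`."""
    return min(
        ((author, distance(profile, stored)) for author, stored in database.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical stored profiles: (average word length, punctuation ratio).
database = {
    "forum_user_a": (4.0, 0.06),
    "forum_user_b": (5.5, 0.02),
    "forum_user_c": (3.2, 0.11),
}

anonymous = (4.1, 0.05)
author, gap = closest_author(anonymous, database)
print(author, round(gap, 3))
```

A practical system would scale the features first, so that one with a large numeric range does not dominate the comparison.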
When presented with just one example of a message that's known to have come from a user, writeprints proved less effective at recognising other texts from the same author than a computer running support vector machine (SVM) pattern-recognition software. However, when given 10 or more such messages, writeprints were 100 per cent accurate, whereas the SVM system managed around 90 per cent.
As with the other biometric profiling ideas, however, there are still problems to resolve. "[Writeprints] are constrained when dealing with shorter individual messages; that is, messages of fewer than 30 to 40 words," said Abbasi. "This is due to the minimum length needs of the sliding window algorithm." Some common written communications such as text messages are too short for sliding window algorithms to amplify accurately enough.