Data, information, knowledge, and processing

1.1 Data, information and knowledge

Data:

Data is a collection of raw facts and figures that require processing before they can be interpreted. Data has no meaning until it is processed and given a context.

For example, the number 5 on its own is data.

Information:

Information is data that is presented within a context after it has been given meaning, relevance and purpose.

Over time, data can be contextualized, categorized, calculated and condensed for a specific purpose, turning it into valuable information.

For example, the context could be prime numbers: 5 is a prime number.

Knowledge:

Knowledge is information to which human experience has been applied. It requires a person to understand what the information means, based on their experience and knowledge base.

To know that 5 is a prime number, one must know that a prime number has exactly two factors (1 and itself). This understanding is knowledge.

1.2 Sources of data

Data sources can be classified into two types: static and dynamic.

Static data:

Data that does not normally change.
Static means ‘still’.
It is either fixed or has to be changed manually by editing the document.

Dynamic data:

Data that changes automatically without user intervention.
Dynamic means ‘moving’.
It is data that updates as a result of the source data changing.

Static data sources compared with dynamic data sources

Static information source	Dynamic information source
The information does not change on a regular basis.	Information is updated automatically when the original data changes
The information can go out of date quickly because it is not designed to be changed on a regular basis.	It is most likely to be up to date as it changes automatically based on the source data.
The information can be viewed offline because live data is not required.	An internet or network connection to the source data is required, which can be costly and can also be slow in remote areas.
It is more likely to be accurate because time will have been taken to check the information being published, as it will be available for a long period of time.	The data may have been produced very quickly and so may contain errors.

Direct data source

Data that is collected for the purpose for which it will be used.

Data collected from a direct data source (primary source) must be used for the same purpose for which it was collected.

The data must not already exist for another purpose.

When collecting data, the person collecting it should know the purpose for which they intend to use the data.

Indirect data source

Data that is collected for a different purpose (secondary source)

Data collected from an indirect source already existed for another purpose.

Direct data source	Indirect data source
The data will be relevant because what is needed has been collected.	Additional data that is not required will exist that may take time to sort through and some data that is required may not exist.
The original source is known and so can be trusted.	The original source may not be known and so it can’t be assumed that it is reliable.
It can take a long time to gather original data rather than use data that already exists.	The data is immediately available.
A large sample of statistical data can be difficult to collect for one-off purposes.	If statistical analysis is required, then there are more likely to be large samples available.
The data is likely to be up to date because it has been collected recently.	Data may be out of date because it was collected at a different time.
Bias can be eliminated by asking specific questions.	Original data may be biased due to its source.
The data can be collected and presented in the format required.	The data is unlikely to be in the format required, which may make extracting the data difficult.

1.3 Quality of information

The factors that affect the quality of information are:

Accuracy
Relevance
Age
Level of detail
Completeness

1.4 Coding, encoding and encrypting data

Coding:

Representing data by assigning a code to it for classification or identification

There are a number of reasons for coding data.

Advantages and disadvantages of coding data (key points)

Advantages	Disadvantages
Presentation	Limited codes
Storage	Interpretation
Speed of input	Similarity
Processing	Efficiency
Validation	Missing information
Confidentiality
Consistency
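
As a small illustration of coding, the sketch below stores a colour as a short code and translates between the code and its meaning. The code table `COLOUR_CODES` and both function names are invented for this example and do not come from any standard.

```python
# Hypothetical code table: each short code stands for a longer description.
COLOUR_CODES = {"BK": "Black", "WH": "White", "RD": "Red", "DB": "Dark blue"}


def code_to_meaning(code):
    """Return the meaning of a code, or None if the code is not defined."""
    return COLOUR_CODES.get(code.upper())


def meaning_to_code(description):
    """Return the code for a description, or None if no code exists for it."""
    for code, meaning in COLOUR_CODES.items():
        if meaning.lower() == description.lower():
            return code
    return None


print(code_to_meaning("RD"))         # Red
print(meaning_to_code("Dark blue"))  # DB
print(meaning_to_code("Turquoise"))  # None
```

Storing `DB` instead of `Dark blue` saves storage and speeds up input. The missing code for turquoise shows the "limited codes" disadvantage: a value with no code cannot be recorded.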

Encoding:

Storing data in a specific format

Computers do not recognize text, sound and images in the same way we do. Computers use binary digits which are 1s and 0s. One means on and zero means off.

Codecs are programs that are used to encode data for images, audio and video. The codecs are also needed to read the data.

There are several types of encoding.

Text

Text is encoded as a number that is then represented by a binary number. A common encoding method is ASCII (American Standard Code for Information Interchange). Other encoding methods include Unicode and EBCDIC.
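
The idea that each character becomes a number, which is then held in binary, can be sketched in a few lines of Python. The function name `text_to_binary` is made up for this example. ASCII itself is a 7-bit code, and the sketch shows 8 bits per character on the assumption that each character is stored in one byte.

```python
def text_to_binary(text):
    """Encode text as ASCII and return each character's 8-bit binary string."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


for char, bits in zip("Hi", text_to_binary("Hi")):
    print(char, ord(char), bits)
# H 72 01001000
# i 105 01101001
```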

Images

Encoding is also used to store images. At the most basic level, images are encoded as bitmaps.
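
A bitmap can be pictured as a grid of numbers, one per pixel. The sketch below uses an invented 5 by 5 monochrome image in which each pixel takes a single bit. The names `BITMAP`, `render` and `size_in_bits` are assumptions for illustration, and real bitmap files also store a header, which is ignored here.

```python
# A hypothetical 5x5 monochrome bitmap: 1 = pixel on, 0 = pixel off.
BITMAP = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]


def render(bitmap):
    """Turn the grid of bits into text so the picture can be seen."""
    return "\n".join("".join("#" if bit else "." for bit in row) for row in bitmap)


def size_in_bits(bitmap, bits_per_pixel=1):
    """Storage needed for the pixel data alone (no file header)."""
    return len(bitmap) * len(bitmap[0]) * bits_per_pixel


print(render(BITMAP))
print(size_in_bits(BITMAP))      # 25
print(size_in_bits(BITMAP, 24))  # 600
```

Moving from 1 bit per pixel to 24 bits per pixel (a common assumption for full colour) multiplies the storage needed by 24.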

Sound

Sound is encoded by storing the sample rate, bit depth and bit rate.
When sound is recorded, it is converted from analogue to digital format, which is broken down into thousands of samples per second.
The sample rate (or sampling frequency) is the number of audio samples taken per second, measured in hertz (Hz).
The bit depth is the number of bits (1s and 0s) used for each sample.
The bit rate is the number of bits processed every second, measured in kilobits per second (kbps).
When sound is encoded in an uncompressed format it is usually saved as WAV (Waveform Audio File Format).
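
The three quantities are linked by simple arithmetic. Taking the bit depth as the number of bits stored for each sample, the uncompressed bit rate is the sample rate multiplied by the bit depth and by the number of channels. The sketch below works this out for CD-style audio (44,100 Hz, 16 bits, two channels). These figures and the `channels` parameter are assumptions added for the example, not values from the notes.

```python
def bit_rate_kbps(sample_rate_hz, bit_depth, channels=2):
    """Uncompressed bit rate in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000


def file_size_bytes(sample_rate_hz, bit_depth, seconds, channels=2):
    """Approximate size of the uncompressed audio data in bytes."""
    return sample_rate_hz * bit_depth * channels * seconds // 8


print(bit_rate_kbps(44_100, 16))        # 1411.2
print(file_size_bytes(44_100, 16, 60))  # 10584000
```

One minute of uncompressed stereo audio at these settings therefore needs roughly 10.6 million bytes, which is why compression is so often applied.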

There are two ways of compressing and storing a sound file:

Lossless compression: reduces the file size without losing any quality, but can typically only reduce the file to about 50% of its original size.
Lossy compression: reduces the file size by reducing the bit rate, causing some loss in quality.
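
The defining property of lossless compression is that the original data can be rebuilt exactly. The sketch below demonstrates that property with Python's general-purpose `zlib` module. This is not an audio codec, and the repetitive sample data is invented, so the compression ratio it prints says nothing about real sound files.

```python
import zlib

# Repetitive made-up data; real audio would compress far less well.
original = b"la la la la la la la la " * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))
print(restored == original)  # True: nothing was lost
```

A lossy method would produce a smaller file still, but `restored == original` would no longer hold.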

Video

Video encoding requires storage of both images and sound.
Images are stored as frames. A standard quality video would normally have 24 frames per second (fps).
The higher the number of frames per second, the more storage is required, but the higher the quality of the video will be.
The size of the image is also important. An HD video will have an image size of 1920 pixels wide and 1080 pixels high. The larger the image size, the more storage is required.
The bit rate for video is the number of bits that need to be processed every second, combining both the audio and the frames. A higher frame rate requires a higher bit rate.
Videos are usually stored using lossy compression. A common lossy format is MP4, created by MPEG (Moving Picture Experts Group). Lossy video compression usually involves reducing the resolution, image size or bit rate.
There are also lossless compression methods such as digital video (DV).
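
To see why video is almost always compressed, it helps to estimate the bit rate of the frames before any compression. The sketch below multiplies image width, image height, bits per pixel and frames per second. The 24 bits per pixel default is an assumption for full-colour frames, and the audio track is left out.

```python
def raw_video_bit_rate(width, height, fps, bits_per_pixel=24):
    """Bits per second for uncompressed frames (audio not included)."""
    return width * height * bits_per_pixel * fps


hd = raw_video_bit_rate(1920, 1080, 24)
print(hd)                     # 1194393600
print(round(hd / 1_000_000))  # 1194 (megabits per second)
```

Doubling the frame rate to 48 fps doubles this figure, which matches the point that a higher frame rate requires a higher bit rate.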

Encryption

One specific type of encoding is encryption.
This is when data is scrambled so that it cannot be understood.
The purpose of encryption is to make the data difficult or impossible to read if it is accessed by an unauthorized user.
Data can be encrypted when it is stored on disks or other storage media, or it can be encrypted when it is sent across a network such as a local area network or the internet.
Accessing encrypted data legitimately is known as decryption.

Caesar cipher

A cipher is a secret way of writing. In other words, it is a code.
Ciphers are used to convert a message into an encrypted message. A cipher is a special type of algorithm that defines the set of rules to follow to encrypt a message.
The Caesar cipher is sometimes known as a shift cipher because it selects replacement letters by shifting along the alphabet.
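
The shifting rule can be written as a short function. This is a minimal sketch that assumes the 26-letter English alphabet, converts the message to upper case and leaves spaces and punctuation unchanged. The function name `caesar` and the shift of 3 are choices made for the example.

```python
import string

ALPHABET = string.ascii_uppercase


def caesar(message, shift):
    """Shift each letter along the alphabet; other characters are left unchanged."""
    result = []
    for char in message.upper():
        if char in ALPHABET:
            index = (ALPHABET.index(char) + shift) % 26
            result.append(ALPHABET[index])
        else:
            result.append(char)
    return "".join(result)


secret = caesar("HELLO WORLD", 3)
print(secret)              # KHOOR ZRUOG
print(caesar(secret, -3))  # HELLO WORLD
```

Decryption is the same operation with the shift reversed, so anyone who knows the shift can read the message. With only 25 useful shifts, the cipher is easy to break by trying them all.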

Symmetric encryption

This is the oldest method of encryption.
It requires both the sender and recipient to possess the secret encryption and decryption key.
With symmetric encryption, the secret key needs to be sent to the recipient.
This could be done at a separate time, but it still has to be transmitted whether by post or over the internet and it could be intercepted.

Asymmetric encryption

Asymmetric encryption is also known as public-key cryptography.
Asymmetric encryption overcomes the problem of symmetric encryption keys being intercepted by using a pair of keys.
This will include a public key which is available to anybody wanting to send data, and a private key that is known only to the recipient.
The key is the value the encryption algorithm needs in order to encrypt or decrypt the data.
This method requires a lot more processing than symmetric encryption and so it takes longer to decrypt the data.
In order to find a public key, digital certificates are required which identify the user or server and provide the public key. A digital certificate is unique to each user or server. A digital certificate usually includes: organization name, organization that issued the certificate, user’s email address, user’s country, user’s public key.
When encrypted data is required by a recipient, the computer will request the digital certificate from the sender. The public key can be found within the digital certificate.
Asymmetric encryption is used for Secure Sockets Layer (SSL) which is the security method used for secure websites. Transport Layer Security (TLS) has superseded SSL but they are both often referred to as SSL.
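
The public and private key pair can be illustrated with a toy version of RSA, one well-known asymmetric scheme that these notes do not name. The primes and exponent below are deliberately tiny so the arithmetic is easy to follow. Real keys use far larger numbers and add padding, so this sketch is for understanding only and is not secure. It needs Python 3.8 or later for the modular inverse.

```python
# Toy key generation with tiny primes (illustrative values only).
p, q = 61, 53
n = p * q                # 3233, shared by both keys
phi = (p - 1) * (q - 1)  # 3120
e = 17                   # public exponent, shares no factor with phi
d = pow(e, -1, phi)      # private exponent: modular inverse of e

public_key = (e, n)      # can be given to anybody
private_key = (d, n)     # known only to the recipient


def apply_key(number, key):
    """Raise the number to the key's exponent, modulo n."""
    exponent, modulus = key
    return pow(number, exponent, modulus)


ciphertext = apply_key(65, public_key)
print(ciphertext)                          # 2790
print(apply_key(ciphertext, private_key))  # 65
```

Anyone can encrypt with the public key, but only the holder of the private key can reverse the operation, so no secret key ever has to be transmitted.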

SSL & TLS

All the major web browsers currently in use support TLS.
TLS is the successor to SSL, which is being phased out. TLS and SSL are protocols that provide security of communication in a network.
TLS/SSL are used in web browsing, email, internet faxing, instant messaging and Voice over IP (VoIP).
Client-server applications use TLS in a network to try to prevent eavesdropping.
Encryption protocols enable credit card payments to be made securely.
SSL/TLS requires a handshake to be carried out between the client and the server before data is exchanged.

1.5 Checking the accuracy of data

There are two methods for checking the accuracy of data:

Validation: the process of checking that data matches acceptable rules.
Verification: the process of checking that data entered into the system matches the original source.

Validation

The main purpose of validation is to ensure data is sensible and conforms to the defined rules.

There are various methods of validation. These include:

Presence check: ensures that data has been entered.
Range check: ensures data is within a defined range. It has two boundaries: the lower boundary and the upper boundary.
Type check: ensures data is of a defined data type.
Length check: ensures that data is of a defined length or within a range of lengths.
Format check: ensures data matches a defined format.
Check digit: an extra digit calculated from the other digits in a number, which is recalculated on entry to detect input errors.
Lookup check: checks whether data exists in a list.
Consistency check: compares data in one field with data in another field that already exists within the record, to check that they are consistent.
Limit check: ensures data is within a defined limit. It has one boundary: either the highest possible value or the lowest possible value.
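
Most of the checks listed above reduce to a one-line test. The sketch below gives one possible Python function for each. All function names, the example format `[A-Z]{2}\d{4}` and the sample values are invented for illustration. The check digit example uses the EAN-13 barcode scheme as one concrete method. A consistency check is left out because it depends entirely on which fields are being compared.

```python
import re


def presence_check(value):
    """Data must have been entered."""
    return value is not None and str(value).strip() != ""


def range_check(value, lower, upper):
    """Data must lie between two boundaries (inclusive)."""
    return lower <= value <= upper


def limit_check(value, upper):
    """Data has one boundary, here a highest allowed value."""
    return value <= upper


def type_check(value, expected_type):
    """Data must be of the defined data type."""
    return isinstance(value, expected_type)


def length_check(value, min_length, max_length):
    """Data must have a length within the allowed range."""
    return min_length <= len(value) <= max_length


def format_check(value, pattern):
    """Data must match a defined format, given as a regular expression."""
    return re.fullmatch(pattern, value) is not None


def lookup_check(value, allowed):
    """Data must exist in a list of allowed values."""
    return value in allowed


def ean13_check_digit(first_12_digits):
    """Calculate the EAN-13 check digit from the first 12 digits."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first_12_digits))
    return (10 - total % 10) % 10


print(range_check(17, 11, 18))                   # True
print(format_check("AB1234", r"[A-Z]{2}\d{4}"))  # True
print(lookup_check("Z", ["A", "B", "C"]))        # False
print(ean13_check_digit("400638133393"))         # 1
```

Passing these checks only shows that the data is sensible. A date of birth can be within range and correctly formatted and still be the wrong date.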

Verification

Verification is the process of checking that the data entered into the system matches the original source.

There are two methods that we use to verify data:

Double data entry: data is input into the system twice and the two entries are compared for consistency.
Visual verification: visually checking that data entered into the system matches the original source, by reading and comparing (usually done by the user).
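
Double data entry is simple to express in code: the value is accepted only when both entries agree. The function name and the sample values below are made up for this sketch.

```python
def double_entry_matches(first_entry, second_entry):
    """Return True when both entries are identical, so the data can be accepted."""
    return first_entry == second_entry


print(double_entry_matches("secret123", "secret123"))  # True
print(double_entry_matches("secret123", "secert123"))  # False
```

The second call catches a typing error because the two entries differ. The method cannot catch an error that is made identically both times.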

By using both validation and verification, the chances of entering incorrect data are reduced.

Proof reading

Proof reading is the process of checking information for spelling errors, grammar errors, formatting and accuracy.
