Veri Madenciligi – Data Mining SOLUTION HW1 Veri Onisleme

Veri Madenciligi – Data Mining
SOLUTION HW1
Veri Onisleme – Data Preprocessing
Dr. Cengiz Orencik
October 11, 2015
1. Suppose that the data for analysis includes the attribute age. The age
values for the data are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21,
22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What are the mean and median of the data?
Number of items: N = 27.
\[ \text{mean} = \frac{13 + 15 + 16 + 16 + 19 + 20 + 20 + \dots + 35 + 36 + 40 + 45 + 46 + 52 + 70}{27} = \frac{809}{27} \approx 29.96 \]
As there are 27 data items, the median is the 14th one, which is 25.
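The mean and median above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names (`ages`, `age_mean`, `age_median`) are illustrative and not part of the original solution.

```python
from statistics import mean, median

# Age values from question 1, already in increasing order.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

n = len(ages)               # number of items
age_mean = mean(ages)       # sum of the values divided by n
age_median = median(ages)   # middle element, since n is odd

print(n, round(age_mean, 2), age_median)
```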
(b) What is the mode of the data? Comment on its modality (i.e., unimodal, trimodal, etc.).
25 and 35 each occur 4 times, and every other value occurs at most twice,
so both 25 and 35 are mode values. Having two different modes means the
data is bimodal (or, more generally, multimodal).
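The frequency count behind this answer can be reproduced as follows. This is a small sketch, assuming Python 3.8 or later for `statistics.multimode`; the names `counts` and `modes` are illustrative.

```python
from collections import Counter
from statistics import multimode

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

counts = Counter(ages)      # value -> number of occurrences
modes = multimode(ages)     # every value that reaches the highest count

print(modes, [counts[m] for m in modes])
print("bimodal" if len(modes) == 2 else f"{len(modes)} mode(s)")
```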
(c) What is the midrange of the data?
\[ \text{midrange} = \text{avg}(\min, \max) = \frac{13 + 70}{2} = 41.5 \]
(d) Find (roughly) the first quartile (Q1) and the third quartile (Q3) of
the data.
As we have 27 elements, we may roughly take the 7th smallest element as
the first quartile and the 7th largest (21st smallest) element as the third
quartile. The 6th or 8th elements are also fine, or you may take the
average of the 6th and 7th. So the first quartile (Q1) is 20 and the third
quartile (Q3) is 35.
(e) Does the data contain any outlier values? Explain.
First calculate 1.5 IQR, which is 1.5 × (35 − 20) = 22.5
The lower bound is Q1 − 1.5 IQR = 20 − 22.5 = −2.5
The upper bound is Q3 + 1.5 IQR = 35 + 22.5 = 57.5
Any value smaller than −2.5 or larger than 57.5 is an outlier. So 70
is the only outlier value in the data.
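The 1.5 IQR rule used here can be written out as a short check. This sketch takes Q1 = 20 and Q3 = 35 from part (d) as given; the names `lower`, `upper` and `outliers` are illustrative.

```python
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

q1, q3 = 20, 35             # rough quartiles from part (d)
iqr = q3 - q1               # interquartile range
lower = q1 - 1.5 * iqr      # values below this bound are outliers
upper = q3 + 1.5 * iqr      # values above this bound are outliers

outliers = [x for x in ages if x < lower or x > upper]
print(lower, upper, outliers)
```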
(f) Show a boxplot of the data.
Five-value representation: min (no lower than Q1 − 1.5 IQR), max (no higher
than Q3 + 1.5 IQR), Q1, median, Q3.
[Figure: boxplot of the age data; end points labeled 13 and 57.5, horizontal axis from 10 to 60.]
(g) What is the standard deviation (σ) of the data?
\[ \sigma^2 = \frac{\sum_i (x_i - 29.96)^2}{27} = 161.3 \]
\[ \text{standard deviation } (\sigma) = \sqrt{161.3} = 12.7 \]
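The formula divides by N = 27, so it is the population variance rather than the sample variance. A minimal sketch with the standard `statistics` module, assuming the population versions `pvariance` and `pstdev` are the intended ones:

```python
from statistics import pstdev, pvariance

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

variance = pvariance(ages)  # mean squared deviation from the mean (divide by N)
sigma = pstdev(ages)        # square root of the population variance

print(round(variance, 1), round(sigma, 1))
```

Using `variance` and `stdev` instead would divide by N − 1 and give slightly larger values.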
2. Suppose a hospital tested the age and body fat data for 18 randomly
selected adults with the following results:
age    % fat      age    % fat
23      9.5       52     34.6
23     26.5       54     42.5
27      7.8       54     28.8
27     17.8       56     33.4
39     31.4       57     30.2
41     25.9       58     34.1
47     27.4       58     32.9
49     27.2       60     41.2
50     31.2       61     35.7
(a) Draw the boxplots for age and % fat (separately).
Age is already sorted; the fat values need to be sorted:
sorted fat = {7.8, 9.5, 17.8, 25.9, 26.5, 27.2, 27.4, 28.8, 30.2,
31.2, 31.4, 32.9, 33.4, 34.1, 34.6, 35.7, 41.2, 42.5}
median_age = avg(50, 52) = 51
median_fat = avg(30.2, 31.2) = 30.7
Quartiles (roughly the 5th element from the start and from the end):
Q1_age = 39, Q1_fat = 26.5, Q3_age = 57, Q3_fat = 34.1
There is no outlier for age, but for fat rate there are outliers:
1.5 IQR = (34.1 − 26.5) × 1.5 = 11.4
The lower bound is 26.5 − 11.4 = 15.1. Anything lower (here 7.8 and 9.5)
is an outlier.
[Figure: boxplot of AGE; end points labeled 23 and 61, horizontal axis from 20 to 60.]
[Figure: boxplot of FAT RATE; end points labeled 15.1 and 42.5, horizontal axis from 15 to 45.]
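The values behind the two boxplots can be recomputed with a small helper. This is a sketch under the solution's own rough rule (quartiles taken as the 5th element from each end of the sorted data); the function name `rough_box_stats` and its parameter `k` are hypothetical and only used for illustration.

```python
from statistics import median

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def rough_box_stats(values, k=5):
    """Median, rough quartiles (k-th from each end), 1.5 IQR bounds, outliers."""
    s = sorted(values)
    q1, q3 = s[k - 1], s[-k]
    step = 1.5 * (q3 - q1)
    lower, upper = q1 - step, q3 + step
    return {
        "median": median(s),
        "q1": q1,
        "q3": q3,
        "lower": lower,
        "upper": upper,
        "outliers": [x for x in s if x < lower or x > upper],
    }

age_stats = rough_box_stats(age)
fat_stats = rough_box_stats(fat)
print(age_stats)
print(fat_stats)
```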
(b) Calculate the correlation coefficient r. Are these two attributes positively or negatively correlated?
\[ r_{age,fat} = \frac{\sum_i (age_i - mean_{age})(fat_i - mean_{fat})}{N \, \sigma_{age} \, \sigma_{fat}} \]
mean_age = 46.44, mean_fat = 28.78, σ_age = 12.85, σ_fat = 8.99, N = 18
\[ r_{age,fat} = \frac{1700.33}{18 \times 12.85 \times 8.99} = 0.818 \]
As 0.818 is positive and close to 1, we can claim that age and fat rate
are strongly positively correlated.
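The correlation formula can be evaluated step by step in plain Python. This is a minimal sketch that uses population standard deviations (dividing by N), matching the formula in the solution; all variable names are illustrative.

```python
from math import sqrt
from statistics import fmean

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

n = len(age)
mean_age, mean_fat = fmean(age), fmean(fat)

# Numerator: sum of products of deviations from the two means.
cov_sum = sum((a - mean_age) * (f - mean_fat) for a, f in zip(age, fat))

# Population standard deviations (divide by N, as in the formula).
sd_age = sqrt(sum((a - mean_age) ** 2 for a in age) / n)
sd_fat = sqrt(sum((f - mean_fat) ** 2 for f in fat) / n)

r = cov_sum / (n * sd_age * sd_fat)
print(round(cov_sum, 2), round(r, 3))
```

Because the same divisor N appears in both the numerator scaling and the standard deviations, using sample standard deviations with N − 1 throughout would give the same r.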
