数据结构
数据格式
第一列(uid):用户 id
第二列(dt):登录时间
第三列(status):用户状态
数据示例
1,2021-05-11,1
1,2021-05-12,1
1,2021-05-13,1
1,2021-05-14,1
1,2021-05-15,1
1,2021-05-16,1
1,2021-05-17,1
1,2021-05-18,1
2,2021-05-11,1
2,2021-05-12,1
2,2021-05-13,0
2,2021-05-14,1
2,2021-05-15,1
2,2021-05-16,0
创建表
create table ulogin(
uid int, dt date, status int )
row format delimited fields terminated by ',' stored as textfile
location '/user/hive/warehouse/sca.db/ulogin/';
加载数据
load data local inpath '/home/hadoop/shell/data/userlogin.txt' into table ulogin;
具体实现
使用 row_number()函数在组内给每行数据编号
使用的 row_number 给日期排序
select uid, dt, row_number() over (partition by uid order by dt) gid from ulogin where
status=1;
具体实现
日期-rownum = gid,得到结果可以作为连续登录的依据
通过 date_sub 函数来获取连续登录的日期
select uid, dt, date_sub(dt, row_number() over (partition by uid order by dt)) gid from ulogin
where status=1;
具体实现
根据求得的 gid 作为分组条件,统计最终结果
通过 group by count having 分组统计,即可获取连续登入 7 日的用户
select uid,count(gid) from (select uid, dt, date_sub(dt, row_number() over (partition by uid
order by dt)) gid from ulogin where status=1) as u group by uid,gid having count(gid)>=7;