
Category Archives: BigData
Setting Up a CDH6 Environment on Ubuntu 18 (03)
1. Make sure cdh01 can reach cdh02 and cdh03 over SSH
# userid is the same user that has passwordless sudo
ssh -l userid cdh02
ssh -l userid cdh03
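If key-based login is not set up yet, a minimal sketch (assuming the same userid exists on every node and OpenSSH is installed) is:
# generate a key pair on cdh01 if there is none yet
ssh-keygen -t rsa -b 4096
# push the public key to the other nodes
ssh-copy-id userid@cdh02
ssh-copy-id userid@cdh03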
2. Open the web UI in a browser (everything from here on is done through the UI)
http://172.16.172.101:7180
Username: admin
Password: admin
3. Follow the setup wizard to create a new Cluster
Have cloudera-manager-agent installed on 172.16.172.101-172.16.172.103
4. Follow the wizard to select the services you want to install
When installing, take care to assign roles sensibly, i.e. spread the memory load across the hosts
5. Install, in order:
hdfs
zookeeper
hbase
yarn
hive
spark
6. Installation complete
PS:
1. If you hit a "JDBC driver not found" error:
sudo apt-get install libmysql-java
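The libmysql-java package normally drops the connector under /usr/share/java, which is where Cloudera Manager looks for it; a quick check (the exact jar name is an assumption, verify on your host):
# confirm the driver jar is present where Cloudera Manager expects it
ls -l /usr/share/java/mysql-connector-java.jar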
Setting Up a CDH6 Environment on Ubuntu 18 (02)
1. Install on cdh01
# add the Cloudera repository
wget https://archive.cloudera.com/cm6/6.3.0/ubuntu1804/apt/archive.key
sudo apt-key add archive.key
wget https://archive.cloudera.com/cm6/6.3.0/ubuntu1804/apt/cloudera-manager.list
sudo mv cloudera-manager.list /etc/apt/sources.list.d/
# refresh the package index
sudo apt-get update
# install JDK 8
sudo apt-get install openjdk-8-jdk
# install Cloudera Manager
sudo apt-get install cloudera-manager-daemons cloudera-manager-agent cloudera-manager-server
2. Install and configure MySQL
2.1. Install MySQL
sudo apt-get install mysql-server mysql-client libmysqlclient-dev libmysql-java
2.2. Stop MySQL
sudo service mysql stop
2.3. Remove files that are no longer needed
sudo rm /var/lib/mysql/ib_logfile0
sudo rm /var/lib/mysql/ib_logfile1
2.4. Edit the configuration file
sudo vi /etc/mysql/mysql.conf.d/mysqld.cnf
# add or modify the following settings
[mysqld]
transaction-isolation = READ-COMMITTED
max_allowed_packet = 32M
max_connections = 300
innodb_flush_method = O_DIRECT
2.5. Start MySQL
sudo service mysql start
2.6. Initialize MySQL (secure installation)
sudo mysql_secure_installation
3. Create the databases and grant privileges
sudo mysql -uroot -p
-- create the databases
-- Cloudera Manager Server
CREATE DATABASE scm DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- Activity Monitor
CREATE DATABASE amon DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- Reports Manager
CREATE DATABASE rman DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- Hue
CREATE DATABASE hue DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- Hive Metastore Server
CREATE DATABASE hive DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- Sentry Server
CREATE DATABASE sentry DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- Cloudera Navigator Audit Server
CREATE DATABASE nav DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- Cloudera Navigator Metadata Server
CREATE DATABASE navms DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- Oozie
CREATE DATABASE oozie DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
-- create the users and grant privileges
GRANT ALL ON scm.* TO 'scm'@'%' IDENTIFIED BY 'scm123456';
GRANT ALL ON amon.* TO 'amon'@'%' IDENTIFIED BY 'amon123456';
GRANT ALL ON rman.* TO 'rman'@'%' IDENTIFIED BY 'rman123456';
GRANT ALL ON hue.* TO 'hue'@'%' IDENTIFIED BY 'hue123456';
GRANT ALL ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive123456';
GRANT ALL ON sentry.* TO 'sentry'@'%' IDENTIFIED BY 'sentry123456';
GRANT ALL ON nav.* TO 'nav'@'%' IDENTIFIED BY 'nav123456';
GRANT ALL ON navms.* TO 'navms'@'%' IDENTIFIED BY 'navms123456';
GRANT ALL ON oozie.* TO 'oozie'@'%' IDENTIFIED BY 'oozie123456';
4. Initialize the databases
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql scm scm scm123456
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql amon amon amon123456
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql rman rman rman123456
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql hue hue hue123456
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql hive hive hive123456
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql sentry sentry sentry123456
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql nav nav nav123456
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql navms navms navms123456
sudo /opt/cloudera/cm/schema/scm_prepare_database.sh mysql oozie oozie oozie123456
5. Start the server
# start cloudera-scm-server
sudo systemctl start cloudera-scm-server
# follow the startup log and wait for Jetty to come up
sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
6. Log in
Open in a browser:
http://172.16.172.101:7180
Username: admin
Password: admin
Setting Up a CDH6 Environment on Ubuntu 18 (01)
1. Prerequisites
VirtualBox 6
Ubuntu 18
Cloudera CDH 6.3
2. Install Ubuntu 18 in a virtual machine, configured with:
1 CPU
4 GB RAM
300 GB disk
Two network adapters: one Host-Only, one NAT
3. Clone the virtual machine into three copies
If you copy the machine by hand, remember to change the disk UUID, the VM UUID, and the NIC MAC address, as sketched below
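A minimal sketch of those fixes with VBoxManage (the disk file and VM names are placeholders for your own copies):
# give the copied disk file a fresh UUID
VBoxManage internalcommands sethduuid cdh02.vdi
# let VirtualBox generate a new MAC address for the copy's first adapter
VBoxManage modifyvm "cdh02" --macaddress1 auto
# alternatively, cloning through VirtualBox avoids the UUID fixes altogether
VBoxManage clonevm "cdh01" --name "cdh02" --register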
4. Set the IP address, hostname, and hosts file, as sketched after the table
| Hostname | Host-Only IP |
| --- | --- |
| cdh01 | 172.16.172.101 |
| cdh02 | 172.16.172.102 |
| cdh03 | 172.16.172.103 |
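A minimal sketch for cdh01, to be repeated on each node with its own name and address (the Host-Only interface name enp0s8 and the netplan file name are assumptions; check them with ip addr and ls /etc/netplan):
# hostname
sudo hostnamectl set-hostname cdh01
# static address for the Host-Only adapter (netplan on Ubuntu 18)
sudo tee /etc/netplan/02-hostonly.yaml <<'EOF'
network:
  version: 2
  ethernets:
    enp0s8:
      addresses: [172.16.172.101/24]
EOF
sudo netplan apply
# name resolution for all three nodes
sudo tee -a /etc/hosts <<'EOF'
172.16.172.101 cdh01
172.16.172.102 cdh02
172.16.172.103 cdh03
EOF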
5. Allow passwordless sudo, at least on cdh02 and cdh03
# edit /etc/sudoers and add:
userid ALL=(ALL:ALL) NOPASSWD: ALL
Setting Up Redash (Ubuntu)
1. Download the setup script
wget -O bootstrap.sh https://raw.githubusercontent.com/getredash/redash/master/setup/ubuntu/bootstrap.sh
2. Run the script
chmod +x bootstrap.sh
sudo ./bootstrap.sh
3. Once the script finishes successfully, just open the nginx site
http://ip:80
which is in fact just a proxy for
http://localhost:5000
4. Common problems
If a download fails during the run, download the file manually, point the script at the local copy, and rerun the script.
While running the script I hit a "missing schema" error; dropping the redash database and the redash user and rerunning the script fixed it, as sketched below.
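A minimal sketch of that cleanup, assuming the bootstrap script created the redash database and role in the local PostgreSQL instance:
# drop the objects left over from the failed run, then rerun the script
sudo -u postgres psql -c "DROP DATABASE IF EXISTS redash;"
sudo -u postgres psql -c "DROP USER IF EXISTS redash;"
sudo ./bootstrap.sh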
Setting Up Metabase (Docker)
1. First run
docker run -d -p 3000:3000 --name metabase metabase/metabase
2. Subsequent runs
docker start metabase
3. Just open http://ip:3000
Setting Up Superset (Ubuntu)
1. Install dependencies
sudo apt-get install build-essential libssl-dev libffi-dev python-dev python-pip libsasl2-dev libldap2-dev
pip install virtualenv
2. Use a virtual environment
# create the virtualenv
virtualenv supersetenv
# activate it
source supersetenv/bin/activate
3. Install
# upgrade the installation tools, then install the server
pip install --upgrade setuptools pip
pip install superset
# create an admin user
fabmanager create-admin --app superset
# reset the admin password
#fabmanager reset-password admin --app superset
# upgrade the database
superset db upgrade
# load the example data
superset load_examples
# initialize
superset init
# run the server
superset runserver
4. At this point, just visit http://ip:8088 to log in
5. Install database drivers
# mysql
apt-get install libmysqlclient-dev
pip install mysqlclient
# oracle
#pip install cx_Oracle
# mssql
#pip install pymssql
6. Shut down
# stop the server with Ctrl+C
# leave the virtualenv
deactivate
# remove the virtualenv (plain virtualenv, so just delete the directory)
#rm -rf supersetenv
7. Upgrade
# activate the virtualenv
source supersetenv/bin/activate
# upgrade the server
pip install superset --upgrade
# upgrade the database
superset db upgrade
# initialize
superset init
# leave the virtualenv
deactivate
Setting Up a Spark Environment (05)
This post covers Spark SQL: specifically, how to load data from MySQL.
// the MySQL JDBC driver has already been copied into Spark's jars directory
// build the DataFrame
val jdbcDF = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost:3307").option("dbtable", "hive.TBLS").option("user", "hive").option("password", "hive").load()
// schema
jdbcDF.schema
// count
jdbcDF.count()
// show
jdbcDF.show()
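As an alternative to copying the driver into Spark's jars directory, the connector jar can be handed to spark-shell at launch (the path below is a placeholder for wherever the MySQL connector actually lives):
# make the JDBC driver available to the shell without touching the jars directory
bin/spark-shell --jars /path/to/mysql-connector-java.jar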
Setting Up a Spark Environment (04)
This post covers Spark SQL operations.
1. Working with data through the DataFrame API
// load JSON data
val df = spark.read.json("/usr/hadoop/person.json")
// load CSV data
//val df = spark.read.csv("/usr/hadoop/person.csv")
// show the first 20 rows
df.show()
// inspect the schema
df.printSchema()
// select one column
df.select("NAME").show()
// filter rows by condition
df.filter($"BALANCE_COST" < 10 && $"BALANCE_COST" > 1).show()
// group and count
df.groupBy("SEX_CODE").count().show()
2. Working with data through SQL
// create a temporary view
df.createOrReplaceTempView("person")
// query the data
spark.sql("SELECT * FROM person").show()
// count the rows
spark.sql("SELECT * FROM person").count()
// select with a condition
spark.sql("SELECT * FROM person WHERE BALANCE_COST<10 and BALANCE_COST>1 order by BALANCE_COST").show()
3. Converting to a Dataset
// convert to a Dataset
case class PERSONC(PATIENT_NO : String,NAME : String,SEX_CODE : String,BIRTHDATE : String,BALANCE_CODE : String)
var personDS = spark.read.json("/usr/hadoop/person.json").as[PERSONC]
4. Mixing SQL with map/reduce
personDS.select("BALANCE_COST").map(row=>if(row(0)==null) 0.0 else (row(0)+"").toDouble).reduce((a,b)=>if(a>b) a else b)
spark.sql("select BALANCE_COST from person").map(row=>if(row(0)==null) 0.0 else (row(0)+"").toDouble).reduce((a,b)=>if(a>b) a else b)
5. Mapping rows to a case class
// the Person case class is not defined elsewhere in the post; the definition below is
// an assumption, with field names taken from the columns used above
case class Person(PATIENT_NO: String, NAME: String, SEX_CODE: String, BIRTHDATE: String, BALANCE_COST: Double)
val personRDD = spark.sparkContext.textFile("/usr/hadoop/person.txt")
val persons = personRDD.map(_.split(",")).map(attributes => Person(attributes(0), attributes(1), attributes(2), attributes(3), attributes(4).toDouble))
6. Defining a custom schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// load the data
val personRDD = spark.sparkContext.textFile("/usr/hadoop/person.txt")
// convert to org.apache.spark.sql.Row
val rowRDD = personRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1), attributes(2), attributes(3), attributes(4).replace("\"","").toDouble))
// define the new schema
val personSchema = StructType(List(StructField("PatientNum",StringType,nullable = true), StructField("Name",StringType,nullable = true), StructField("SexCode",StringType,nullable = true), StructField("BirthDate",StringType,nullable = true), StructField("BalanceCode",DoubleType,nullable = true)))
// build the new DataFrame
val personDF = spark.createDataFrame(rowRDD, personSchema)
// use the DataFrame
personDF.select("PatientNum").show()
Setting Up a Spark Environment (03)
The previous post covered integrating Spark with Hadoop; this one covers integrating Spark with HBase.
1. Get the HBase classpath
# remove the netty and jetty jars from this classpath first, otherwise they conflict with Spark's own jars (one way to do that is sketched below)
HBASE_PATH=`/home/hadoop/Deploy/hbase-1.1.2/bin/hbase classpath`
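A minimal sketch of filtering those jars out of the classpath string (the grep pattern is an assumption; adjust it to whatever hbase classpath actually prints on your machine):
# split on ':', drop entries containing netty or jetty, re-join with ':'
HBASE_PATH=$(/home/hadoop/Deploy/hbase-1.1.2/bin/hbase classpath | tr ':' '\n' | grep -viE 'netty|jetty' | paste -sd: -)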
2. Start Spark
bin/spark-shell --driver-class-path $HBASE_PATH
3. Run a few simple operations
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "inpatient_hb")
val admin = new HBaseAdmin(conf)
admin.isTableAvailable("inpatient_hb")
res1: Boolean = true
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
hBaseRDD.count()
2017-01-03 20:46:29,854 INFO [main] scheduler.DAGScheduler (Logging.scala:logInfo(58)) - Job 0 finished: count at <console>:36, took 23.170739 s
res2: Long = 115077