Spark Commands Cheat Sheet

Most-used commands

Connection

Start Spark shell
spark-shell

Start PySpark shell
pyspark

Start Spark with specific configuration
spark-shell --conf spark.executor.memory=2g

Submit a Spark application
spark-submit --class org.example.MyApp --master yarn app.jar

Start Spark with specified master
spark-shell --master spark://host:port
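
The same settings can also be applied programmatically. Below is a minimal sketch of building a PySpark SparkSession with an explicit master and executor memory; the application name, master URL, and memory value are illustrative, not prescribed settings.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-app")                    # hypothetical application name
    .master("local[*]")                        # or "spark://host:port", "yarn", etc.
    .config("spark.executor.memory", "2g")     # same setting as --conf above
    .getOrCreate()
)

print(spark.version)                           # confirm the session is up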

Monitoring

View Spark UI
http://localhost:4040 (or cluster-specific URL)

View Spark History Server
http://spark-history-server:18080

Monitor active applications
yarn application -list (when using YARN)

View Spark logs
yarn logs -applicationId <applicationId> (when using YARN)

Check Spark driver logs
cat /var/log/spark/spark-driver.log (location varies by deployment)
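
The values these monitoring commands expect are available from the running application itself. A small sketch, assuming an existing SparkSession named spark, that prints the application ID used by the yarn commands above and the Spark UI URL:

app_id = spark.sparkContext.applicationId   # on YARN, the ID expected by `yarn logs -applicationId`
ui_url = spark.sparkContext.uiWebUrl        # usually http://<driver-host>:4040

print(f"Application ID: {app_id}")
print(f"Spark UI:       {ui_url}")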

Performance Tuning

Set number of executors
--num-executors

Set executor memory
--executor-memory

Set executor cores
--executor-cores

Set driver memory
--driver-memory

Enable dynamic allocation
--conf spark.dynamicAllocation.enabled=true
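
Each of these flags has a configuration-property equivalent, which is useful when a SparkSession is configured in code rather than on the command line. A minimal sketch with illustrative values; note that driver memory normally has to be set via --driver-memory or spark-defaults.conf, because the driver JVM is already running by the time this code executes.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-example")                          # hypothetical name
    .config("spark.executor.instances", "4")            # --num-executors 4
    .config("spark.executor.memory", "4g")              # --executor-memory 4g
    .config("spark.executor.cores", "2")                # --executor-cores 2
    .config("spark.dynamicAllocation.enabled", "true")  # --conf spark.dynamicAllocation.enabled=true
    .getOrCreate()
)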

Runtime Metrics

Enable metrics collection
--conf spark.metrics.conf=<path/to/metrics.properties>

Log metrics to Graphite
--conf spark.metrics.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink

Connect to JMX
--conf spark.metrics.conf=metrics.properties (with JMX settings)

Enable event logging
--conf spark.eventLog.enabled=true

Set event log directory
--conf spark.eventLog.dir=<log directory URI>
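
Event logging (which feeds the History Server) can also be enabled when the session is created. A minimal sketch; the directory is an assumption and should match whatever path your History Server reads.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("event-log-example")                              # hypothetical name
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")  # assumed log directory
    .getOrCreate()
)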

SQL Commands

Create a temporary view
dataFrame.createOrReplaceTempView("viewName")

Run a SQL query
spark.sql("SELECT * FROM viewName")

Show tables
spark.sql("SHOW TABLES").show()

Describe table
spark.sql("DESCRIBE viewName").show()

Cache table
spark.sql("CACHE TABLE viewName").show()

Debugging

Set log level
sc.setLogLevel("ERROR")

Print a DataFrame's execution plan
df.explain()

Get execution plan
spark.sql("EXPLAIN SELECT * FROM viewName").show()

Count records in RDD/DataFrame
df.count()

Show DataFrame schema
df.printSchema()
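
A short sketch combining these debugging calls on an illustrative DataFrame, assuming an existing SparkSession named spark:

spark.sparkContext.setLogLevel("ERROR")     # silence INFO/WARN noise

df = spark.range(10)                        # illustrative DataFrame with a single `id` column

df.printSchema()                            # inspect the schema
print(df.count())                           # materialise the data and count rows
df.explain()                                # print the physical execution plan
df.createOrReplaceTempView("numbers")       # hypothetical view name for the SQL form
spark.sql("EXPLAIN SELECT * FROM numbers").show(truncate=False)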